Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-30 Thread Chantal Ackermann
Hi Ahmed,

fields that are empty do not impact the index. It's different from a
database.
I have text fields for different languages and per document there is
always only one of the languages set (the text fields for the other
languages are empty/not set). It all works very well and fast.

I wonder more about what you describe as "unrelated data" - why would
you want to put unrelated data into a single index? If you want to
search on all the data and return mixed results, there surely must be
some kind of relation between the documents?

Chantal

On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote:
 I understand (and it's straightforward) when you want to create an index for
 something simple like Products.
 
 But how do you go about creating a Solr index when you have data coming from
 10-15 database tables, and the tables have unrelated data?
 
 The issue is then you would have many 'columns' in your index, and they will
 be NULL for much of the data since you are trying to shove 15 db tables into
 a single Solr/Lucene index.
 
 
 This must be a common problem, what are the potential solutions?





Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-30 Thread Gora Mohanty
On Thu, 29 Jul 2010 15:33:42 -0400
S Ahmed sahmed1...@gmail.com wrote:

 I understand (and it's straightforward) when you want to create an
 index for something simple like Products.
 
 But how do you go about creating a Solr index when you have data
 coming from 10-15 database tables, and the tables have unrelated
 data?
 
 The issue is then you would have many 'columns' in your index,
 and they will be NULL for much of the data since you are trying
 to shove 15 db tables into a single Solr/Lucene index.
[...]

This should not be a problem. With the Solr DataImportHandler, any
NULL values for a given record will simply be ignored, i.e., the
Solr index for that document will not contain an entry for that
field.
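
As an illustration, here is a minimal data-config.xml sketch (driver, URL
and table/column names are made up for the example). A row whose MANU
column is NULL simply yields a document without the manu field:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/db"/>
  <document>
    <entity name="item" query="select ID, NAME, MANU from item">
      <field column="ID" name="id"/>
      <field column="NAME" name="name"/>
      <field column="MANU" name="manu"/>
    </entity>
  </document>
</dataConfig>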

Regards,
Gora


Re: Get unique values

2010-07-30 Thread Rafal Bluszcz Zawadzki
2010/7/28 Rafal Bluszcz Zawadzki ra...@headnet.dk

 Hi,

 In my schema I have (inter alia) the fields CollectionID and CollectionName.
  These two values always match together, which means that for every value of
 CollectionID there is a matching value of CollectionName.

 I am interested in a query which allows me to get unique values of
 CollectionID with matching CollectionNames (the rest of the fields are of no
 interest to me in this query).


Finally I decided to store the values in one indexed field (Collections), and
the query below did the trick:

q=*:*&rows=0&facet=on&facet.field=Collections
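
For reference, the facet_counts section of the response then looks roughly
like this (the values and counts below are made up); each unique value of
the Collections field is returned with its document count:

<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="Collections">
      <int name="1|First Collection">42</int>
      <int name="2|Second Collection">17</int>
    </lst>
  </lst>
</lst>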

-- 
Rafał Zawadzki
http://dev.bluszcz.net/


Re: wildcard and proximity searches

2010-07-30 Thread Ahmet Arslan

 What approach should I use to perform wildcard and proximity
 searches?

 Like: "solr mail*"~10

 For getting docs where "solr" is within 10 words of "mailing", for
 instance?


You can do it with the plug-in described here:
https://issues.apache.org/jira/browse/SOLR-1604
It would be great if you test it and give feedback.



  


Solr and Lucene in South Africa

2010-07-30 Thread Jaco Olivier
Hi to all Solr/Lucene Users...

Our team had a discussion today regarding the Solr/Lucene community closer to 
home.
I am hereby putting out an SOS to all Solr/Lucene users in the South African 
market and wish to organize a meet-up (or user support group) if at all 
possible.
It would be great to share some triumphs and pitfalls that were experienced.

* Sorry for hogging the User Mailing list with a non-technical question, but I think 
this is the easiest way to get it done :)

Jaco Olivier
Web Specialist

Please note: This email and its content are subject to the disclaimer as 
displayed at the following link 
http://www.sabinet.co.za/?page=e-mail-disclaimer. Should you not have Web 
access, send an email to i...@sabinet.co.za and a copy will be sent to you


RE: wildcard and proximity searches

2010-07-30 Thread Frederico Azeiteiro
Hi Ahmet,

Thank you. I'll be happy to test it if I manage to install it ok. I'm a
newbie at Solr, but I'm going to try the instructions in the thread to
load it.

Some other doubts I have about wildcard searches:

a) I think wildcard search is case sensitive by default? Is there a
way to make it case insensitive?

b) I have about 6000 queries to run (they could have wildcards, proximity
searches or just normal queries). I discovered that the normal query
type doesn't work with wildcards, and so I'm using the Filter Query to
query these. Is this field slower? I notice that using this field my
queries are much slower (I have some queries like *word* or *word1* or
*word2* that take about one minute to perform).
Is there a way to optimize these queries (without removing the wildcards
:))?

c) Is there a way to do phrase queries with wildcards? Like "This solr*
mail*"? Because in the tests I made, when using quotes I think the
wildcards are ignored.

d) How exactly do the pf (phrase fields) and ps (phrase slop)
parameters work, and what's the difference from proximity
searches (e.g. "word word2"~20)?

Sorry for the long email and thank you for your help...
Frederico

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: sexta-feira, 30 de Julho de 2010 10:57
To: solr-user@lucene.apache.org
Subject: Re: wildcard and proximity searches


 What approach should I use to perform wildcard and proximity
 searches?

 Like: "solr mail*"~10

 For getting docs where "solr" is within 10 words of "mailing", for
 instance?


You can do it with the plug-in described here:
https://issues.apache.org/jira/browse/SOLR-1604
It would be great if you test it and give feedback.



  


RE: wildcard and proximity searches

2010-07-30 Thread Ahmet Arslan
 a) I think wildcard search is by default case sensitive?
 Is there a
 way to make case insensitive?

Wildcard searches are not analyzed. For a case-insensitive search you can
lowercase query terms at the client side (while using LowerCaseFilter at index
time), e.g. Mail* becomes mail*.
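
For reference, a minimal sketch of such a field type in schema.xml (the
name is arbitrary); the LowerCaseFilterFactory in the analyzer is what
makes client-side lowercased wildcard terms line up with the indexed terms:

<fieldType name="text_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>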

 
 I discovered that the normal query type doesn't work with wildcards
 and so I'm using the Filter Query to query these. 

I don't understand this. Wildcard search works with the q parameter if you are 
asking that: q=mail*

 field my
 queries are much slower (I have some queries like *word* or
 *word1* or
 *word2* that take about one minute to perform)
 Is there a way to optimize these queries (without removing
 the wildcards
 :))?

It is normal for a leading wildcard search to be slow. Using 
ReversedWildcardFilterFactory at index time can speed it up.

But it is unusual to use both leading and trailing * operators. Why are you 
doing this? 
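
A sketch of what that looks like in schema.xml (the field type name and
filter parameters are only an example, modelled on the Solr 1.4 example
schema); the filter indexes reversed forms of the terms so that
leading-wildcard queries can be rewritten into fast trailing-wildcard lookups:

<fieldType name="text_rev" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>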

 c)Is there a way to do phrase queries with wildcards? Like
 This solr*
 mail*? Because the tests I made, when using quotes I think
 the wildcards are ignored.

By default it is not supported. With SOLR-1604 it is possible.

 d)How exactly works the pf (phrase fields) and ps (phrase
 slop)
 parameters and what's the difference for the proximity
 searches (ex:
 word word2~20)?

These parameters are specific to the dismax query parser. 
http://wiki.apache.org/solr/DisMaxQParserPlugin
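
As an illustration (the handler name and field names are hypothetical), pf
and ps are typically set as dismax defaults in solrconfig.xml; pf boosts
documents in which the whole query appears as a phrase in the listed
fields, within a slop of ps positions:

<requestHandler name="/dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">body</str>
    <str name="pf">body^2.0</str>
    <str name="ps">20</str>
  </lst>
</requestHandler>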



  


Re: Can't find org.apache.solr.client.solrj.embedded

2010-07-30 Thread Uwe Reh

Sorry,

I had inspected the ...core.jar three times without recognizing the 
package. I was really blind. =8-)


thanks
Uwe

On 26.07.2010 20:48, Chris Hostetter wrote:

: where is a Jar, containing org.apache.solr.client.solrj.embedded?

Classes in the embedded package are useless w/o the rest of the Solr
internal core classes, so they are included directly in the
apache-solr-core-1.4.1.jar.

-Hoss



RE: wildcard and proximity searches

2010-07-30 Thread Frederico Azeiteiro
Hi Ahmet,

 a) I think wildcard search is by default case sensitive?
 Is there a
 way to make case insensitive?
Wildcard searches are not analyzed. For a case-insensitive search you can
lowercase query terms at the client side (while using LowerCaseFilter at
index time), e.g. Mail* becomes mail*.
 
 I discovered that the normal query type doesn't work with wildcards
 and so I'm using the Filter Query to query these. 
I don't understand this. Wildcard search works with the q parameter if you
are asking that: q=mail*

For the 2 points above, my bad. I'm already using the LowerCaseFilter
but I was not lowercasing the query with wildcards (the others are lowered
by the analyser). So it's working fine now! In my tests yesterday I was
probably testing q=Mail* and fq=mail* (and didn't notice the
difference) and had read somewhere that it wasn't possible (probably on an
older Solr version), so I came to the wrong conclusion that it wasn't
working.

But it is unusual to use both leading and trailing * operators. Why are
you doing this?

Yes I know, but I have a few queries that need this. I'll try the
ReversedWildcardFilterFactory. 

By default it is not supported. With SOLR-1604 it is possible.
Ok then. I guess SOLR-1604 is the answer for most of my problems. I'm
going to give it a try and then I'll share some feedback.

Thanks for your help and sorry for my newbie confusions. :)
Frederico

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: sexta-feira, 30 de Julho de 2010 12:09
To: solr-user@lucene.apache.org
Subject: RE: wildcard and proximity searches

 a) I think wildcard search is by default case sensitive?
 Is there a
 way to make case insensitive?

Wildcard searches are not analyzed. For a case-insensitive search you can
lowercase query terms at the client side (while using LowerCaseFilter at
index time), e.g. Mail* becomes mail*.

 
 I discovered that the normal query type doesn't work with wildcards
 and so I'm using the Filter Query to query these. 

I don't understand this. Wildcard search works with the q parameter if you
are asking that: q=mail*

 field my
 queries are much slower (I have some queries like *word* or
 *word1* or
 *word2* that take about one minute to perform)
 Is there a way to optimize these queries (without removing
 the wildcards
 :))?

It is normal for a leading wildcard search to be slow. Using
ReversedWildcardFilterFactory at index time can speed it up.

But it is unusual to use both leading and trailing * operators. Why are
you doing this? 

 c)Is there a way to do phrase queries with wildcards? Like
 This solr*
 mail*? Because the tests I made, when using quotes I think
 the wildcards are ignored.

By default it is not supported. With SOLR-1604 it is possible.

 d)How exactly works the pf (phrase fields) and ps (phrase
 slop)
 parameters and what's the difference for the proximity
 searches (ex:
 word word2~20)?

These parameters are specific to the dismax query parser. 
http://wiki.apache.org/solr/DisMaxQParserPlugin



  


Re: Customize order field list ???

2010-07-30 Thread kenf_nc

I believe they come back alphabetically sorted (not sure if this is language
specific or not), so a quick way might be to change the name from createdate
to zz_createdate or something like that. 

Generally with XML you should not worry about order, however. It's
usually a sign of a design issue somewhere if the order of the fields
matters.


Re: Using Solr to perform range queries in Dspace

2010-07-30 Thread Mckeane



Thank you for your reply.

 This is the background to what I am trying to achieve. I want to be able
to perform a search across numeric index ranges and get the results in
logical ordering instead of lexicographic ordering, using DSpace. Currently,
if I do a search using the query var:[10 TO 50], and there are any values
with index 1000, 100 or a float, say 10.x, the result returns all these values
plus any other values that fall within the lexicographic range. A similar
result is returned if I enter any other numeric data type. In Solr I see
that TrieDoubleField, TrieLongField, SortableIntField, etc. can be used to
perform numeric range queries and return the results in logical ordering. I
was thinking about using either the TrieField classes for int, double, etc.
and/or the SortableIntField, SortableLongField classes defined in Solr to
perform range query searches in DSpace.
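
For reference, a sketch of the relevant schema.xml pieces (the type and
field names here are arbitrary); with a trie type, var:[10 TO 50] matches
by numeric value instead of lexicographic order:

<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>

<field name="var" type="tint" indexed="true" stored="true"/>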




-Mckeane


Re: Solr searching performance issues, using large documents

2010-07-30 Thread Li Li
Highlighting time is mainly spent on getting the field which you want
to highlight and tokenizing this field (if you don't store term vectors).
You can check what's wrong.

2010/7/30 Peter Spam ps...@mac.com:
 If I don't do highlighting, it's really fast.  Optimize has no effect.

 -Peter

 On Jul 29, 2010, at 11:54 AM, dc tech wrote:

 Are you storing the entire log file text in SOLR? That's almost 3GB of
 text that you are storing in SOLR. Try to:
 1) Is this first-time performance, or on repeat queries with the same fields?
 2) Optimize the index and test performance again
 3) Index without storing the text and see what the performance looks like.


 On 7/29/10, Peter Spam ps...@mac.com wrote:
 Any ideas?  I've got 5000 documents with an average size of 850k each, and
 it sometimes takes 2 minutes for a query to come back when highlighting is
 turned on!  Help!


 -Pete

 On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:

 From the mailing list archive, Koji wrote:

 1. Provide another field for highlighting and use copyField to copy
 plainText to the highlighting field.

 and Lance wrote:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html

 If you want to highlight field X, doing the
 termOffsets/termPositions/termVectors will make highlighting that field
 faster. You should make a separate field and apply these options to that
 field.

 Now: doing a copyfield adds a value to a multiValued field. For a text
 field, you get a multi-valued text field. You should only copy one value
 to the highlighted field, so just copyField the document to your special
 field. To enforce this, I would add multiValued=false to that field,
 just to avoid mistakes.

 So, all_text should be indexed without the term* attributes, and should
 not be stored. Then your document stored in a separate field that you use
 for highlighting and has the term* attributes.

 I've been experimenting with this, and here's what I've tried:

  <field name="body" type="text_pl" indexed="true" stored="false"
 multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
  <field name="body_all" type="text_pl" indexed="false" stored="true"
 multiValued="true" />
  <copyField source="body" dest="body_all"/>

 ... but it's still very slow (10+ seconds).  Why is it better to have two
 fields (one indexed but not stored, and the other not indexed but stored)
 rather than just one field that's both indexed and stored?


 From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors

 If you aren't always using all the stored fields, then enabling lazy
 field loading can be a huge boon, especially if compressed fields are
 used.

 What does this mean?  How do you load a field lazily?

 Thanks for your time, guys - this has started to become frustrating, since
 it works so well, but is very slow!


 -Pete

 On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:

 Data set: About 4,000 log files (will eventually grow to millions).
 Average log file is 850k.  Largest log file (so far) is about 70MB.

 Problem: When I search for common terms, the query time goes from under
 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I
 disable highlighting, performance improves a lot, but is still slow for
 some queries (7 seconds).  Thanks in advance for any ideas!


 -Peter


 -

 4GB RAM server
 % java -Xms2048M -Xmx3072M -jar start.jar

 -

 schema.xml changes:

  <fieldType name="text_pl" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
 generateNumberParts="0" catenateWords="0" catenateNumbers="0"
 catenateAll="0" splitOnCaseChange="0"/>
    </analyzer>
  </fieldType>

 ...

 <field name="body" type="text_pl" indexed="true" stored="true"
 multiValued="false" termVectors="true" termPositions="true"
 termOffsets="true" />
  <field name="timestamp" type="date" indexed="true" stored="true"
 default="NOW" multiValued="false"/>
 <field name="version" type="string" indexed="true" stored="true"
 multiValued="false"/>
 <field name="device" type="string" indexed="true" stored="true"
 multiValued="false"/>
 <field name="filename" type="string" indexed="true" stored="true"
 multiValued="false"/>
 <field name="filesize" type="long" indexed="true" stored="true"
 multiValued="false"/>
 <field name="pversion" type="int" indexed="true" stored="true"
 multiValued="false"/>
 <field name="first2md5" type="string" indexed="false" stored="true"
 multiValued="false"/>
 <field name="ckey" type="string" indexed="true" stored="true"
 multiValued="false"/>

 ...

 <dynamicField name="*" type="ignored" multiValued="true" />
 <defaultSearchField>body</defaultSearchField>
 <solrQueryParser defaultOperator="AND"/>

 

Re: question about relevance

2010-07-30 Thread Bharat Jain
Hi,
   Thanks a lot for the info and your time. I think field collapsing will work
for us. I looked at https://issues.apache.org/jira/browse/SOLR-236, but
which file should I use for the patch? We use Solr 1.3.

Thanks
Bharat Jain


On Fri, Jul 30, 2010 at 12:53 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : 1. There are user records of type A, B, C etc. (userId field in index is
 : common to all records)
 : 2. A user can have any number of A, B, C etc (e.g. think of A being a
 : language then user can know many languages like french, english, german
 etc)
 : 3. Records are currently stored as a document in index.
 : 4. A given query can match multiple records for the user
 : 5. If for a user more records are matched (e.g. if he knows both french
 and
 : german) then he is more relevant and should come top in UI. This is the
 : reason I wanted to add lucene scores assuming the greater score means
 more
 : relevance.

 if your goal is to get back users from each search, then you should
 probably change your indexing strategy so that each user has a single
 document -- fields like language can be multivalued, etc...

 then a search for language:en language:fr will return users who speak
 english or french, and the ones that speak both will score higher.

 if you really can't change the index structure, then essentially what you
 are looking for is a field collapsing solution on the userId field,
 where you want each collapsed group to get a cumulative score.  i don't
 know if the existing field collapsing patches support this -- if you are
 already willing/capable to do it in the client then that may be the
 simplest thing to support moving forward.

 Adding the scores is certainly one metric you could use -- it's generally
 suspicious to try and imply too much meaning to scores in lucene/solr but
 that's because people typically try to imply broader absolute meaning.  in
 the case of a single query the scores are relative to each other, and adding
 up all the scores for a given userId is approximately what would happen in
 my example above -- except that there is also a coord factor that would
 penalize documents that only match one clause ... it's complicated, but
 as an approximation adding the scores might give you what you are looking
 for -- only you can know for sure based on your specific data.



 -Hoss




Document Boost with Solr Extraction - SolrContentHandler

2010-07-30 Thread jayendra patil
We are using the Solr extract handler (/update/extract) for indexing document
metadata with attachments.
However, the SolrContentHandler doesn't seem to support an index-time document
boost attribute.
Probably document.setDocumentBoost(Float.parseFloat(boost)) is missing.

Regards,
Jayendra


Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-30 Thread S Ahmed
So I have tables like this:

Users
UserSales
UserHistory
UserAddresses
UserNotes
ClientAddress
CalenderEvent
Articles
Blogs

Just seems odd to me, jamming all these tables into a single index.  But I
guess the idea of using a 'type' field to qualify exactly what I am
searching is a good idea, in case I need to filter for only 'articles' or
blogs or contacts etc.

But there might be 50 fields if I do this, no?
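
Something like this is what I have in mind (field names just for
illustration) - shared fields plus a discriminator, and then a filter
query per table:

<field name="id" type="string" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="true"/>

with queries like q=some+words&fq=type:article to search articles only.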



On Fri, Jul 30, 2010 at 4:01 AM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:

 Hi Ahmed,

 fields that are empty do not impact the index. It's different from a
 database.
 I have text fields for different languages and per document there is
 always only one of the languages set (the text fields for the other
  languages are empty/not set). It all works very well and fast.

  I wonder more about what you describe as "unrelated data" - why would
  you want to put unrelated data into a single index? If you want to
  search on all the data and return mixed results, there surely must be
  some kind of relation between the documents?

 Chantal

 On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote:
   I understand (and it's straightforward) when you want to create an index
 for
  something simple like Products.
 
  But how do you go about creating a Solr index when you have data coming
 from
  10-15 database tables, and the tables have unrelated data?
 
  The issue is then you would have many 'columns' in your index, and they
 will
  be NULL for much of the data since you are trying to shove 15 db tables
 into
  a single Solr/Lucene index.
 
 
  This must be a common problem, what are the potential solutions?






Re: Solr searching performance issues, using large documents

2010-07-30 Thread Peter Spam
I do store term vectors:

<field name="body" type="text_pl" indexed="true" stored="true"
multiValued="false" termVectors="true" termPositions="true" termOffsets="true"
/>

-Pete

On Jul 30, 2010, at 7:30 AM, Li Li wrote:

 Highlighting time is mainly spent on getting the field which you want
 to highlight and tokenizing this field (if you don't store term vectors).
 You can check what's wrong.
 
 2010/7/30 Peter Spam ps...@mac.com:
 If I don't do highlighting, it's really fast.  Optimize has no effect.
 
 -Peter
 
 On Jul 29, 2010, at 11:54 AM, dc tech wrote:
 
 Are you storing the entire log file text in SOLR? That's almost 3GB of
 text that you are storing in SOLR. Try to:
 1) Is this first-time performance, or on repeat queries with the same fields?
 2) Optimize the index and test performance again
 3) Index without storing the text and see what the performance looks like.
 
 
 On 7/29/10, Peter Spam ps...@mac.com wrote:
 Any ideas?  I've got 5000 documents with an average size of 850k each, and
 it sometimes takes 2 minutes for a query to come back when highlighting is
 turned on!  Help!
 
 
 -Pete
 
 On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:
 
 From the mailing list archive, Koji wrote:
 
 1. Provide another field for highlighting and use copyField to copy
 plainText to the highlighting field.
 
 and Lance wrote:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html
 
 If you want to highlight field X, doing the
 termOffsets/termPositions/termVectors will make highlighting that field
 faster. You should make a separate field and apply these options to that
 field.
 
 Now: doing a copyfield adds a value to a multiValued field. For a text
 field, you get a multi-valued text field. You should only copy one value
 to the highlighted field, so just copyField the document to your special
 field. To enforce this, I would add multiValued=false to that field,
 just to avoid mistakes.
 
 So, all_text should be indexed without the term* attributes, and should
 not be stored. Then your document stored in a separate field that you use
 for highlighting and has the term* attributes.
 
 I've been experimenting with this, and here's what I've tried:
 
  <field name="body" type="text_pl" indexed="true" stored="false"
 multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
  <field name="body_all" type="text_pl" indexed="false" stored="true"
 multiValued="true" />
  <copyField source="body" dest="body_all"/>
 
 ... but it's still very slow (10+ seconds).  Why is it better to have two
 fields (one indexed but not stored, and the other not indexed but stored)
 rather than just one field that's both indexed and stored?
 
 
 From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors
 
 If you aren't always using all the stored fields, then enabling lazy
 field loading can be a huge boon, especially if compressed fields are
 used.
 
 What does this mean?  How do you load a field lazily?
 
 Thanks for your time, guys - this has started to become frustrating, since
 it works so well, but is very slow!
 
 
 -Pete
 
 On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:
 
 Data set: About 4,000 log files (will eventually grow to millions).
 Average log file is 850k.  Largest log file (so far) is about 70MB.
 
 Problem: When I search for common terms, the query time goes from under
 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I
 disable highlighting, performance improves a lot, but is still slow for
 some queries (7 seconds).  Thanks in advance for any ideas!
 
 
 -Peter
 
 
 -
 
 4GB RAM server
 % java -Xms2048M -Xmx3072M -jar start.jar
 
 -
 
 schema.xml changes:
 
  <fieldType name="text_pl" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
 generateNumberParts="0" catenateWords="0" catenateNumbers="0"
 catenateAll="0" splitOnCaseChange="0"/>
    </analyzer>
  </fieldType>
 
 ...
 
 <field name="body" type="text_pl" indexed="true" stored="true"
 multiValued="false" termVectors="true" termPositions="true"
 termOffsets="true" />
  <field name="timestamp" type="date" indexed="true" stored="true"
 default="NOW" multiValued="false"/>
 <field name="version" type="string" indexed="true" stored="true"
 multiValued="false"/>
 <field name="device" type="string" indexed="true" stored="true"
 multiValued="false"/>
 <field name="filename" type="string" indexed="true" stored="true"
 multiValued="false"/>
 <field name="filesize" type="long" indexed="true" stored="true"
 multiValued="false"/>
 <field name="pversion" type="int" indexed="true" stored="true"
 multiValued="false"/>
 <field name="first2md5" type="string" indexed="false" stored="true"
 multiValued="false"/>
 <field name="ckey" type="string" indexed="true" stored="true"
 multiValued="false"/>
 

Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread John DeRosa
I want to programmatically retrieve the number of indexed documents. I.e., get 
the value of numDocs.

The only two ways I've come up with are searching for *:* and reporting the 
hit count, or sending an Http GET to 
http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for <stat 
name="numDocs"> ... </stat> in the response.

Both seem to be overkill. Is there an easier way to ask SolrIndexSearcher, 
what's numDocs?

(I'm doing this in Python, using Pysolr, if that matters.)

Thanks!



Re: Help with schema design

2010-07-30 Thread Erick Erickson
I'd just index the eventtype, eventby and eventtime as separate fields. Then
queries would be something like eventtype:update AND eventtime:[targettime TO *].

Similarly, for events updated by pramod, the query would be something like:
eventby:pramod AND eventtype:update
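
A sketch of the field definitions (names and types are illustrative); since
a document carries a list of events, the event fields would be multiValued:

<field name="eventtype" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="eventby"   type="string" indexed="true" stored="true" multiValued="true"/>
<field name="eventtime" type="date"   indexed="true" stored="true" multiValued="true"/>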

HTH
Erick

On Wed, Jul 28, 2010 at 11:05 PM, Pramod Goyal pramod.go...@gmail.comwrote:

 Hi,
    I have a use case where I get a document and a list of events that have
 happened on the document. For example:

 First document:
  Some text content
 Events:
  Event Type | Event By | Event Time
  Update     | Pramod   | 06062010 2:30:00
  Update     | Raj      | 06062010 2:30:00
  View       | Rahul    | 07062010 1:30:00


 I would like to support queries like "get all documents with Event Type = ? and
 Event time greater than ?", and also queries like "get all the documents updated
 by Pramod".
 How should i design my schema to support this use case.

 Thanks,
 Regards,
 Pramod Goyal



Re: Solr using 1500 threads - is that normal?

2010-07-30 Thread Erick Erickson
Glad to help. Do be aware that there are several config values that
influence the commit frequency; they might also be relevant.

Best
Erick

On Thu, Jul 29, 2010 at 5:11 AM, Christos Constantinou 
ch...@simpleweb.co.uk wrote:

 Eric,

 Thank you very much for the indicators! I had a closer look at the commit
 intervals and it seems that the application was gradually increasing the
 commits to almost once per second after some time - something that was
 hidden in the massive amount of queries in the log file. I have changed the
 code to use commitWithin rather than commit and everything looks much better
 now. I believe that might have solved the problem, so thanks again.
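
 For reference, this is roughly what the update messages look like now
 (field names are just an example); commitWithin is given in milliseconds
 and lets Solr batch the commits instead of forcing one per request:

 <add commitWithin="10000">
   <doc>
     <field name="id">doc1</field>
     <field name="text">some content</field>
   </doc>
 </add>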

 Christos

 On 29 Jul 2010, at 01:44, Erick Erickson wrote:

  Your commits are very suspect. How often are you making changes to your
  index?
  Do you have autocommit on? Do you commit when updating each document?
  Committing
  too often and consequently firing off warmup queries is the first place
 I'd
  look. But I
  agree with dc tech, 1,500 is way more than I would expect.
 
  Best
  Erick
 
 
 
  On Wed, Jul 28, 2010 at 6:53 AM, Christos Constantinou 
  ch...@simpleweb.co.uk wrote:
 
  Hi,
 
  Solr seems to be crashing after a JVM exception that new threads cannot
 be
  created. I am writing in hope of advice from someone that has
 experienced
  this before. The exception that is causing the problem is:
 
  Exception in thread btpool0-5 java.lang.OutOfMemoryError: unable to
  create new native thread
 
  The memory that is allocated to Solr is 3072MB, which should be enough
  memory for a ~6GB data set. The documents are not big either, they have
  around 10 fields of which only one stores large text ranging between
 1k-50k.
 
  The top command at the time of the crash shows Solr using around 1500
  threads, which I assume is not normal. Could it be that the threads
 are
  crashing one by one and new ones are created to cope with the queries?
 
  In the log file, right after the the exception, there are several
 thousand
  commits before the server stalls completely. Normally, the log file
 would
  report 20-30 document existence queries per second, then 1 commit per
 5-30
  seconds, and some more infrequent faceted document searches on the data.
  However after the exception, there are only commits until the end of the
 log
  file.
 
  I am wondering if anyone has experienced this before or if it is some
 sort
  of known bug from Solr 1.4? Is there a way to increase the details of
 the
  exception in the logfile?
 
  I am attaching the output of a grep Exception command on the logfile.
 
  Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:19:32 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:20:18 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:20:48 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:22:43 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:28:50 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
  Jul 28, 2010 8:33:19 AM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException: Error opening new

Re: Solr Indexing slows down

2010-07-30 Thread Erick Erickson
See the subject about 1500 threads. The first place I'd look is how
often you're committing. If you're committing before the warmup queries
from the previous commit have done their magic, you might be getting
into a death spiral.

HTH
Erick

On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich peat...@yahoo.de wrote:

 Hi,

 I am indexing a Solr 1.4.0 core and committing gets slower and slower.
 Starting from 3-5 seconds for ~200 documents and ending with over 60
 seconds after 800 commits. Then, if I reload the index, it is as fast
 as before! And today I read a similar thread [1] and indeed: if I
 set autowarming for the caches to 0 the slowdown disappears.

 BUT at the same time I would like to offer searching on that core, which
 would be dramatically slowed down (due to no autowarming).

 Does someone know a better solution to avoid index-slow-down?

 Regards,
 Peter.

 [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html



Problems running on tomcat

2010-07-30 Thread Claudio Devecchi
Hi,

I'm new with solr and I'm doing my first installation under tomcat, I
followed the documentation on link (
http://wiki.apache.org/solr/SolrTomcat#Installing_Tomcat_6) but there are
some problems.
The http://localhost:8080/solr/admin page works fine, but in some cases, for
example when viewing my schema.xml from the admin console, the error below
happens: HTTP Status 404 - /solr/admin/file/index.jsp. Has somebody already
seen this? Is there some trick to it?

Tks

-- 
Claudio Devecchi


Re: Solr Indexing slows down

2010-07-30 Thread Peter Karich
Hi Erick!

thanks for the response!
I will answer your questions ;-)

 How often are you making changes to your index?

Every 30-60 seconds. Too heavy?


 Do you have autocommit on?

No.


 Do you commit when updating each document?

No. I commit after a batch update of 200 documents


 Committing too often and consequently firing off warmup queries is the first 
 place I'd look.

Why is committing firing warmup queries? Is there any documentation about
this subject?
How can I be sure that the previous commit has done its magic?

 there are several config values that influence the commit frequency


I now know the autowarm and the mergeFactor config. What else? Is this
documentation complete:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ?

Regards,
Peter.

 See the subject about 1500 threads. The first place I'd look is how
 often you're committing. If you're committing before the warmup queries
 from the previous commit have done their magic, you might be getting
 into a death spiral.

 HTH
 Erick

 On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich peat...@yahoo.de wrote:

   
 Hi,

  I am indexing a Solr 1.4.0 core and committing gets slower and slower.
  Starting from 3-5 seconds for ~200 documents and ending with over 60
  seconds after 800 commits. Then, if I reload the index, it is as fast
  as before! And today I read a similar thread [1] and indeed: if I
  set autowarming for the caches to 0 the slowdown disappears.

 BUT at the same time I would like to offer searching on that core, which
 would be dramatically slowed down (due to no autowarming).

 Does someone know a better solution to avoid index-slow-down?

 Regards,
 Peter.

 [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html

 


Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread Peter Karich
Both approaches are ok, I think (although I don't know the Python API).
BTW: if you query q=*:*, then add rows=0 to avoid some traffic.

Regards,
Peter.

 I want to programmatically retrieve the number of indexed documents. I.e., 
 get the value of numDocs.

 The only two ways I've come up with are searching for *:* and reporting the 
 hit count, or sending an Http GET to 
 http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for <stat
 name="numDocs"> ... </stat> in the response.

 Both seem to be overkill. Is there an easier way to ask SolrIndexSearcher, 
 what's numDocs?

 (I'm doing this in Python, using Pysolr, if that matters.)

 Thanks!


Re: Solr searching performance issues, using large documents

2010-07-30 Thread Peter Karich
Hi Peter :-),

did you already try other values for

hl.maxAnalyzedChars=2147483647

? Also, regular expression highlighting is more expensive, I think.
What does the 'fuzzy' variable mean? If you use this to query via
~someTerm instead of someTerm,
then you should try the trunk of Solr, which is a lot faster for fuzzy and
other wildcard searches.

Regards,
Peter.
 
 Data set: About 4,000 log files (will eventually grow to millions).  Average 
 log file is 850k.  Largest log file (so far) is about 70MB.

 Problem: When I search for common terms, the query time goes from under 2-3 
 seconds to about 60 seconds.  TermVectors etc are enabled.  When I disable 
 highlighting, performance improves a lot, but is still slow for some queries 
 (7 seconds).  Thanks in advance for any ideas!


 -Peter


 -

 4GB RAM server
 % java -Xms2048M -Xmx3072M -jar start.jar

 -

 schema.xml changes:

 <fieldType name="text_pl" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
 generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"
 splitOnCaseChange="0"/>
   </analyzer>
 </fieldType>

 ...

   <field name="body" type="text_pl" indexed="true" stored="true"
 multiValued="false" termVectors="true" termPositions="true"
 termOffsets="true" />
 <field name="timestamp" type="date" indexed="true" stored="true"
 default="NOW" multiValued="false"/>
 <field name="version" type="string" indexed="true" stored="true"
 multiValued="false"/>
 <field name="device" type="string" indexed="true" stored="true"
 multiValued="false"/>
 <field name="filename" type="string" indexed="true" stored="true"
 multiValued="false"/>
 <field name="filesize" type="long" indexed="true" stored="true"
 multiValued="false"/>
 <field name="pversion" type="int" indexed="true" stored="true"
 multiValued="false"/>
 <field name="first2md5" type="string" indexed="false" stored="true"
 multiValued="false"/>
 <field name="ckey" type="string" indexed="true" stored="true"
 multiValued="false"/>

 ...

  <dynamicField name="*" type="ignored" multiValued="true" />
  <defaultSearchField>body</defaultSearchField>
  <solrQueryParser defaultOperator="AND"/>

 -

 solrconfig.xml changes:

 <maxFieldLength>2147483647</maxFieldLength>
 <ramBufferSizeMB>128</ramBufferSizeMB>

 -

 The query:

 rowStr = "&rows=10"
 facet = 
 "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
 fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
 termvectors = "&tv=true&qt=tvrh&tv.all=true"
 hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
 regexv = "(?m)^.*\n.*\n.*$"
 hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + 
 "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
 justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, 
 '').gsub(/([:~!=])/,'\1') + fuzzy + minLogSizeStr)

 thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : 
 ('&fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors + hl 
 + hl_regex

 baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + 
 p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s


   


-- 
http://karussell.wordpress.com/



Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Otis Gospodnetic
Hello,

I'm looking for a list of English words that, when stemmed by the Porter stemmer, 
end up with the same stem as some similar, but unrelated, words.  Below are some 
examples:

# this gets stemmed to "iron", so if you search for "ironic", you'll get "iron" 
matches
ironic

# same stem as animal
anime
animated 
animation
animations

I imagine such a list could be added to the example protwords.txt

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Solr Indexing slows down

2010-07-30 Thread Otis Gospodnetic
Peter, there are events in solrconfig where you define warm up queries when a 
new searcher is opened.

There are also cache settings that play a role here.

30-60 seconds is pretty frequent for Solr.
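
For example, in solrconfig.xml (the query below is just a placeholder);
these queries run against every new searcher - i.e. after every commit -
before it starts serving requests:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">solr</str><str name="start">0</str><str name="rows">10</str></lst>
  </arr>
</listener>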

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Peter Karich peat...@yahoo.de
 To: solr-user@lucene.apache.org
 Sent: Fri, July 30, 2010 4:06:48 PM
 Subject: Re: Solr Indexing slows down
 
 Hi Erick!
 
 thanks for the response!
 I will answer your questions  ;-)
 
  How often are you making changes to your index?
 
 Every  30-60 seconds. Too heavy?
 
 
  Do you have autocommit  on?
 
 No.
 
 
  Do you commit when updating each  document?
 
 No. I commit after a batch update of 200  documents
 
 
  Committing too often and consequently firing off  warmup queries is the 
  first 
place I'd look.
 
 Why is committing firing warmup queries? Is there any documentation about
 this subject?
 How can I  be sure that the previous commit has done its magic?
 
  there are  several config values that influence the commit frequency
 
 
 I now know  the autowarm and the mergeFactor config. What else? Is this
 documentation  complete:
 http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ?
 
 Regards,
 Peter.
 
  See the subject about 1500 threads. The  first place I'd look is how
  often you're committing. If you're  committing before the warmup queries
  from the previous commit have done  their magic, you might be getting
  into a death spiral.
 
   HTH
  Erick
 
  On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich  peat...@yahoo.de  wrote:
 
   
  Hi,
 
  I am indexing a Solr 1.4.0 core and committing gets slower and slower.
   Starting from 3-5 seconds for ~200 documents and ending with over 60
   seconds after 800 commits. Then, if I reloaded the index, it is as  fast
  as before! And today I have read a similar thread [1] and  indeed: if I
  set autowarming for the caches to 0 the slowdown  disappears.
 
  BUT at the same time I would like to offer  searching on that core, which
  would be dramatically slowed down (due  to no autowarming).
 
  Does someone know a better solution  to avoid index-slow-down?
 
  Regards,
   Peter.
 
  [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html
 
  
 


Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread Otis Gospodnetic
I suppose you could write a component that just gets this info from 
SolrIndexSearcher and writes it into the response?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: John DeRosa jo...@ipstreet.com
 To: solr-user@lucene.apache.org
 Sent: Fri, July 30, 2010 1:39:03 PM
 Subject: Programmatically retrieving numDocs (or any other statistic)
 
 I want to programmatically retrieve the number of indexed documents. I.e., 
 get  
the value of numDocs.
 
 The only two ways I've come up with are searching  for *:* and reporting 
 the 
hit count, or sending an Http GET to 
http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for <stat 
name="numDocs"> ... </stat> in the response.
 
 Both seem  to be overkill. Is there an easier way to ask SolrIndexSearcher, 
what's  numDocs?
 
 (I'm doing this in Python, using Pysolr, if that  matters.)
 
 Thanks!
 
 


Re: question about relevance

2010-07-30 Thread Otis Gospodnetic
May I suggest looking at some of the related issues, say SOLR-1682


This issue is related to:
  SOLR-1682  Implement CollapseComponent
  SOLR-1311  pseudo-field-collapsing
  LUCENE-1421  Ability to group search results by field
  SOLR-1773  Field Collapsing (lightweight version)
  SOLR-237   Field collapsing

 

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Bharat Jain bharat.j...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Fri, July 30, 2010 10:40:19 AM
 Subject: Re: question about relevance
 
 Hi,
    Thanks a lot for the info and your time. I think field collapsing will work
 for us. I looked at https://issues.apache.org/jira/browse/SOLR-236, but
 which file should I use for the patch? We use Solr 1.3.
 
 Thanks
 Bharat Jain
 
 
 On Fri,  Jul 30, 2010 at 12:53 AM, Chris Hostetter
 hossman_luc...@fucit.orgwrote:
 
 
   : 1. There are user records of type A, B, C etc. (userId field in index  is
  : common to all records)
  : 2. A user can have any number of  A, B, C etc (e.g. think of A being a
  : language then user can know many  languages like french, english, german
  etc)
  : 3. Records are  currently stored as a document in index.
  : 4. A given query can match  multiple records for the user
  : 5. If for a user more records are  matched (e.g. if he knows both french
  and
  : german) then he is  more relevant and should come top in UI. This is the
  : reason I wanted  to add lucene scores assuming the greater score means
  more
  :  relevance.
 
  if your goal is to get back users from each search, then you should
  probably change your indexing strategy so that each user has a single
  document -- fields like language can be multivalued, etc...

  then a search for language:en language:fr will return users who speak
  english or french, and the ones that speak both will score higher.

  if you really can't change the index structure, then essentially what you
  are looking for is a field collapsing solution on the userId field,
  where you want each collapsed group to get a cumulative score.  i don't
  know if the existing field collapsing patches support this -- if you are
  already willing/capable to do it in the client then that may be the
  simplest thing to support moving forward.
 
  Adding the scores is certainly one metric you could use -- it's generally
  suspicious to try and imply too much meaning to scores in lucene/solr but
  that's because people typically try to imply broader absolute meaning.  in
  the case of a single query the scores are relative to each other, and adding
  up all the scores for a given userId is approximately what would happen in
  my example above -- except that there is also a coord factor that would
  penalize documents that only match one clause ... it's complicated, but
  as an approximation adding the scores might give you what you are looking
  for -- only you can know for sure based on your specific data.
 
 
 
  -Hoss
 
 
 


Re: Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Walter Underwood
Some collisions are listed here:

http://www.attivio.com/blog/34-attivio-blog/333-doing-things-with-words-part-three-stemming-and-lemmatization.html

Have you asked Martin Porter? You can find his e-mail here: 
http://tartarus.org/~martin/

wunder

On Jul 30, 2010, at 1:41 PM, Otis Gospodnetic wrote:

 Hello,
 
 I'm looking for a list of English  words that, when stemmed by Porter 
 stemmer, 
 end up in the same stem as  some similar, but unrelated words.  Below are 
 some 
 examples:
 
 # this gets stemmed to iron, so if you search for ironic, you'll get 
 iron 
 matches
 ironic
 
 # same stem as animal
 anime
 animated 
 animation
 animations
 
 I imagine such a list could be added to the example protwords.txt
 
 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/







Stem collision, word protection, synonym hack

2010-07-30 Thread Otis Gospodnetic
Hello,

I'm wondering if anyone has good ideas for handling the following (Porter) 
stemming problem.
The word "city" gets stemmed to "citi".  But "citi" is short for "citibank", so 
we have a conflict - the stems of both "city" and "citi" are "citi", so when 
you search for "city", you will get matches that are really about citi(bank).

Now, we could put "citi" in the do-not-stem list (protwords.txt), but it will 
be of no use because "citi" is already in the fully stemmed form.  This leaves 
the option of not stemming "cities" or "city" (and perhaps making "city" a 
synonym for "cities" as a workaround) by adding those words to protwords.txt, 
but this feels like a kluge.

Are there more elegant solutions for cases like this one?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Stem collision, word protection, synonym hack

2010-07-30 Thread Robert Zotter

Otis,

https://issues.apache.org/jira/browse/LUCENE-2055 may be of some help.

cheers

On 7/30/10 2:18 PM, Otis Gospodnetic wrote:

Hello,

I'm wondering if anyone has good ideas for handling the following (Porter)
stemming problem.
The word "city" gets stemmed to "citi".  But "citi" is short for "citibank", so
we have a conflict - the stems of both "city" and "citi" are "citi", so when you
search for "city", you will get matches that are really about citi(bank).

Now, we could put "citi" in the do-not-stem list (protwords.txt), but it will
be of no use because "citi" is already in the fully stemmed form.  This leaves
the option of not stemming "cities" or "city" (and perhaps making "city" a
synonym for "cities" as a workaround) by adding those words to protwords.txt,
but this feels like a kluge.

Are there more elegant solutions for cases like this one?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

   




Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread John DeRosa
Thanks!

On Jul 30, 2010, at 1:11 PM, Peter Karich wrote:

 Both approaches are ok, I think. (although I don't know the python API)
 BTW: If you query q=*:* then add rows=0 to avoid some traffic.
 
 Regards,
 Peter.
 
 I want to programmatically retrieve the number of indexed documents. I.e., 
 get the value of numDocs.
 
 The only two ways I've come up with are searching for *:* and reporting 
 the hit count, or sending an Http GET to 
 http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for <stat
 name="numDocs"> ... </stat> in the response.
 
 Both seem to be overkill. Is there an easier way to ask SolrIndexSearcher, 
 what's numDocs?
 
 (I'm doing this in Python, using Pysolr, if that matters.)
 
 Thanks!



Re: Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Yonik Seeley
On Fri, Jul 30, 2010 at 4:41 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 I'm looking for a list of English  words that, when stemmed by Porter stemmer,
 end up in the same stem as  some similar, but unrelated words.  Below are some
 examples:

 # this gets stemmed to iron, so if you search for ironic, you'll get 
 iron
 matches
 ironic

 # same stem as animal
 anime
 animated
 animation
 animations

 I imagine such a list could be added to the example protwords.txt

+1

No reason to make everyone come up with their own list.
Unless a good list already exists out there... we could semi-automate
it by running a large corpus through the stemmer and then for each
stem, list the original words.  The manual part would be looking at
the output to see the collisions (unless someone has a better idea).

-Yonik
http://www.lucidimagination.com


Some basic DataImportHandler questions

2010-07-30 Thread Harry Smith
Just starting with DataImportHandler and had a few simple questions.

Is there a location for more in-depth documentation other than
http://wiki.apache.org/solr/DataImportHandler?

Specifically I was looking for a detailed document outlining
data-config.xml, the fields and attributes and how they are used.

* Is there a way to dynamically generate field elements from the
supplied SQL statement?

Example: Suppose one has a table of 100 fields. Entering this manually
for each field is not very efficient.

i.e., if the table has only 3 columns, this is easy enough...

<entity name="item" query="select * from item">
    <field column="ID" name="id" />
    <field column="NAME" name="name" />
    <field column="MANU" name="manu" />
</entity>

What are the options if ITEM table has dozens or hundreds?


* Is there a way to apply insert logic based on the value of the incoming field?

My specific use case would be, if the incoming value is null, do not
add to Solr.

i.e. the record is:
ID : 50
NAME : Blahblah
MANU : null

<entity name="item" query="select * from item">
    <field column="ID" name="id" />
    <field column="NAME" name="name" />
    <field column="MANU" name="manu" />
</entity>

Using the following in data-config.xml, is there a way to ignore null
fields altogether? I see some special commands listed, such as
$skipRecord; is there some type of $skipField operation?

Thanks


Re: Solr Indexing slows down

2010-07-30 Thread Peter Karich
Hi Otis,

does it mean that a new searcher is opened after I commit?
I thought only on startup...(?)

Regards,
Peter.

 Peter, there are events in solrconfig where you define warm up queries when a 
 new searcher is opened.

 There are also cache settings that play a role here.

 30-60 seconds is pretty frequent for Solr.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
   
 From: Peter Karich peat...@yahoo.de
 To: solr-user@lucene.apache.org
 Sent: Fri, July 30, 2010 4:06:48 PM
 Subject: Re: Solr Indexing slows down

 Hi Erick!

 thanks for the response!
 I will answer your questions  ;-)

 
 How often are you making changes to your index?
   
 Every  30-60 seconds. Too heavy?


 
 Do you have autocommit  on?
   
 No.


 
 Do you commit when updating each  document?
   
 No. I commit after a batch update of 200  documents


 
 Committing too often and consequently firing off  warmup queries is the 
 first 
   
 place I'd look.

 Why is commiting firing  warmup queries? Is there any documentation about
 this subject?
 How can I  be sure that the previous commit has done its magic?

 
 there are  several config values that influence the commit frequency
   

 I now know  the autowarm and the mergeFactor config. What else? Is this
 documentation  complete:
 http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ?

 Regards,
 Peter.

 
 See the subject about 1500 threads. The  first place I'd look is how
 often you're committing. If you're  committing before the warmup queries
 from the previous commit have done  their magic, you might be getting
 into a death spiral.

  HTH
 Erick

 On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich  peat...@yahoo.de  wrote:

  
   
 Hi,

  I am indexing a Solr 1.4.0 core and committing gets slower and slower.
  Starting from 3-5 seconds for ~200 documents and ending with over 60
  seconds after 800 commits. Then, if I reloaded the index, it is as  fast
 as before! And today I have read a similar thread [1] and  indeed: if I
 set autowarming for the caches to 0 the slowdown  disappears.

 BUT at the same time I would like to offer  searching on that core, which
 would be dramatically slowed down (due  to no autowarming).

 Does someone know a better solution  to avoid index-slow-down?

 Regards,
  Peter.

 [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html

 
 
 
   


-- 
http://karussell.wordpress.com/



RE: Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Burton-West, Tom
A good starting place might be the list of stemming errors for the original 
Porter stemmer in this article that describes k-stem:

Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings 
of the 16th annual international ACM SIGIR conference on Research and 
development in information retrieval (pp. 191-202). Pittsburgh, Pennsylvania, 
United States: ACM. doi:10.1145/160688.160718

I don't know whether the current Porter stemmer is different. I do see that the 
Snowball page offers both a porter and a porter2 stemmer, and this explanation is 
linked from the porter2 stemmer page: 
http://snowball.tartarus.org/algorithms/english/stemmer.html


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Friday, July 30, 2010 4:42 PM
To: solr-user@lucene.apache.org
Subject: Good list of English words that get butchered by Porter Stemmer

Hello,

I'm looking for a list of English words that, when stemmed by the Porter stemmer, 
end up with the same stem as some similar but unrelated words. Below are some 
examples:

# this gets stemmed to iron, so if you search for ironic, you'll get iron 
matches
ironic

# same stem as animal
anime
animated 
animation
animations

I imagine such a list could be added to the example protwords.txt
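
A minimal sketch of what such protwords.txt entries would look like (one
protected term per line; terms listed there bypass the stemmer entirely):

ironic
anime
animated
animation
animations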

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Good list of English words that get butchered by Porter Stemmer

2010-07-30 Thread Robert Muir
Otis,

I think this is a great idea.

you could also go even further by making a better example for
StemmerOverrideFilter (stemdict.txt):
http://wiki.apache.org/solr/LanguageAnalysis#solr.StemmerOverrideFilterFactory

for example:
animated tab animate
animation tab animation
animations tab animation

this might be a bit better (but more work!) than protected words, since then
you could let animation and animations conflate rather than just forcing
them all to stay unchanged. I wouldn't go crazy and worry about animator
matching animation etc., but I would at least let plural forms match the
singular without screwing other things up.
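
A sketch of how stemdict.txt could be wired into a schema.xml analyzer chain
(the field type name and stemmer choice are illustrative; the override filter
must sit before the stemmer so its mappings win):

<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- apply the manual mappings first; matched tokens are marked as keywords -->
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" ignoreCase="true"/>
    <!-- keyword-marked tokens pass through the stemmer unchanged -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>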

On Fri, Jul 30, 2010 at 4:41 PM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Hello,

 I'm looking for a list of English  words that, when stemmed by Porter
 stemmer,
 end up in the same stem as  some similar, but unrelated words.  Below are
 some
 examples:

 # this gets stemmed to iron, so if you search for ironic, you'll get
 iron
 matches
 ironic

 # same stem as animal
 anime
 animated
 animation
 animations

 I imagine such a list could be added to the example protwords.txt

 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/




-- 
Robert Muir
rcm...@gmail.com


Re: Solr searching performance issues, using large documents

2010-07-30 Thread Lance Norskog
Wait - how much text are you highlighting? You say these logfiles are X
big - how big are the actual documents you are storing?



On Fri, Jul 30, 2010 at 1:16 PM, Peter Karich peat...@yahoo.de wrote:
 Hi Peter :-),

 did you already try other values for

 hl.maxAnalyzedChars=2147483647

 ? Also, regular expression highlighting is more expensive, I think.
 What does the 'fuzzy' variable mean? If you use it to query via
 ~someTerm instead of someTerm,
 then you should try the trunk of Solr, which is a lot faster for fuzzy and
 other wildcard searches.
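
 For example (values illustrative), capping the analyzed window often
 restores highlighting speed on huge stored documents:

 /solr/select?q=body:someTerm&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400&hl.maxAnalyzedChars=51200

 That makes the highlighter look only at the first ~50k characters of each
 document for snippets instead of walking a 70MB body.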

 Regards,
 Peter.

 Data set: About 4,000 log files (will eventually grow to millions).  Average 
 log file is 850k.  Largest log file (so far) is about 70MB.

 Problem: When I search for common terms, the query time goes from under 2-3 
 seconds to about 60 seconds.  TermVectors etc are enabled.  When I disable 
 highlighting, performance improves a lot, but is still slow for some queries 
 (7 seconds).  Thanks in advance for any ideas!


 -Peter


 -

 4GB RAM server
 % java -Xms2048M -Xmx3072M -jar start.jar

 -

 schema.xml changes:

     <fieldType name="text_pl" class="solr.TextField">
       <analyzer>
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
           generateNumberParts="0" catenateWords="0" catenateNumbers="0"
           catenateAll="0" splitOnCaseChange="0"/>
       </analyzer>
     </fieldType>

 ...

    <field name="body" type="text_pl" indexed="true" stored="true"
      multiValued="false" termVectors="true" termPositions="true"
      termOffsets="true" />
    <field name="timestamp" type="date" indexed="true" stored="true"
      default="NOW" multiValued="false"/>
    <field name="version" type="string" indexed="true" stored="true"
      multiValued="false"/>
    <field name="device" type="string" indexed="true" stored="true"
      multiValued="false"/>
    <field name="filename" type="string" indexed="true" stored="true"
      multiValued="false"/>
    <field name="filesize" type="long" indexed="true" stored="true"
      multiValued="false"/>
    <field name="pversion" type="int" indexed="true" stored="true"
      multiValued="false"/>
    <field name="first2md5" type="string" indexed="false" stored="true"
      multiValued="false"/>
    <field name="ckey" type="string" indexed="true" stored="true"
      multiValued="false"/>

 ...

  <dynamicField name="*" type="ignored" multiValued="true" />
  <defaultSearchField>body</defaultSearchField>
  <solrQueryParser defaultOperator="AND"/>

 -

 solrconfig.xml changes:

     <maxFieldLength>2147483647</maxFieldLength>
     <ramBufferSizeMB>128</ramBufferSizeMB>

 -

 The query:

 rowStr = "&rows=10"
 facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
 fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
 termvectors = "&tv=true&qt=tvrh&tv.all=true"
 hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
 regexv = "(?m)^.*\n.*\n.*$"
 hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
   "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
 justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, '').gsub(/([:~!=])/, '\\\1') + fuzzy + minLogSizeStr)

 thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : ('&fq=' + p['fq'].to_s)) + justq + rowStr + facet + fields + termvectors + hl + hl_regex

 baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s





 --
 http://karussell.wordpress.com/





-- 
Lance Norskog
goks...@gmail.com


Re: Solr Indexing slows down

2010-07-30 Thread Otis Gospodnetic
As you make changes to your index, you probably want to see the new/modified 
documents in your search results. In order to do that, a new searcher needs 
to be opened, and that happens on commit.
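
Those warm-up queries live in solrconfig.xml as a newSearcher event listener;
a minimal sketch (the query values are illustrative):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- run against the new searcher after each commit, before it serves traffic -->
    <lst><str name="q">solr</str><str name="start">0</str><str name="rows">10</str></lst>
  </arr>
</listener>

Related knobs: maxWarmingSearchers caps how many searchers may warm
concurrently (commits arriving faster than warming completes then fail
instead of piling up), and each cache's autowarmCount controls how much is
copied into the new searcher's caches.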
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Peter Karich peat...@yahoo.de
 To: solr-user@lucene.apache.org
 Sent: Fri, July 30, 2010 6:19:03 PM
 Subject: Re: Solr Indexing slows down

 Hi Otis,

 does it mean that a new searcher is opened after I commit?
 I thought only on startup...(?)

 Regards,
 Peter.
 


Re: Some basic DataImportHandler questions

2010-07-30 Thread Shalin Shekhar Mangar
On Sat, Jul 31, 2010 at 3:40 AM, Harry Smith harrysmith...@gmail.comwrote:

 Just starting with DataImportHandler and had a few simple questions.

 Is there a location for more in depth documentation other than
 http://wiki.apache.org/solr/DataImportHandler?


Umm, no, but let us know what is not covered well and it can be added.


 Specifically I was looking for a detailed document outlining
 data-config.xml, the fields and attributes and how they are used.

 * Is there a way to dynamically generate field elements from the
 supplied sql statement?

 Example: Suppose one has a table of 100 fields. Entering this manually
 for each field is not very efficient.

 i.e., if the table has only 3 columns, this is easy enough...

 <entity name="item" query="select * from item">
   <field column="ID" name="id" />
   <field column="NAME" name="name" />
   <field column="MANU" name="manu" />
 </entity>

 What are the options if the ITEM table has dozens or hundreds of columns?


Yes! You do not need to specify the column names as long as your Solr schema
defines the same field names.

See http://wiki.apache.org/solr/DataImportHandler#A_shorter_data-config
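
A minimal sketch of that shorter form, assuming the schema already defines
fields named id, name, manu, etc. matching the column names (case does not
matter):

<entity name="item" query="select * from item"/>

Every returned column is then mapped to the schema field of the same name;
columns with no matching field should simply be ignored.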



 * Is there a way to apply insert logic based on the value of the incoming
 field?

 My specific use case would be, if the incoming value is null, do not
 add to Solr.


DIH Transformers are the way. However, in this particular case, you do not
need to worry because nulls are not inserted into the index.
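
If you did want to drop a field conditionally, a ScriptTransformer is one
way; a sketch under assumed names (skipEmptyManu and the MANU column are
illustrative):

<dataConfig>
  <script><![CDATA[
    // called once per row; removing a key keeps that field out of the document
    function skipEmptyManu(row) {
      var manu = row.get('MANU');
      if (manu == null || manu == '') row.remove('MANU');
      return row;
    }
  ]]></script>
  <document>
    <entity name="item" query="select * from item" transformer="script:skipEmptyManu"/>
  </document>
</dataConfig>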

-- 
Regards,
Shalin Shekhar Mangar.


Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread Chris Hostetter

: I want to programmatically retrieve the number of indexed documents. I.e., 
get the value of numDocs.

Index level stats like this can be fetched from the LukeRequestHandler in 
any recent version of Solr...
http://localhost:8983/solr/admin/luke?numTerms=0
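
The count appears in the index section of the response; roughly (values
illustrative):

<lst name="index">
  <int name="numDocs">1234</int>
  <int name="maxDoc">1256</int>
  ...
</lst>

Adding wt=json returns the same data as JSON if that is easier to parse
programmatically.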

In future releases (i.e., already in trunk and branch_3x) there is also the 
SolrInfoMBeanRequestHandler, which will replace registry.jsp and stats.jsp:

https://issues.apache.org/jira/browse/SOLR-1750
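
In trunk that handler is registered at /admin/mbeans if I recall correctly,
e.g. http://localhost:8983/solr/admin/mbeans?stats=true - treat the exact
path as an assumption and check the issue above for details.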


-Hoss