Re: Explicitly tell Solr the analyzed value when indexing a document

2011-11-17 Thread Ahmet Arslan
 I have a couple of string fields. For some of them I want from my
 application to be able to index a lowercased string but store the
 original value. Is there some way to do this? Or would I have to come
 up with a new field type and implement an analyzer?

If you have stored="true" in your field definition, Solr always stores the
original value, and the response returns that original, stored value.

Search and faceting are done against the indexed values, so you don't need
to do anything special in your case.
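
For example, a minimal sketch of such a setup (type and field names are
illustrative, not from your schema): the KeywordTokenizer keeps the whole value
as one token, the lowercase filter affects only what is indexed, and
stored="true" keeps the original text for the response.

<fieldType name="string_lc" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- keep the whole value as a single token, then lowercase it for the index -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="my_string" type="string_lc" indexed="true" stored="true"/>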




Re: Explicitly tell Solr the analyzed value when indexing a document

2011-11-17 Thread Tim Terlegård
 I have a couple of string fields. For some of them I want from my
 application to be able to index a lowercased string but store the
 original value. Is there some way to do this? Or would I have to come
 up with a new field type and implement an analyzer?

 If you have stored="true" in your field definition, Solr always stores the
 original value, and the response returns that original, stored value.

 Search and faceting are done against the indexed values, so you don't need
 to do anything special in your case.

I want faceting on a string field to return monkey while the
original value is *Monkey. So I want the indexed value to be lowercase and
the stored value to be the original. That is, I want to do the analysis in my
application and tell Solr what to use for indexed and what to use for
stored.

/Tim


Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Jak Akdemir
Hi,

I was trying to configure a Solr instance with near real-time search
and auto-complete capabilities. I am stuck on the NRT feature. There are
15 new records per second inserted into the database (MySQL), and I
index them with DIH. First, I tried to manage autoCommits from
solrconfig.xml with the configuration below.

<autoCommit>
  <maxDocs>1</maxDocs>
  <maxTime>10</maxTime>
</autoCommit>

<autoSoftCommit>
  <maxDocs>15</maxDocs>
  <maxTime>1000</maxTime>
</autoSoftCommit>

And the bash script below is responsible for fetching deltas without
committing.

while [ 1 ]; do
wget -O /dev/null 'http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false' 2>/dev/null
sleep 1
done

Then I run my query from the browser:
http://localhost:8080/solr-jak/select?q=movie_name_prefix_full:dogville&defType=lucene&q.op=OR

But I realized that with this configuration the index files change every
second, and after a minute there are only 600 new records in the Solr index
while there are 900 new records in the database.
After experiencing that, I removed the autoCommit and autoSoftCommit elements in
solrconfig.xml and updated my bash script as follows. But the index files
still change, and Solr cannot stay synchronized with the database.

while [ 1 ]; do
echo "Soft commit applied!"
wget -O /dev/null 'http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false' 2>/dev/null
curl http://localhost:8080/solr-jak/update -H 'Content-Type: text/xml' \
  --data-binary '<commit softCommit="true" waitFlush="false" waitSearcher="false"/>' 2>/dev/null
sleep 3
done

Even when I decreased the pressure on Solr to 1 new record per second with soft
commits every 6 seconds, there is still a gap between the index and the db. Is there
anything that I missed? I took a look at /get too, but it only works
by pk. If there is an example configuration (like 1 sec for soft
commit and 10 min for hard commit) as a best practice, that would be great.

Finally, here is my configuration.
Ubuntu 11.04
JDK 1.6.0_27
Tomcat 7.0.21
Solr 4.0 2011-10-24_08-53-02

Any advice is appreciated,

Best Regards,

Jak


Re: Explicitly tell Solr the analyzed value when indexing a document

2011-11-17 Thread Tim Terlegård
 I have a couple of string fields. For some of them I want from my
 application to be able to index a lowercased string but store the
 original value. Is there some way to do this? Or would I have to come
 up with a new field type and implement an analyzer?

I think I should be able to do what I want to do with
solr.PatternReplaceCharFilterFactory.
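
Something roughly like this is what I have in mind (just a sketch; the type name
and pattern are illustrative and assume the leading '*' in my example is literal):

<fieldType name="facet_string" class="solr.TextField">
  <analyzer>
    <!-- illustrative: strip a leading asterisk before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^\*" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>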

/Tim


Re: Explicitly tell Solr the analyzed value when indexing a document

2011-11-17 Thread Ahmet Arslan
 I want faceting on a string field to return monkey while the
 original value is *Monkey. So I want the indexed value to be lowercase and
 the stored value to be the original. That is, I want to do the analysis in
 my application and tell Solr what to use for indexed and what to use for
 stored.

Sorry, but I don't follow. Why don't you just use the lowercase filter?


Re: Highlighting apostrophe

2011-11-17 Thread rychu
Hi, have you found the solution to your highlighting apostrophe problem?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Highlighting-apostrophe-tp731155p3515139.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread neuron005
Thank you for your replies, guys, that helped a lot. Thanks
iorixxx, that was the command that worked out.
I also tried my Solr with MySQL and that worked too. Congo! :)
Now I want to index my files according to their size and facet them
by size ranges. I know that there is a fileSize option
in FileListEntityProcessor, but I cannot find a way to make this work.
Is fileSize a metadata field? If it is, then these are the steps I performed:
-
I created a field and a dynamic field in the schema, as follows:

<dynamicField name="metadata_*" type="string" indexed="true" stored="true"
  multiValued="false"/>
<field name="fileSize" type="string" indexed="true" stored="true"
  required="false"/>

I added a range facet in solrconfig.xml, and in data-config.xml I added a field
mapping for it:
--- solrconfig.xml ---
<int name="f.fileSize.facet.range.start">0</int>
<int name="f.fileSize.facet.range.gap">100</int>
<int name="f.fileSize.facet.range.end">600</int>
--- data-config.xml ---
<field column="FileSize" name="fileSize"/>
---
But that did not work out!
Am I missing something?
Please help me out.
Thanks in advance.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/delta-import-of-rich-documents-like-word-and-pdf-files-tp3502039p3515298.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: to prevent number-of-matching-terms in contributing score

2011-11-17 Thread Samarendra Pratap
On Thu, Nov 17, 2011 at 6:59 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 :  1. omitTermFreqAndPositions is very straightforward but if I avoid
 : positions I'll refuse to serve phrase queries. I had searched for this in

 but do you really need phrase queries on your cat field?  i thought the
 point was to have simple matching on those terms?

 Yes, I need to match phrases. Consider the following documents:
Doc1 - categories: teak wooden chair, bamboo wooden chair
Doc2 - categories: wooden chair
Doc3 - categories: plastic chair, wooden cupboard.

A query "wooden chair" should give doc1 and doc2 equal scores (provided the
other fields generate the same score), and doc3 should be excluded. A non-phrase
match would include doc3 as well.


:  2. Function query seemed nice (though strange because I never used it
 : before) and I gave it a few hours but that too did not seem to solve my
 : requirement. The artificial score we are generating is getting
 multiplied
 : into rest of the score which includes score due to cat field as well.
 (I
 : can not remove cat from qf as I have to search there). It is only
 that
 : I don't want this field's score on the basis of matching tf.

 I don't think i realized you were using dismax ... if you just want a
 match on cat to help determine if the document is a match, but not have
 *any* impact on score, you could just set the qf boost to 0 (ie:
 qf=title^10 cat^0) but i'm not sure if that's really what you want.

 Well this is almost what I want. (Thanks for telling me about ^0. I
learned a new thing.).
I wanted a constant score for a match in cat and I did not want the
frequency of match in cat to affect the score which can be done this way.
But I definitely want to generate some score, equal to single match (tf =
1) so that less important fields like description may not get higher
boost than cat. Writing ^0 creates 0.00 score for a match in cat while
a match in description will generate some positive score greater than
zero (0).



 : After spending some hours on function queries I finally reached on
 : following query

 Honestly: i'm not really following what you tried there because of the
 formatting applied by your email client ... it seemed to be making tons of
 hyperlinks out of peices of the URL.

 Looking at your query explanation however the problem seems to be that you
 are still using the relevancy score of the matches on the cat field,
 instead of *just* using hte function boost...

 I did try *just* using the function boost, i.e. removed "cat" from
qf, but it did not seem to return documents whose matching
categories are only in the cat field. The query was something like the following
(I hope it is clearer this time):

url?q={!boost b=$cat_boost v=$main_query}
*main_query={!dismax qf=title v=$qry}*
cat_boost={!func}map(query({!field f=cat v=$qry},-1),0,1000,5,1)
qry=chair
...

(note: i slightly modified the cat_boost parameter to use only single map()
function with 5 argument form)

It gave me just two docs where title contained the query word (chair)

I also tried changing main_query like
*main_query={!dismax qf=title cat v=$qry}*
which gave me all 4 required docs but with scores varying on the basis of
cat as well

and
*main_query={!dismax qf=title cat^0 v=$qry}*
which gave me all required docs with a constant (0.0) cat score. but when
I'll add description in qf, docs even with worst matching in
description will score higher than docs with a good match in cat which
is not exactly what is required.



 : But debugging the query showed that the boost value ($cat_boost) is being
 : multiplied into a value which is generated with the help of cat field
 : thus resulting in different scores for 1 and 3 (similarly for 2 and 4).
 :
 : 1.2942866 = (MATCH) boost(+(title:chair | cat:chair)~0.01
 : (),map(query(cat:chair,def=-1.0),0.0,1000.0,1.0)), product of:

 ...my point before was to take cat:chair out of the main part of your
 query, and *only* put it in the boost function.  if you are using dismax,
 the qf=cat^0 suggestion mentioned above *combined* with your boost
 function will probably get you what you want (i think)

 taking cat:chair out of main_query (dismax equivalent - removing cat
from qf) or using cat^0 did not produce desired effect as I described
earlier


 : I was thinking there should be some hook or plugin (or anything) which
 : could just change the score calculation formula *for a particular field*.
 : There is a function in DefaultSimilarity class - *public float tf(float
 : freq)* but that does not mention the field name. Is there a possibility
 to
 : look into this direction?

 on trunk, there is a distinct Similarity object per fieldtype, so you
 could certain look at that -- but you are correct that in 3x there is no
 way to override the tf() function on a per field basis.

 I'll definitely look at the Similarity class. I hope there are no
performance degradation issues with it :)


 -Hoss


Thank you very much.

-- 
Regards,
Samar


Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread Ahmet Arslan
 Now, I want to index my files according to their size and facet them
 according to their size ranges. I know that there is an option of fileSize
 in FileListEntityProcessor but I am not getting any way to perform this.
 Is fileSize a metadata?

You don't need a dynamic field for this. The following additions should enable 
and populate fileSize:

in data-config.xml:

<entity name="f" processor="FileListEntityProcessor" ... >
  <field column="fileSize" name="fileSize"/>
</entity>

in schema.xml:

<field name="fileSize" type="string" indexed="true" stored="true"
  required="false"/>





Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread neuron005
Thanks for your reply, I performed these steps.
in data-config.xml:

<entity name="f" processor="FileListEntityProcessor" ... >
  <field column="fileSize" name="fileSize"/>
</entity>

in schema.xml:

<field name="fileSize" type="string" indexed="true" stored="true"
  required="false"/>
--
But there is still no response in the /browse section. I edited facet_ranges.vm
for this. It does not calculate the size of the documents. Can you please tell
me the command to check that the response shows the size of the file?
Thanks again


--
View this message in context: 
http://lucene.472066.n3.nabble.com/delta-import-of-rich-documents-like-word-and-pdf-files-tp3502039p3515495.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread neuron005
Also, I set my fileSize field to type long. String will not work, I think!
Size cannot be a string... it shows an error when using string as the type.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/delta-import-of-rich-documents-like-word-and-pdf-files-tp3502039p3515505.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread neuron005
I ran this command and can see the size of my files:
http://localhost:8080/solr/select?q=user&f.fileSize.facet.range.start=100
Great, thanks... string worked... I don't know why that did not work last time.

But when I do that in the /browse section, I see the following output in my logs:
SEVERE: Exception during facet.range of
fileSize:org.apache.solr.common.SolrException: Unable to range facet on
field:fileSize{type=string,properties=indexed,stored,omitNorms,omitTermFreqAndPositions,sortMissingLast}
at
org.apache.solr.request.SimpleFacets.getFacetRangeCounts(SimpleFacets.java:834)
at
org.apache.solr.request.SimpleFacets.getFacetRangeCounts(SimpleFacets.java:778)
at
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:178)
at
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
at
org.apache.solr.servl..

This does not happen when I set it to type int, but when I use int it does not
show the size!
Please help me out.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/delta-import-of-rich-documents-like-word-and-pdf-files-tp3502039p3515567.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Master High Availability

2011-11-17 Thread KARHU Toni
Hi, I'm looking into high-availability Solr master configurations. Does anybody
have a good solution for this? The things I'm looking into are:

*   Using Solr replication to keep a second, backup master.
*   Indexing on a separate machine(s); the problem here is that its index
will differ from the other machine's, requiring a full replication to all
slaves if the first master fails.
*   Having the whole setup replicated to another machine which is then used
as the master machine if the primary master fails?

Any more ideas/experiences?

Toni


ISO8601 Date format

2011-11-17 Thread Gerke, Axel
Hello,

Due to a bug in another system, we stored a date in a date field
with a value like 999-12-31T23:00:00Z. As you can see in the schema
browser output below, Solr stores it correctly with four digits, but in a response
the leading zero is missing.

My question is: is a three-digit year a valid ISO 8601 date format for
the response, or is this a bug? Other languages (e.g. Python) throw an
exception on a three-digit year.

Response:
<doc>
...
<date name="effective">999-12-31T23:00:00Z</date>
...
</doc>

Schema browser:
Field: effective
Field Type: date
Properties: Indexed, Tokenized, Stored, undefined
Schema: Indexed, Tokenized, Stored, undefined
Index: Indexed, Tokenized, Stored
Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
Docs: 86727
Distinct: 4
term                        frequency
0999-12-31T23:00:00Z        165602
2011-11-05T23:00:00Z        3543
2011-10-19T07:22:20.908Z    2
2011-10-12T15:40:00Z        2

Thx and best regards,

Axel 




Re: Aggregated indexing of updating RSS feeds

2011-11-17 Thread sbarriba
Thanks Chris.

(Bell rings)

The 'params' logging pointer was what I needed. So for reference, it's not a
good idea to use a 'wget' command directly in a crontab.
I was using:

wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false

...but moving this into a separate shell script, wrapping the URL in quotes
and calling that resolved the issue.

Thanks very much.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3515388.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Aggregated indexing of updating RSS feeds

2011-11-17 Thread Michael Kuhlmann

Am 17.11.2011 11:53, schrieb sbarriba:

The 'params' logging pointer was what I needed. So for reference, it's not a
good idea to use a 'wget' command directly in a crontab.
I was using:

wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false


:))

I think the shell handled the & sign as an instruction to put the wget command
into the background.


You could put the full URL into quotes, or escape the & sign with a
backslash. Then it should work as well.


-Kuli


Re: Problems with AutoSuggest feature(Terms Components)

2011-11-17 Thread Erick Erickson
TermsComponent only reacts to what you send it. How are these requests
getting to the TermsComponent?  That's where you should look.

As far as terms.limit goes, your request handler for the TermsComponent in
solrconfig.xml has a defaults section; you can set whatever
you want in there and then override it per request if you
sometimes want other values.
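
For example (the handler name and limit value are illustrative, modeled on the
stock example solrconfig.xml):

<searchComponent name="terms" class="solr.TermsComponent"/>

<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <int name="terms.limit">20</int>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>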

Best
Erick

On Wed, Nov 16, 2011 at 9:17 AM, mechravi25 mechrav...@yahoo.co.in wrote:
 Hi,

 When I search for data I noticed two things:

 1.) I noticed terms.regex=.* in the logs, which does a blank search
 on terms, and because of it the query time is high. Is there any way to
 overcome this? My actual query should go like the first case below, but
 instead it ends up like the second case.

 2.) I also noticed terms.limit=-1, which is very expensive as it asks
 Solr to return all the terms. It should be set to 10 or 20 at most.
 Please provide some suggestions on how to set the same.



 Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
 INFO: [db] webapp=/solr path=/terms
 params={terms.regex=ABC\+CCC\+lll\+data.*&terms.regex.flag=case_insensitive&terms.fl=nameFacet}
 status=0 QTime=935
 Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
 INFO: [core2] webapp=/solr path=/terms
 params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=ABC\+CCC\+lll\+data.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1}
 status=0 QTime=842
 Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
 INFO: [db] webapp=/solr path=/terms
 params={terms.regex=ABC\+CCC\+lll\+data.*&terms.regex.flag=case_insensitive&terms.fl=nameFacet}
 status=0 QTime=927
 Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
 INFO: [core3] webapp=/solr path=/terms
 params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1}
 status=0 QTime=115

 Nov 14, 2011 2:05:55 PM org.apache.solr.core.SolrCore execute
 INFO: [core1] webapp=/solr path=/terms
 params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1}
 status=0 QTime=106767
 Nov 14, 2011 2:05:55 PM org.apache.solr.core.SolrCore execute
 INFO: [core4] webapp=/solr path=/terms
 params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1}
 status=0 QTime=106766
 Nov 14, 2011 2:05:55 PM org.apache.solr.core.SolrCore execute

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Problems-with-AutoSuggest-feature-Terms-Components-tp3512734p3512734.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Phrase between quotes with dismax edismax

2011-11-17 Thread Jean-Claude Dauphin
Thanks Erick for your prompt response.

I am not sure, but I think I found why the phrase "chef de projet" is not
found by dismax and edismax.
The following terms are indexed and can be seen with Luke:
   chef
   projet
   chef de projet
When searching for the phrase "chef de projet", the terms 'chef' and
'projet' are found in the index but 'de' is not found, and thus no results.
Please note that it works well with the standard Lucene QueryParser.

This is just what I suspect; does it sound correct?

Best wishes,

Jean-Claude

On Wed, Nov 16, 2011 at 9:26 PM, Erick Erickson erickerick...@gmail.comwrote:

 Ah, ok I was mis-reading some things. So, let's ignore the
 category bits for now.

 Questions:
 1 Can you refine down the problem. That is,
demonstrate this with a single field and leave out
the category stuff. Something like
q=title:chef de projet getting no results and
q=title:chef projet getting results? The idea
is to cycle through all the fields to see if we can
hone in on the problem. I'd get rid of any pf
parameters of your edismax definition too. I'm after
   the simplest case that can demonstrate the issue.
   For that matter, it'd be even easier if you could
   make this happen with the default searcher (
   solr/select?q=title:chef de projet
 2 if you can do 1, please post the field definitions
 from your schema.xml file. One possibility is that
 you are removing stopwords at index time but not
 query time or vice-versa, but that's a wild guess.
 3 Once you have a field, use the admin/analysis page
 to see the exact transformations that occur at index
 and query time to see if anything jumps out.

 All in all, I suspect you have a field that isn't being parsed
 as you expect at either index or query time, but as I said
 above, that's a guess.

 Best
 Erick

 On Wed, Nov 16, 2011 at 5:02 AM, Jean-Claude Dauphin
 jc.daup...@gmail.com wrote:
  Thanks Erick for yr quick answer.
 
  I am using Solr 3.1
 
  1) I have set the mm parameter to 0 and removed the categories from the
  search. Thus the query is only for chef de projet and nothing else.
  But the problem remains, i.e searching for chef de projet gives no
  results while searching for chef projet gives the right result.
 
  Here is an excerpt from the test I made:
 
  DISMAX query (q)=(chef de projet)
 
  =The Parameters=
 
  *queryResponse*=[{responseHeader={status=0,QTime=157,
 
  params={facet=true,
 
  f.createDate.facet.date.start=NOW/DAY-6DAYS,tie=0.1,
 
  facet.limit=4,
 
  f.location.facet.limit=3,
 
  *q.alt*=*:*,
 
  facet.date.other=all,
 
  hl=true,version=2,
 
  *bq*=[categoryPayloads:category1071^1,
  categoryPayloads:category10055078^1,
 categoryPayloads:category10055405^1],
 
  fl=*,score,
 
  debugQuery=true,
 
  facet.field=[soldProvisions, contractTypeText, nafCodeText, createDate,
  wage, keywords, labelLocation, jobCode, organizationName,
  requiredExperienceLevelText],
 
  *qs*=3,
 
  qt=edismax,
 
  facet.date.end=NOW/DAY,
 
  *mm*=0,
 
  facet.mincount=1,
 
  facet.date=createDate,
 
  *qf*= title^4.0 formattedDescription^2.0 nafCodeText^2.0 jobCodeText^3.0
  organizationName^1.0 keywords^3.0 location^1.0 labelLocation^1.0
  categoryPayloads^1.0,
 
  hl.fl=title,
 
  wt=javabin,
 
  rows=20,
 
  start=0,
 
  *q*=(chef de projet),
 
  facet.date.gap=+1DAY,
 
  *stopwords*=false,
 
  *ps*=3}},
 
  The Solr Response
  response={numFound=0
 
  Debug Info
 
  debug={
 
  *rawquerystring*=(chef de projet),
 
  *querystring*=(chef de projet),
 
  *---
  *
 
  *parsedquery*=
 
  +*DisjunctionMaxQuery*((title:chef de projet~3^4.0 | keywords:chef de
  projet^3.0 | organizationName:chef de projet | location:chef de projet |
  formattedDescription:chef de projet~3^2.0 | nafCodeText:chef de
  projet^2.0 | jobCodeText:chef de projet^3.0 | categoryPayloads:chef de
  projet~3 | labelLocation:chef de projet)~0.1)
  *DisjunctionMaxQuery*((title:((chef
  chef) de (projet) projet)~3^4.0)~0.1) categoryPayloads:category1071
  categoryPayloads:category10055078 categoryPayloads:category10055405,
 
  *---*
 
  *parsedquery_toString*=+(title:chef de projet~3^4.0 | keywords:chef de
  projet^3.0 | organizationName:chef de projet | location:chef de projet |
  formattedDescription:chef de projet~3^2.0 | nafCodeText:chef de
  projet^2.0 | jobCodeText:chef de projet^3.0 | categoryPayloads:chef de
  projet~3 | labelLocation:chef de projet)~0.1 (title:((chef chef) de
  (projet) projet)~3^4.0)~0.1 categoryPayloads:category1071
  categoryPayloads:category10055078 categoryPayloads:category10055405,
 
 
 
  explain={},
 
  QParser=ExtendedDismaxQParser,altquerystring=null,
 
  *boost_queries*=[categoryPayloads:category1071^1,
  categoryPayloads:category10055078^1,
 categoryPayloads:category10055405^1],
 
  *parsed_boost_queries*=[categoryPayloads:category1071,
  categoryPayloads:category10055078, 

Re: delta-import of rich documents like word and pdf files!

2011-11-17 Thread neuron005
Sorry for disturbing you all... actually I had to use plong instead of type
string.
My problem is solved
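
For reference, the definition that ended up working looks roughly like this
(a sketch, assuming the example schema already defines the plong numeric type):

<field name="fileSize" type="plong" indexed="true" stored="true"
  required="false"/>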
Be ready for new thread
CHEERS

--
View this message in context: 
http://lucene.472066.n3.nabble.com/delta-import-of-rich-documents-like-word-and-pdf-files-tp3502039p3515711.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: ISO8601 Date format

2011-11-17 Thread Gora Mohanty
On Thu, Nov 17, 2011 at 6:06 PM, Gerke, Axel
axel.ge...@haufe-lexware.com wrote:
 Hello,

 due a different Bug in another system, we stored a date in a datefield
 with an value like 999-12-31T23:00:00Z. As you can see in the schema
 browser below, solr stores it correct with four digits but in a response
 the leading zero is missing.

 My question is: is a three digit year a valid ISO-8601 date format for
 the response or is this a bug? Because other languages (f.e. python) are
 throwing an exception with a three digit year?!

http://www.w3.org/TR/NOTE-datetime , and http://en.wikipedia.org/wiki/ISO_8601
seem to indicate that a four-digit year with leading zeroes is required. To
quote from the General principles section in the latter reference:
Each date and
time value has a fixed number of digits that must be padded with leading zeros.
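
In other words, the value above would be expected to come back zero-padded to
four digits, e.g.:

<date name="effective">0999-12-31T23:00:00Z</date>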

Regards,
Gora


FunctionQuery score=0

2011-11-17 Thread John
Hi,

I am using a function query that based on the query of the user gives a
score for the results I am presenting.

Some of the results are receiving score=0 in my function and I would like
them not to appear in the search results.

How can I achieve that?

Thanks in advance.


Re: strange behavior of scores and term proximity use

2011-11-17 Thread Erick Erickson
Hmmm, I'm not seeing similar behavior on a trunk from today, when did
you get your copy?

Erick

On Wed, Nov 16, 2011 at 2:06 PM, Ariel Zerbib ariel.zer...@gmail.com wrote:
 Hi,

 For this term proximity query: ab_main_title_l0:"to be or not to be"~1000

 http://localhost:/solr/select?q=ab_main_title_l0%3A%22og54ct8n+to+be+or+not+to+be+5w8ojsx2%22~1000&sort=score+desc&start=0&rows=3&fl=ab_main_title_l0%2Cscore%2Cid&debugQuery=true

 The first three results are the following:

 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader">
   <int name="status">0</int>
   <int name="QTime">5</int>
 </lst>
 <result name="response" numFound="318" start="0" maxScore="3.0814114">
   <doc>
     <long name="id">2315190010001021</long>
     <arr name="ab_main_title_l0">
       <str>og54ct8n To be or not to be a Jew. 5w8ojsx2</str>
     </arr>
     <float name="score">3.0814114</float></doc>
   <doc>
     <long name="id">2313006480001021</long>
     <arr name="ab_main_title_l0">
       <str>og54ct8n To be or not to be 5w8ojsx2</str>
     </arr>
     <float name="score">3.0814114</float></doc>
   <doc>
     <long name="id">2356410250001021</long>
     <arr name="ab_main_title_l0">
       <str>og54ct8n Rumspringa : to be or not to be Amish / 5w8ojsx2</str>
     </arr>
     <float name="score">3.0814114</float></doc>
 </result>
 <lst name="debug">
   <str name="rawquerystring">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
   <str name="querystring">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
   <str name="parsedquery">PhraseQuery(ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000)</str>
   <str name="parsedquery_toString">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
   <lst name="explain">
     <str name="2315190010001021">
 5.337161 = (MATCH) weight(ab_main_title_l0:og54ct8n to be or not to be
 5w8ojsx2~1000 in 378403) [DefaultSimilarity], result of:
  5.337161 = fieldWeight in 378403, product of:
    0.57735026 = tf(freq=0.3334), with freq of:
      0.3334 = phraseFreq=0.3334
    29.581549 = idf(), sum of:
      1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      4.3826413 = idf(docFreq=112108, maxDocs=3301436)
      6.3982043 = idf(docFreq=14937, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
    0.3125 = fieldNorm(doc=378403)
  </str>
     <str name="2313006480001021">
 9.244234 = (MATCH) weight(ab_main_title_l0:og54ct8n to be or not to be
 5w8ojsx2~1000 in 482807) [DefaultSimilarity], result of:
  9.244234 = fieldWeight in 482807, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = phraseFreq=1.0
    29.581549 = idf(), sum of:
      1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      4.3826413 = idf(docFreq=112108, maxDocs=3301436)
      6.3982043 = idf(docFreq=14937, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
    0.3125 = fieldNorm(doc=482807)
  </str>
     <str name="2356410250001021">
 5.337161 = (MATCH) weight(ab_main_title_l0:og54ct8n to be or not to be
 5w8ojsx2~1000 in 1317563) [DefaultSimilarity], result of:
  5.337161 = fieldWeight in 1317563, product of:
    0.57735026 = tf(freq=0.3334), with freq of:
      0.3334 = phraseFreq=0.3334
    29.581549 = idf(), sum of:
      1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      4.3826413 = idf(docFreq=112108, maxDocs=3301436)
      6.3982043 = idf(docFreq=14937, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
    0.3125 = fieldNorm(doc=1317563)
  </str>
 </response>

 The used version is a 4.0 October snapshot.

 I have 2 questions about the result:
 - Why are the debug scores and the scores in the result different?
 - What is the expected behavior of this kind of term proximity query?
          - The debug scores seem to be well ordered, but the result scores
 seem to be wrong.


 Thanks,
 Ariel



Re: Phrase between quotes with dismax edismax

2011-11-17 Thread Erick Erickson
OK, it looks like you're mixing fieldTypes. That is,
you have some string types, which are
completely unanalyzed, and some analyzed
fields. The analyzed fields have stopwords
removed at index time. Then it looks like
your query chain does NOT remove stopwords,
or some such.

So it's probably a schema issue. The admin/analysis
page will help you understand how the analysis chains
work.
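
As a rough illustration (the type name and stopword file here are made up, not
from your schema), the thing to check is that the index-time and query-time
chains treat stopwords the same way, e.g.:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- if stopwords like "de" are removed here... -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ...they should be removed here too, or a phrase like "chef de projet" won't match -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>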

I'd also recommend that you NOT use eDismax when
experimenting with analyzers; having requests distributed
across all those fields can be confusing. Certainly DO use
eDismax when you're working for real, or use the
fielded form of the queries, title:"chef de projet", just to
reduce the clutter of the output...

But you're on the right track
Best
Erick

On Thu, Nov 17, 2011 at 8:10 AM, Jean-Claude Dauphin
jc.daup...@gmail.com wrote:
 Thanks Erick for your prompt response.

 I am not sure but I think I found why the phrase chef de projet is not
 found by dismax and edismax.
 The following terms are indexed and can be seen with Luke:
   chef
   projet
   chef de projet
 When searching for the phrase chef de projet, the terms 'chef' and
 'projet' are found in the index but 'de' is not found. And thus no results.
 Please note that using standard Lucene QueryParser, it works well.

 This is just what I suspect, does it sounds correct??

 Best wishes,

 Jean-Claude

 On Wed, Nov 16, 2011 at 9:26 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Ah, ok I was mis-reading some things. So, let's ignore the
 category bits for now.

 Questions:
 1 Can you refine down the problem. That is,
    demonstrate this with a single field and leave out
    the category stuff. Something like
    q=title:chef de projet getting no results and
    q=title:chef projet getting results? The idea
    is to cycle through all the fields to see if we can
    hone in on the problem. I'd get rid of any pf
    parameters of your edismax definition too. I'm after
   the simplest case that can demonstrate the issue.
   For that matter, it'd be even easier if you could
   make this happen with the default searcher (
   solr/select?q=title:chef de projet
 2 if you can do 1, please post the field definitions
     from your schema.xml file. One possibility is that
     you are removing stopwords at index time but not
     query time or vice-versa, but that's a wild guess.
 3 Once you have a field, use the admin/analysis page
     to see the exact transformations that occur at index
     and query time to see if anything jumps out.

 All in all, I suspect you have a field that isn't being parsed
 as you expect at either index or query time, but as I said
 above, that's a guess.

 Best
 Erick

 On Wed, Nov 16, 2011 at 5:02 AM, Jean-Claude Dauphin
 jc.daup...@gmail.com wrote:
  Thanks Erick for yr quick answer.
 
  I am using Solr 3.1
 
  1) I have set the mm parameter to 0 and removed the categories from the
  search. Thus the query is only for chef de projet and nothing else.
  But the problem remains, i.e searching for chef de projet gives no
  results while searching for chef projet gives the right result.
 
  Here is an excerpt from the test I made:
 
  DISMAX query (q)=(chef de projet)
 
  =The Parameters=
 
  *queryResponse*=[{responseHeader={status=0,QTime=157,
 
  params={facet=true,
 
  f.createDate.facet.date.start=NOW/DAY-6DAYS,tie=0.1,
 
  facet.limit=4,
 
  f.location.facet.limit=3,
 
  *q.alt*=*:*,
 
  facet.date.other=all,
 
  hl=true,version=2,
 
  *bq*=[categoryPayloads:category1071^1,
  categoryPayloads:category10055078^1,
 categoryPayloads:category10055405^1],
 
  fl=*,score,
 
  debugQuery=true,
 
  facet.field=[soldProvisions, contractTypeText, nafCodeText, createDate,
  wage, keywords, labelLocation, jobCode, organizationName,
  requiredExperienceLevelText],
 
  *qs*=3,
 
  qt=edismax,
 
  facet.date.end=NOW/DAY,
 
  *mm*=0,
 
  facet.mincount=1,
 
  facet.date=createDate,
 
  *qf*= title^4.0 formattedDescription^2.0 nafCodeText^2.0 jobCodeText^3.0
  organizationName^1.0 keywords^3.0 location^1.0 labelLocation^1.0
  categoryPayloads^1.0,
 
  hl.fl=title,
 
  wt=javabin,
 
  rows=20,
 
  start=0,
 
  *q*=(chef de projet),
 
  facet.date.gap=+1DAY,
 
  *stopwords*=false,
 
  *ps*=3}},
 
  The Solr Response
  response={numFound=0
 
  Debug Info
 
  debug={
 
  *rawquerystring*=(chef de projet),
 
  *querystring*=(chef de projet),
 
  *---
  *
 
  *parsedquery*=
 
  +*DisjunctionMaxQuery*((title:chef de projet~3^4.0 | keywords:chef de
  projet^3.0 | organizationName:chef de projet | location:chef de projet |
  formattedDescription:chef de projet~3^2.0 | nafCodeText:chef de
  projet^2.0 | jobCodeText:chef de projet^3.0 | categoryPayloads:chef de
  projet~3 | labelLocation:chef de projet)~0.1)
  *DisjunctionMaxQuery*((title:((chef
  chef) de (projet) projet)~3^4.0)~0.1) categoryPayloads:category1071
  categoryPayloads:category10055078 categoryPayloads:category10055405,
 
  

What is the best approach to do reindexing on the fly?

2011-11-17 Thread erolagnab
Hi all,

I'm using Solr 3.2 with a DataImportHandler that periodically updates the index
every 5 minutes.
There's a housekeeping script running weekly which deletes some data in the
database.
I'd like to incorporate the reindexing strategy into this housekeeping
script by:
1. Locking the DataImportHandler (not allowing it to perform any update on the
index) by having a flag in the database; every time the scheduled job triggers,
it first checks the flag before performing an incremental index.
2. Running a separate Solr instance, pointing to the same index, and performing
a clean index.

Now, before coming to this setup, I had some options, but they didn't fit very
well:
1. Triggering reindexing directly in the running Solr instance - I wrap Solr
with our own authentication mechanism, and having reindexing cause a spike
in memory usage that affects the currently running apps (sitting in the same
J2EE container) is the last thing I want.
2. A master/slave setup - I think this is the proper way to do it, but
it is a long-term solution and we have a time constraint, so it won't
work for now.

For the selected strategy above, would searches be affected by the
reindexing from the 2nd Solr instance?
Do we need to tell Solr to pick up the new index once it's available?
Any better option that I can give a try?

Many thanks,

Ero

--
View this message in context: 
http://lucene.472066.n3.nabble.com/What-is-the-best-approach-to-do-reindexing-on-the-fly-tp3515948p3515948.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: FunctionQuery score=0

2011-11-17 Thread Andre Bois-Crettez

John wrote:

Some of the results are receiving score=0 in my function and I would like
them not to appear in the search results.
  

you can use frange, and filter by score:

q=ipod&fq={!frange l=0 incl=false}query($q)

--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/



Re: FunctionQuery score=0

2011-11-17 Thread John
Doesn't seem to work.
I thought that filter queries work before the search is performed and not
after... no?

Debug doesn't include filter query only the below (changed a bit):

BoostedQuery(boost(+fieldName:,boostedFunction(ord(fieldName),query)))


On Thu, Nov 17, 2011 at 5:04 PM, Andre Bois-Crettez
andre.b...@kelkoo.comwrote:

 John wrote:

 Some of the results are receiving score=0 in my function and I would like
 them not to appear in the search results.


 you can use frange, and filter by score:

  q=ipod&fq={!frange l=0 incl=false}query($q)

 --
 André Bois-Crettez

 Search technology, Kelkoo
 http://www.kelkoo.com/




RE: memory usage keep increase

2011-11-17 Thread Yongtao Liu
Erick,

Thanks for your reply.

Yes, virtual memory does not mean physical memory.
But when virtual memory > physical memory, the system will slow
down, since lots of paging requests happen.
Yongtao
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, November 15, 2011 8:37 AM
To: solr-user@lucene.apache.org
Subject: Re: memory usage keep increase

I'm pretty sure not. The words "virtual memory address space" are important
here; that's not physical memory...

Best
Erick

On Mon, Nov 14, 2011 at 11:55 AM, Yongtao Liu y...@commvault.com wrote:
 Hi all,

 I saw an issue where RAM usage keeps increasing when we run queries.
 After looking at the code, it looks like Lucene uses MMapDirectory to map index
 files into RAM.

 According to 
 http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/store/MMapDirectory.html
  comments, it will use lot of memory.
 NOTE: memory mapping uses up a portion of the virtual memory address space in 
 your process equal to the size of the file being mapped. Before using this 
 class, be sure your have plenty of virtual address space, e.g. by using a 64 
 bit JRE, or a 32 bit JRE with indexes that are guaranteed to fit within the 
 address space.

 So, my understanding is that Solr requires physical RAM = index file size; is
 that right?

 Yongtao




Doubts in Shards concept

2011-11-17 Thread mechravi25
Hi,

I have implemented the shards concept. After sending a request, this is what
is shown in the logs:
Nov 15, 2011 10:38:24 PM org.apache.solr.core.SolrCore execute
INFO: [core2] webapp=/solr path=/select
params={fl=uid,score&start=0&q=abc&isShard=true&wt=javabin&fsv=true&rows=1410&version=1}
hits=3396 status=0 QTime=2
Nov 15, 2011 10:38:24 PM org.apache.solr.core.SolrCore execute
INFO: [db] webapp=/solr path=/select/
params={indent=on&start=1400&q=abc&version=2.2&rows=10} status=0 QTime=58

In db, I give start=1400 and rows=10, and in core2 the request is
passed as start=0 and rows=1410.

While browsing I came across this URL,

https://issues.apache.org/jira/browse/SOLR-659

which gives the reason for this behaviour and a patch to pass the same
request to all the shards.

I need to know whether there is any other config file that can be changed
for this issue. The Solr version details I am using are:
Solr Specification Version: 1.4.0.2010.01.13.08.09.44 
 Solr Implementation Version: 1.5-dev exported - yonik - 2010-01-13 08:09:44 
 Lucene Specification Version: 2.9.1-dev 
 Lucene Implementation Version: 2.9.1-dev 888785 - 2009-12-09 18:03:31 

Please let me know whether this solr version contains this patch. 

This is how the query URL looks:

http://localhost:8080/solr/db/select?indent=on&version=2.2&q=typeFacet%3AABC&shards.start=1400&shards.rows=10&start=1400&rows=10

We noticed that the same request went to both shards and the underlying
server, but no document was fetched even though the count was returned
properly.
Please provide some suggestions.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Doubts-in-Shards-concept-tp3516135p3516135.html
Sent from the Solr - User mailing list archive at Nabble.com.


Migrating from Hibernate Search to Solr

2011-11-17 Thread Ari King
I'm considering migrating from Hibernate Search to Solr, but in order
to make that decision, I'd appreciate insight on the following:

1. How difficult is getting Solr up and running? With Hibernate I had
to annotate a few classes and setup a config file; so it was pretty
easy.
2. How can/should one secure Solr?
3. From what I've read, Solr can work with NoSql databases. The
question is how well and how involved is the setup?

Thanks.

-Ari


Re: Highlighting with a default copy field with EdgeNGramFilterFactory

2011-11-17 Thread João Nelas
I found out the solution!

I needed to also add an EdgeNGramFilterFactory to the fields that are the
source of the copyField.
That got the highlighting working again.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Highlighting-with-a-default-copy-field-with-EdgeNGramFilterFactory-tp3510374p3516166.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Erick Erickson
I guess my first question is what evidence you have that Solr is
unable to index fast enough? It's quite possible that your
database connection is the thing that's unable to process fast
enough.

That's certainly a guess, but unless your documents are
quite complex, 15 records/second isn't likely to cause Solr
problems. You might try to run a small Java program that
executes your database queries and see.

The other question I'd ask is if you're absolutely sure that
your delta-import query is correct? Is it possible that you're
re-indexing *everything* every time? There's an interactive
debugging console you can use that may help, try:
http://localhost:8983/solr/admin/dataimport.jsp

Best
Erick

On Thu, Nov 17, 2011 at 3:19 AM, Jak Akdemir jakde...@gmail.com wrote:
 Hi,

 I was trying to configure a Solr instance with the near real-time search
 and auto-complete capabilities. I stuck in the NRT feature. There are
 15 new records per second that inserted into the database (mysql) and I
 indexed them with DIH. First, I tried to manage autoCommits from
 solrconfig.xml with the configuration below.

  <autoCommit>
    <maxDocs>1</maxDocs>
    <maxTime>10</maxTime>
  </autoCommit>

  <autoSoftCommit>
    <maxDocs>15</maxDocs>
    <maxTime>1000</maxTime>
  </autoSoftCommit>

 And the bash script below responsible for getting delta's without
 committing.

  while [ 1 ]; do
  wget -O /dev/null 'http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false' 2>/dev/null
  sleep 1
  done

  Then I run my query from the browser:
  http://localhost:8080/solr-jak/select?q=movie_name_prefix_full:dogville&defType=lucene&q.op=OR

 But I realized that, with this configuration index files are changing every
 second and after a minute there are only 600 new records in Solr index
 while 900 new records in the database.
 After experienced that, I removed autoCommit and autoSoftCommit elements in
 solrconfig.xml And updated my bashscript as follows. But still index files
 are changing and solr can not syncronized with database.

  while [ 1 ]; do
  echo "Soft commit applied!"
  wget -O /dev/null 'http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false' 2>/dev/null
  curl http://localhost:8080/solr-jak/update -H 'Content-Type: text/xml' \
    --data-binary '<commit softCommit="true" waitFlush="false" waitSearcher="false"/>' 2>/dev/null
  sleep 3
  done

 Even I decreased the pressure on Solr as 1 new record per sec. and soft
 commits within 6 sec. still there is a gap between index and db. Is there
 anything that I missed? I took a look to /get too, but it is working only
 for pk. If there is an example configuration list (like 1 sec for soft
 commit and 10 min for hard commit) as a best practice it would be great.

 Finally, here is my configuration.
 Ubuntu 11.04
 JDK 1.6.0_27
 Tomcat 7.0.21
 Solr 4.0 2011-10-24_08-53-02

 All advices are appreciated,

 Best Regards,

 Jak



Re: Migrating from Hibernate Search to Solr

2011-11-17 Thread Erik Hatcher

On Nov 17, 2011, at 10:38 , Ari King wrote:

 I'm considering migrating from Hibernate Search to Solr, but in order
 to make that decision, I'd appreciate insight on the following:
 
 1. How difficult is getting Solr up and running? With Hibernate I had
 to annotate a few classes and setup a config file; so it was pretty
 easy.

So no Hibernate/Solr glue out there already?   It'd be nice if you could use 
Hibernate as you do, but instead of working with the Lucene API directly it 
would use SolrJ.   If this type of glue doesn't already exist, then that'd be 
the first step I think.

Otherwise, you could use Solr directly, but you'll likely be unhappy with the 
disconnect compared to what you're used to.  SolrJ supports annotations, but 
not to the degree that Hibernate does, and even so you'd be left to create an 
indexer and to wire in updates/deletes as well.

 2. How can/should one secure Solr?

Secure it from what?  Being secure is relative, depends on what you're trying 
to protect from.  In general, no security needs to be applied directly to Solr, 
but certainly protect it behind a firewall and even block all IP access except 
to your application.

 3. From what I've read, Solr can work with NoSql databases. The
 question is how well and how involved is the setup?

I imagine the specific nosql db's have their own Solr integration glue.  But in 
general, it's pretty trivial to iterate over a collection of objects and send 
them over to Solr in one way or another.

Erik





Highlighting and regex

2011-11-17 Thread Peter Sturge
Hi,

Been wrestling with a question on highlighting (or not) - perhaps
someone can help?

The question is this:
Is it possible, using highlighting or perhaps another more suited
component, to return words/tokens from a stored field based on a
regular expression's capture groups?

What I was kind of thinking would happen with highlighting regex
(hl.regex.pattern) - but doesn't seem to (although I am a highlighting
novice), is that capture groups specified in a regex would be
highlighted.

For example:
1) given a field called
desc

2) with a stored value of:
the quick brown fox jumps over the lazy dog

3) specify a regex of:
   .*quick\s(\S+)\sfox.+\sthe\s(\S+)\sdog.*

4) get in the response:
  <em>brown</em> and
  <em>lazy</em>
either as highlighting or through some other means.

(I find that using hl.regex.pattern on the above yields: <em>the quick
brown fox jumps over the lazy dog</em>)

I'm guessing that I'm misinterpreting the functionality offered by
highlighting, but I couldn't find much on the subject in the way of
usage docs.

I could write a custom highlighter or SearchComponent plugin that
would do this, but is there some mechanism out there that can do this
sort of thing already?
It wouldn't necessarily have to be based on regex, but regex tends to
be the de-facto standard for doing capture group token matching (not
sure how Solr syntax would do something similar unless there were
multiples, maybe?).

Any insights greatly appreciated.

Many thanks,
Peter


Re: Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Jak Akdemir
Erick,

Thank you for your response,

1) I tried 2 new records per second (the records have only 5 fields in one
table), and with a 6 sec interval too. It should be quite easy for MySQL. But I
will check query responses per second as you suggested.

2) I am sure the delta queries are configured well. A full import completes
in 40 secs for 40 docs, and deltas take 1 sec for 15 new records.
Also, I checked it; there is no problem there.

A couple of pieces of evidence that drove me to think this is a configuration
problem are:
1- Index files are changing every second.
2- After a server restart the last query results are preserved. (In NRT they
would disappear, right?)

Please correct me if you see any problem in the steps I applied for NRT.

Additional specs,
32 bit OS
4 core i7-2630QM CPU @ 2.00GHz
6 GB memory

Bests,

Jak

On Thu, Nov 17, 2011 at 10:44 AM, Erick Erickson erickerick...@gmail.comwrote:

 I guess my first question is what evidence you have that Solr is
 unable to index fast enough? It's quite possible that your
 database connection is the thing that's unable to process fast
 enough.

 That's certainly a guess, but unless your documents are
 quite complex, 15 records/second isn't likely to cause Solr
 problems. You might try to run a small Java program that
 executes your database queries and see.

 The other question I'd ask is if you're absolutely sure that
 your delta-import query is correct? Is it possible that you're
 re-indexing *everything* every time? There's an interactive
 debugging console you can use that may help, try:
 http://localhost:8983/solr/admin/dataimport.jsp

 Best
 Erick

 On Thu, Nov 17, 2011 at 3:19 AM, Jak Akdemir jakde...@gmail.com wrote:
  Hi,
 
  I was trying to configure a Solr instance with the near real-time search
  and auto-complete capabilities. I stuck in the NRT feature. There are
  15 new records per second that inserted into the database (mysql) and I
  indexed them with DIH. First, I tried to manage autoCommits from
  solrconfig.xml with the configuration below.
 
  autoCommit
  maxDocs1/maxDocs
  maxTime10/maxTime
/autoCommit
 
  autoSoftCommit
  maxDocs15/maxDocs
  maxTime1000/maxTime
  /autoSoftCommit
 
  And the bash script below responsible for getting delta's without
  committing.
 
  while [ 1 ]; do
  wget -O /dev/null '
 
 http://localhost:8080/solr-jak/dataimport?command=delta-importcommit=false
 '
  2/dev/null
  sleep 1
  done
 
  Then I run my query from browser
  http://localhost:8080/solr-jak/select?q=movie_name_prefix_full
 :dogvilledefType=luceneq.op=OR
 http://localhost:8080/solr-sprongo/select?q=movie_name_prefix_full:%221398%22defType=luceneq.op=OR
 
 
  But I realized that, with this configuration index files are changing
 every
  second and after a minute there are only 600 new records in Solr index
  while 900 new records in the database.
  After experienced that, I removed autoCommit and autoSoftCommit elements
 in
  solrconfig.xml And updated my bashscript as follows. But still index
 files
  are changing and solr can not syncronized with database.
 
  while [ 1 ]; do
  echo Soft commit applied!
  wget -O /dev/null '
 
 http://localhost:8080/solr-jak/dataimport?command=delta-importcommit=false
 '
  2/dev/null
  curl http://localhost:8080/solr-jak/update -H Content-Type: text/xml
  --data-binary 'commit softCommit=true waitFlush=false
  waitSearcher=false/' 2/dev/null
  sleep 3
  done
 
  Even I decreased the pressure on Solr as 1 new record per sec. and soft
  commits within 6 sec. still there is a gap between index and db. Is there
  anything that I missed? I took a look to /get too, but it is working
 only
  for pk. If there is an example configuration list (like 1 sec for soft
  commit and 10 min for hard commit) as a best practice it would be great.
 
  Finally, here is my configuration.
  Ubuntu 11.04
  JDK 1.6.0_27
  Tomcat 7.0.21
  Solr 4.0 2011-10-24_08-53-02
 
  All advices are appreciated,
 
  Best Regards,
 
  Jak
 



Re: Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Yonik Seeley
On Thu, Nov 17, 2011 at 11:48 AM, Jak Akdemir jakde...@gmail.com wrote:
 2) I am sure about delta-queries configured well. Full-Import is completed
 in 40 secs for 40 docs. And delta's are in 1 sec for 15 new records.
 Also I checked it. There is no problem in it.

That's 10,000 docs/sec.  If you configure a soft commit for every 15
documents, that means solr is trying to do 666 commits/sec.
Autocommit by number of docs rarely makes sense anymore - I'd suggest
configuring both soft and hard commits based on time only.
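
For example, a time-only setup along the lines asked about earlier (the values
are purely illustrative, e.g. 1 sec soft commit and 10 min hard commit):

<autoCommit>
  <maxTime>600000</maxTime>   <!-- hard commit every 10 minutes -->
</autoCommit>

<autoSoftCommit>
  <maxTime>1000</maxTime>     <!-- soft commit every second -->
</autoSoftCommit>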

-Yonik
http://www.lucidimagination.com


Implications of setting catenateAll=1

2011-11-17 Thread Brendan Grainger
Hi,

The default for catenateAll is 0, which is what we've been using on the
WordDelimiterFilter. What would be the possible negative implications of
setting this to 1? So that:

wi-fi-800

would produce the tokens:

wi, fi, wifi, 800, wifi800

for example? 
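
For context, a sketch of the filter settings this implies (the other flags are
assumptions about a typical setup, not our exact config; catenateWords="1" is
what produces "wifi"):

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateAll="1"/>
<!-- wi-fi-800 -> wi, fi, 800 (generate*), wifi (catenateWords), wifi800 (catenateAll) -->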

Thanks

Re: ISO8601 Date format

2011-11-17 Thread Chris Hostetter

: My question is: is a three digit year a valid ISO-8601 date format for
: the response or is this a bug? Because other languages (f.e. python) are
: throwing an exception with a three digit year?!

There are some known bugs with esoteric years, but i think the one that's 
burning you here has been fixed in the 3x branch and will be included in 
3.5...

https://issues.apache.org/jira/browse/SOLR-2772


-Hoss


Re: Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Jak Akdemir
Yonik,

I updated my solrconfig time based only as follows.
<autoCommit>
  <maxTime>30</maxTime>
</autoCommit>

<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

And changed my soft commit script to the first case.
while [ 1 ]; do
echo "Soft commit applied!"
wget -O /dev/null 'http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false' 2>/dev/null
sleep 1
done

After full-import, I inserted 420 new records in a minute (7 new records
per second) and soft committed every second, as configured in solrconfig.xml.
It seems that, after all this, Solr can return only 326 of these 420 new records.
Index files should not change every second, is that right? (After inserting the
420 records, if I call delta-import with commit=true, all of these records can
be seen in the Solr results.)

Thanks,

Jak

On Thu, Nov 17, 2011 at 12:14 PM, Yonik Seeley
yo...@lucidimagination.com wrote:

 On Thu, Nov 17, 2011 at 11:48 AM, Jak Akdemir jakde...@gmail.com wrote:
  2) I am sure about delta-queries configured well. Full-Import is
 completed
  in 40 secs for 400,000 docs. And delta's are in 1 sec for 15 new records.
  Also I checked it. There is no problem in it.

 That's 10,000 docs/sec.  If you configure a soft commit for every 15
 documents, that means solr is trying to do 666 commits/sec.
 Autocommit by number of docs rarely makes sense anymore - I'd suggest
 configuring both soft and hard commits based on time only.

 -Yonik
 http://www.lucidimagination.com



Re: Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Erick Erickson
Hmmm. It is suspicious that your index files change every
second. If you change your cron task to update every 10
seconds, do the index files change every 10 seconds?

Regarding your question about
"After a server restart the last query results are preserved. (In NRT they would
disappear, right?)"
not necessarily. If your autoCommit interval is exceeded, the soft commits
will have been committed to disk, so Solr would pick them up after the restart.

But if somehow you're getting a hard commit to happen every second, you should
also be seeing a lot of segment merging going on, are you?

I think I'd stop the cron job and execute this manually for a while in
order to see exactly
where the problem is. I'd go ahead and comment out the autoCommit section
as well. That should give you a much more reproducible test scenario.

Say you do that, issue your delta-import, and immediately kill your
server. If you then see the delta data when it starts up, we should try to
understand why, because it sure would seem like commit=false isn't doing
what you expect.

Erick

On Thu, Nov 17, 2011 at 12:41 PM, Jak Akdemir jakde...@gmail.com wrote:
 Yonik,

 I updated my solrconfig time based only as follows.
 <autoCommit>
         <maxTime>30</maxTime>
       </autoCommit>

 <autoSoftCommit>
         <maxTime>1000</maxTime>
       </autoSoftCommit>

 And changed my soft commit script to the first case.
 while [ 1 ]; do
 echo Soft commit applied!
 wget -O /dev/null '
 http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false'
 2>/dev/null
 sleep 1
 done

 After full-import,  I inserted 420 new records in a minute. (7 new records
 per second)  And softCommitted every second as we can see in solrconfig.xml.
 It seems that after all solr can return only 326 of these new 420 records.
 Index files should not change every second, is it true? (After inserting
 420 records if I call delta-import with commit true, all these records can
 be seen in solr results)

 Thanks,

 Jak

 On Thu, Nov 17, 2011 at 12:14 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:

 On Thu, Nov 17, 2011 at 11:48 AM, Jak Akdemir jakde...@gmail.com wrote:
  2) I am sure about delta-queries configured well. Full-Import is
 completed
  in 40 secs for 400,000 docs. And delta's are in 1 sec for 15 new records.
  Also I checked it. There is no problem in it.

 That's 10,000 docs/sec.  If you configure a soft commit for every 15
 documents, that means solr is trying to do 666 commits/sec.
 Autocommit by number of docs rarely makes sense anymore - I'd suggest
 configuring both soft and hard commits based on time only.

 -Yonik
 http://www.lucidimagination.com




Re: Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Jak Akdemir
1- There is an improvement on the issue. I added a 10 second time interval
into the delta query of data-config.xml, which will cover records that were
already indexed:
revision_time > DATE_SUB('${dataimporter.last_index_time}', INTERVAL 10
SECOND);
In this case 1369 new records were inserted at a rate of 7 records per second.
The Solr response shows all 1369 new records successfully.

2-
If I update the bash script to sleep 10 seconds and autoSoftCommit to 1 sec,
index files are updated every 10 seconds.
If I update autoSoftCommit to 10 seconds and the bash script to sleep 10 sec,
index files are updated every 10 seconds.
In the index folder after each update, I see that segment/index files are
changing.
I restarted the server before it fell into the autoCommit interval. The deltas
are still in the result list.
Here is my solrconfig.
<autoCommit>
  <maxTime>30</maxTime>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

4- I commented out the autoCommit part. The index files are still changing.
<!--
<autoCommit>
  <maxTime>30</maxTime>
</autoCommit>
-->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

I did not modify the request part in any of these cases.
wget -O /dev/null '
http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false'
2>/dev/null
#curl http://localhost:8080/solr-jak/update -H "Content-Type: text/xml"
--data-binary '<commit softCommit="true" waitFlush="false"
waitSearcher="false"/>' 2>/dev/null

Erick, as you mentioned, I believe that commit=false is not working
properly. If you need any more information, I can provide it.
Thank you all for your quick responses and advice.

Bests,

Jak


On Thu, Nov 17, 2011 at 1:34 PM, Erick Erickson erickerick...@gmail.com wrote:

 Hmmm. It is suspicious that your index files change every
 second. If you change our cron task to update every 10
 seconds, do the index files change every 10 seconds?

 Regarding your question about
 After a server restart last query results reserved. (In NRT they would
 disappear, right?)
 not necessarily. If your autoCommit interval is exceeded, the soft
 commits
 will be committed to disk so your Solr restart would pick them up after
 restart.

 But if somehow you're getting a hard commit to happen every second, you
 should
 also be seeing a lot of segment merging going on, are you?

 I think I'd stop the cron job and execute this manually for a while in
 order to see exactly
 where the problem is. I'd go ahead and comment out the autoCommit section
 as well. That should give you a much more reproducible test scenario.

 Say you do that, issue your delta-import and immediately kill your
 server. When it
 starts up if you then see the delta-data, we should understand why.
 Because it sure
 would seem like the commit=false isn't doing what you expect.

 Erick

 On Thu, Nov 17, 2011 at 12:41 PM, Jak Akdemir jakde...@gmail.com wrote:
  Yonik,
 
  I updated my solrconfig time based only as follows.
  <autoCommit>
  <maxTime>30</maxTime>
    </autoCommit>
 
  <autoSoftCommit>
  <maxTime>1000</maxTime>
    </autoSoftCommit>
 
  And changed my soft commit script to the first case.
  while [ 1 ]; do
  echo Soft commit applied!
  wget -O /dev/null '
 
 http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false
 '
  2>/dev/null
  sleep 1
  done
 
  After full-import,  I inserted 420 new records in a minute. (7 new
 records
  per second)  And softCommitted every second as we can see in
 solrconfig.xml.
  It seems that after all solr can return only 326 of these new 420
 records.
  Index files should not change every second, is it true? (After inserting
  420 records if I call delta-import with commit true, all these records
 can
  be seen in solr results)
 
  Thanks,
 
  Jak
 
  On Thu, Nov 17, 2011 at 12:14 PM, Yonik Seeley
  yo...@lucidimagination.com wrote:
 
  On Thu, Nov 17, 2011 at 11:48 AM, Jak Akdemir jakde...@gmail.com
 wrote:
   2) I am sure about delta-queries configured well. Full-Import is
  completed
   in 40 secs for 400,000 docs. And delta's are in 1 sec for 15 new
 records.
   Also I checked it. There is no problem in it.
 
  That's 10,000 docs/sec.  If you configure a soft commit for every 15
  documents, that means solr is trying to do 666 commits/sec.
  Autocommit by number of docs rarely makes sense anymore - I'd suggest
  configuring both soft and hard commits based on time only.
 
  -Yonik
  http://www.lucidimagination.com
 
 



Re: Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Yonik Seeley
On Thu, Nov 17, 2011 at 1:34 PM, Erick Erickson erickerick...@gmail.com wrote:
 Hmmm. It is suspicious that your index files change every
 second.

Why is this suspicious?
A soft commit still writes out some files currently... it just doesn't
fsync them.

-Yonik
http://www.lucidimagination.com


Boosting is slow

2011-11-17 Thread Brian Lamb
Hi all,

I have about 20 million records in my solr index. I'm running into a
problem now where doing a boost drastically slows down my search
application. A typical query for me looks something like:

http://localhost:8983/solr/mycore/search/?q=test {!boost
b=product(sum(log(sum(myfield,1)),1),recip(ms(NOW,mydate_field),3.16e-11,1,8))}

I've tried several variations on the boost to see if that was the problem
but even when doing something simple like:

http://localhost:8983/solr/mycore/search/?q=test {!boost b=2}

it is still really slow. Is there a different approach I should be taking?

Thanks,

Brian Lamb


Re: Boosting is slow

2011-11-17 Thread Brian Lamb
Sorry, the query is actually:

http://localhost:8983/solr/mycore/search/?q=test{!boost
b=product(sum(log(sum(myfield,1)),1),recip(ms(NOW,mydate_field),3.16e-11,1,8))}&start=&sort=score+desc,mydate_field+desc&wt=xslt&tr=mysite.xsl

On Thu, Nov 17, 2011 at 2:59 PM, Brian Lamb
brian.l...@journalexperts.com wrote:

 Hi all,

 I have about 20 million records in my solr index. I'm running into a
 problem now where doing a boost drastically slows down my search
 application. A typical query for me looks something like:

 http://localhost:8983/solr/mycore/search/?q=test {!boost
 b=product(sum(log(sum(myfield,1)),1),recip(ms(NOW,mydate_field),3.16e-11,1,8))}

 I've tried several variations on the boost to see if that was the problem
 but even when doing something simple like:

 http://localhost:8983/solr/mycore/search/?q=test {!boost b=2}

 it is still really slow. Is there a different approach I should be taking?

 Thanks,

 Brian Lamb




Re: Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Jak Akdemir
Yonik,

Is it ok to see soft committed records after server restart, too? If it is,
there is no problem left at all.
I added the changing files and 1 sec of log output at the end of the e-mail. One
significant line says softCommit=true, so Solr recognizes our softCommit
request.
INFO: start
commit(optimize=false,waitSearcher=true,expungeDeletes=false,softCommit=true)

I want to fix just a little typo from my last e-mail.
... autosoftcommit to 10 seconds and bashscript to sleep 10 sec, index
files are ...
should be ... autosoftcommit to 10 seconds and bashscript to sleep *1 sec*,
index files are ...

Jak

___
jak@jak:/usr/java/solr4/data$ ls index/
_11_0.frq  _11.nrm_14.fdx_16_0.tip  _1b_0.prx  _1b.per_1c.fnm
 _1d.fdt_1e_0.tim  _1f.fdt   _l.fdt_r_0.tim  segments_2_t.fdx
_11_0.prx  _11.per_14.fnm_16.fdt_1b_0.tim  _1c_0.frq  _1c.nrm
 _1d.fdx_1e_0.tip  _1f.fdx   _l.fdx_r_0.tip  segments.gen  _t.fnm
_11_0.tim  _14_0.frq  _14.nrm_16.fdx_1b_0.tip  _1c_0.prx  _1c.per
 _1d.fnm_1e.fdt_1.fnx_l.fnm_r.fdt_t_0.frq  _t.nrm
_11_0.tip  _14_0.prx  _14.per_16.fnm_1b.fdt_1c_0.tim  _1d_0.frq
 _1d.nrm_1e.fdx_l_0.frq  _l.nrm_r.fdx_t_0.prx  _t.per
_11.fdt_14_0.tim  _16_0.frq  _16.nrm_1b.fdx_1c_0.tip  _1d_0.prx
 _1d.per_1e.fnm_l_0.prx  _l.per_r.fnm_t_0.tim
 write.lock
_11.fdx_14_0.tip  _16_0.prx  _16.per_1b.fnm_1c.fdt_1d_0.tim
 _1e_0.frq  _1e.nrm_l_0.tim  _r_0.frq  _r.nrm_t_0.tip
_11.fnm_14.fdt_16_0.tim  _1b_0.frq  _1b.nrm_1c.fdx_1d_0.tip
 _1e_0.prx  _1e.per_l_0.tip  _r_0.prx  _r.per_t.fdt
jak@jak:/usr/java/solr4/data$ ls index/
_11_0.frq  _11.nrm_14.fdx_16_0.tip  _1b_0.prx  _1b.per_1c.fnm
 _1d.fdt_1e_0.tim  _1f_0.frq  _1f.nrm   _l.fdt_r_0.tim  segments_2
   _t.fdx
_11_0.prx  _11.per_14.fnm_16.fdt_1b_0.tim  _1c_0.frq  _1c.nrm
 _1d.fdx_1e_0.tip  _1f_0.prx  _1.fnx_l.fdx_r_0.tip
 segments.gen  _t.fnm
_11_0.tim  _14_0.frq  _14.nrm_16.fdx_1b_0.tip  _1c_0.prx  _1c.per
 _1d.fnm_1e.fdt_1f_0.tim  _1f.per   _l.fnm_r.fdt_t_0.frq
   _t.nrm
_11_0.tip  _14_0.prx  _14.per_16.fnm_1b.fdt_1c_0.tim  _1d_0.frq
 _1d.nrm_1e.fdx_1f_0.tip  _l_0.frq  _l.nrm_r.fdx_t_0.prx
   _t.per
_11.fdt_14_0.tim  _16_0.frq  _16.nrm_1b.fdx_1c_0.tip  _1d_0.prx
 _1d.per_1e.fnm_1f.fdt_l_0.prx  _l.per_r.fnm_t_0.tim
   write.lock
_11.fdx_14_0.tip  _16_0.prx  _16.per_1b.fnm_1c.fdt_1d_0.tim
 _1e_0.frq  _1e.nrm_1f.fdx_l_0.tim  _r_0.frq  _r.nrm_t_0.tip
_11.fnm_14.fdt_16_0.tim  _1b_0.frq  _1b.nrm_1c.fdx_1d_0.tip
 _1e_0.prx  _1e.per_1f.fnm_l_0.tip  _r_0.prx  _r.per_t.fdt

___
Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DataImporter
doDeltaImport
INFO: Starting Delta Import
Nov 17, 2011 2:55:17 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr-jak path=/dataimport
params={commit=false&command=delta-import} status=0 QTime=0
Nov 17, 2011 2:55:17 PM
org.apache.solr.handler.dataimport.SimplePropertiesWriter
readIndexerProperties
INFO: Read dataimport.properties
Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DocBuilder
doDelta
INFO: Starting delta collection.
Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Running ModifiedRowKey() for Entity: movie
Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.JdbcDataSource$1
call
INFO: Creating a connection for entity movie with URL:
jdbc:mysql://localhost/imdb
Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.JdbcDataSource$1
call
INFO: Time taken for getConnection(): 8
Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed ModifiedRowKey for Entity: movie rows obtained : 147
Nov 17, 2011 2:55:17 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start
commit(optimize=false,waitSearcher=true,expungeDeletes=false,softCommit=true)
Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed DeletedRowKey for Entity: movie rows obtained : 0
Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed parentDeltaQuery for Entity: movie
Nov 17, 2011 2:55:17 PM org.apache.solr.search.SolrIndexSearcher init
INFO: Opening Searcher@1520a8e main
Nov 17, 2011 2:55:17 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
Nov 17, 2011 2:55:17 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming
Searcher@1520a8e main{DirectoryReader(segments_2:1321559475026:nrt
_k(4.0):C388607
_50(4.0):C526/132 _3q(4.0):C444/141 _43(4.0):C450/126 _4r(4.0):C470/125
_4e(4.0):C456/135 _3f(4.0):C428/133 _51(4.0):C132/126 

Re: Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Yonik Seeley
On Thu, Nov 17, 2011 at 3:56 PM, Jak Akdemir jakde...@gmail.com wrote:
 Is it ok to see soft committed records after server restart, too?

Yes... we currently have Jetty configured to call some cleanups on
exit (such as closing the index writer).

-Yonik
http://www.lucidimagination.com


Re: Solr Near Real-Time Search, Soft Commit Problem

2011-11-17 Thread Jak Akdemir
This is great! I guess, there is nothing left to worry about for a while.
Erick & Yonik, thank you again for your great responses.

Bests,

Jak

On Thu, Nov 17, 2011 at 4:01 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Thu, Nov 17, 2011 at 3:56 PM, Jak Akdemir jakde...@gmail.com wrote:
  Is it ok to see soft committed records after server restart, too?

 Yes... we currently have Jetty configured to call some cleanups on
 exit (such as closing the index writer).

 -Yonik
 http://www.lucidimagination.com



Re: Multiple solr webapps

2011-11-17 Thread Chris Hostetter

: According to solr wiki, an instruction to use single war file and
: multiple context files (solr[1-2].xml).
...
: I wonder why following structure is not enough.  I think this is
: the simplest way (disk space is a bit more necessary, of course):

...there's nothing stopping you from actually cloning the entire webapp, 
but there is also no good reason for it.  you still have to use 
something like JNDI to configure the individual webapp instances to know 
what solr home dir to use.

: I noticed this caution on the wiki page:
: 
:  Don't put anything related to Solr under the webapps directory.
: 
: Can someone tell me why don't put anything related to solr under
: the webapps?  Is this the reason why single war file configuration
: is recommended ?

Because tomcat does bad things if you have both a webapps/foo/ (or a 
webapps/foo.war) and a context file named foo.xml.  i don't remember 
what exactly the problem is, but they are intended to be mutually 
exclusive -- ie: either you use the context file and point to the war 
outside of the webapps dir, or you use the webapps dir -- not both.


-Hoss


Re: two word phrase search using dismax

2011-11-17 Thread Chris Hostetter

: After putting the same score for title and content in qf filed, docs 
: with both words in content moved to fifth place. The doc in the first, 
: third and fourth places still have only one of the words in content and 
: title. The doc in the second place has one of the words in title and 
: both words in the content but in different places not together.

details matter -- if you send further followup mails the full details of 
your dismax options and the score explanations for debugQuery are 
necessary to be sure people understand what you are describing (a 
snapshot of reality is far more valuable than a vague description of 
reality)

off hand what you are describing sounds correct -- this is what the 
dismax parser is really designed to do.

even if you have given both title and content equal boosts, your title 
field is probably shorter than your content field, so words matching once 
in title are likely to score higher than the same word matching once in 
content due to length normalization -- and unless you set the tie param 
to something really high, the score contribution from the highest scoring 
field (in this case title) will be the dominant factor in the score (it's 
disjunction *max* by default ... if you make tie=1 then it's disjunction 
*sum*)

you haven't mentioned anything about the pf param at all which i can 
only assume means you aren't using it -- the pf param is how you configure 
that scores should be increased if/when all of the words in the query 
string appear together.  I would suggest putting all of the fields in your 
qf param in your pf param as well.


-Hoss
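
For illustration only, here is a rough SolrJ 3.x sketch of the qf/pf/tie setup
described above. The core URL, field names, and weights are invented for the
example, not taken from the thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DismaxExample {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("two words");
    q.set("defType", "dismax");
    // search both fields with equal weight
    q.set("qf", "title content");
    // reward documents where all query words appear together in either field
    q.set("pf", "title content");
    // let the lower-scoring field contribute a little instead of pure "max"
    q.set("tie", "0.1");
    q.set("debugQuery", "true");

    QueryResponse rsp = server.query(q);
    System.out.println(rsp.getResults().getNumFound());
  }
}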


Re: FunctionQuery score=0

2011-11-17 Thread Chris Hostetter

: I am using a function query that based on the query of the user gives a
: score for the results I am presenting.

please be specific -- it's not at all clear what the structure of your 
query is, and the details matter.

: Some of the results are receiving score=0 in my function and I would like
: them not to appear in the search results.

this sounds expected given how functions work: by definition they match 
all documents, even if one of the inputs to the function is a query that 
only matches some documents.

you either need to use that query as a filter to constrain the set of 
documents returned, or you need to restructure your main query using 
something like the {!boost} parser (which only matches documents from its 
nested query).

if you give us an actual example of what you are doing, we can give you 
suggestions on how to change it to achieve what you want.

-Hoss
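
As a rough illustration of the two options above in SolrJ 3.x terms (the
field name "popularity" and the query "text:solr" are made up for the example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ScoreZeroExample {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Option 1: {!boost} -- only documents matching the nested query
    // (text:solr) are returned; the function just scales their scores.
    SolrQuery boosted = new SolrQuery("{!boost b=log(popularity)}text:solr");
    System.out.println(server.query(boosted).getResults().getNumFound());

    // Option 2: score by the function, but add the real query as a filter
    // so documents that don't match it are excluded from the results.
    SolrQuery filtered = new SolrQuery("{!func}log(popularity)");
    filtered.addFilterQuery("text:solr");
    System.out.println(server.query(filtered).getResults().getNumFound());
  }
}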


Re: Solr Master High Availability

2011-11-17 Thread Erick Erickson
Look at the repeater setup on the replication page, and instead of
repeater, think backup master. But you don't really need to even
do this. You can simply provision yourself an extra slave. Now, if your
master goes south, you can reconfigure any slave as the new master by
just putting the configuration file you used for the master on it and
pointing the remaining slaves at the new master. Provision another
slave and point it at the new master and you're right back where you
started.

But you have one other worry. What if your master goes south in such a
way that the index is unusable? Solr/Lucene have a lot of safeguards
built in to prevent this, but

You can consider setting up the replicator mentioned above with a
deletion policy that keeps, say, 1 or 2 copies of the old index
around. Then only replicating, say, every day. That way, you have a
couple of days to notice the problem and a viable index to use again.

Under any circumstances, you need to create a mechanism whereby you
can re-index from a known good point. At the very least, if your
master goes down you may have uncommitted documents. Even if you have
all your documents committed, you still have to worry about the
polling interval to your backup master. So you should be ready to
re-index from the last known good point. But assuming you have a
uniqueKey defined, there's no problem with re-indexing documents
already in the index, the old copy will just be replaced.

Best
Erick

On Thu, Nov 17, 2011 at 7:30 AM, KARHU Toni
toni.ka...@ext.oami.europa.eu wrote:
 Hi, im looking into High availability SOLR master configurations. Does 
 anybody have a good solution to this the things im looking into are:

 *       Using SOLR replication to keep a second backup master.
 *       Indexing in a separate machine(s), problem being here that the index 
 will be different from the other machine needing a full replication to all 
 slaves in case of failure to first master.
 *       Having the whole setup replicated to another machine which is then 
 used as a master machine if primary master fails?

 Any more ideas/experiences?

 Toni

 **
 IMPORTANT: This message is intended exclusively for information purposes. It 
 cannot be considered as an
 official OHIM communication concerning procedures laid down in the Community 
 Trade Mark Regulations
 and Designs Regulations. It is therefore not legally binding on the OHIM for 
 the purpose of those procedures.
 The information contained in this message and attachments is intended solely 
 for the attention and use of the
 named addressee and may be confidential. If you are not the intended 
 recipient, you are reminded that the
 information remains the property of the sender. You must not use, disclose, 
 distribute, copy, print or rely on this
 e-mail. If you have received this message in error, please contact the sender 
 immediately and irrevocably
 delete or destroy this message and any copies.

 **


Re: What is the best approach to do reindexing on the fly?

2011-11-17 Thread Erick Erickson
Hmmm, the master/slave setup takes about a day to get completely
running assuming that you don't have any experience to start with,
so you may be able to fit that in your schedule. Otherwise, you won't
be able to avoid the memory and CPU spikes.

But there's another option. It's actually quite easy to write a SolrJ
program that you can do anything you want in, including examining
your tables for locking.

But there's also another option. Create a trigger on your tables
that inserts whatever you use to create Solr's uniqueKey into a
modified table. Have your SolrJ program simply query that table
and delete/update as required to keep the single index in sync with
the database.

Of course, all that depends on how long it takes to re-index from scratch.
If it's reasonably quick, perhaps simply re-indexing at 3:00 AM (or whatever)
would work

Best
Erick
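
For illustration, a very rough SolrJ sketch of the trigger-table approach
described above. The Solr URL, database URL, table, and column names are all
hypothetical, and error handling is omitted:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TriggerTableSync {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    Class.forName("com.mysql.jdbc.Driver");
    Connection db =
        DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");

    // "modified_keys" is the (hypothetical) table the database trigger writes
    // uniqueKey values into; "deleted" marks rows removed by housekeeping.
    Statement st = db.createStatement();
    ResultSet rs = st.executeQuery("SELECT id, deleted FROM modified_keys");
    while (rs.next()) {
      String id = rs.getString("id");
      if (rs.getBoolean("deleted")) {
        solr.deleteById(id);
      } else {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        // ... look up and add the remaining fields for this row here ...
        solr.add(doc);
      }
    }
    rs.close();
    st.executeUpdate("DELETE FROM modified_keys");  // clear the work queue
    solr.commit();
    db.close();
  }
}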

On Thu, Nov 17, 2011 at 9:34 AM, erolagnab trung@gmail.com wrote:
 Hi all,

 I'm using Solr 3.2 with DataImportHandler periodically updating the index every 5
 min.
 There's a housekeeping script running weekly which deletes some data in the
 database.
 I'd like to incorporate the reindexing strategy with this housekeeping
 script by:
 1. Locking the DataImportHandler - not allowing it to perform any update on the
 index - by having a flag in the database; every time the scheduled job triggers,
 it first checks the flag before performing an incremental index.
 2. Running a separate Solr instance, pointing to the same index, and performing a
 clean index

 Now before coming to this setup, I had some options but they didn't fit very
 well:
 1. Trigger reindexing directly in the running Solr instance - I wrap Solr
 with our own authentication mechanism, and reindexing causing a spike
 in memory usage and affecting the currently running apps (sitting in the same
 j2ee container) is the last thing I want
 2. Master/Slave setup - I think this is the most proper way to do it, but
 as a long term solution it won't work for now; we have a time constraint

 For the above selected strategy, would the searches be affected due to the
 reindexing from the 2nd solr instance?
 Do we need to tell Solr to use the new index once it's available?
 Any better option that I can give a try?

 Many thanks,

 Ero

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/What-is-the-best-approach-to-do-reindexing-on-the-fly-tp3515948p3515948.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: ExtractingRequestHandler HTTP GET Problem

2011-11-17 Thread Chris Hostetter

: indexed file. The CommonsHttpSolrServer sends the parameters as a HTTP
: GET request. Because of that I'll get a socket write error. If I
: change the CommonsHttpSolrServer to send the parameters as HTTP POST
: sending will work, but the ExtractingRequestHandler will not recognize
: the parameters. If I'm using the EmbeddedSolrServer there is no

that doesn't sound right ... if all you do is configure 
CommonsHttpSolrServer to use POST instead of GET it shouldn't change 
anything about how ExtractingRequestHandler is executed.

can you provide the code you have that uses CommonsHttpSolrServer, and 
info on how you have configured ExtractingRequestHandler so we can better 
understand what exactly you are doing?

in general it seems weird to me that you are base64 encoding some text and 
then sending it to the ExtractingRequestHandler -- why exactly aren't you 
just sending the text as is?  (is there some special feature of Tika i'm 
not aware of that only works if you feed it base 64 encoded data?)


-Hoss
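
For reference, a minimal SolrJ 3.x sketch that posts a file to the
ExtractingRequestHandler as a content stream (so nothing large ends up in the
URL); the handler path, file name, and literal field are assumptions:

import java.io.File;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractExample {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    // The document travels in the POST body; the extra parameters are
    // small and ride along as request parameters.
    ContentStreamUpdateRequest req =
        new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("some-document.pdf"));
    req.setParam("literal.id", "doc1");
    req.setParam("uprefix", "attr_");
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    server.request(req);
  }
}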


Re: Migrating from Hibernate Search to Solr

2011-11-17 Thread Ari King
 So no Hibernate/Solr glue out there already?   It'd be nice if you could use 
 Hibernate as you do, but instead of working with the Lucene API directly it 
 would  use SolrJ.   If this type of glue doesn't already exist, then 
 that'd be the first step I think.

 Otherwise, you could use Solr directly, but you'll likely be unhappy with the 
 disconnect compared to what you're used to.  SolrJ supports annotations, but 
 not to  the degree that Hibernate does, and even so you'd be left to create 
 an indexer and to wire in updates/deletes as well.

How involved/difficult would you say using Solr directly is? I
have no experience with Solr, but from what you described it doesn't
sound too bad.

-Ari
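
To give a feel for the SolrJ annotation support mentioned above, a minimal
sketch (the bean and field names are invented and the schema would need
matching fields):

import org.apache.solr.client.solrj.beans.Field;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class BeanExample {
  public static class Book {
    @Field public String id;
    @Field public String title;

    public Book(String id, String title) {
      this.id = id;
      this.title = title;
    }
  }

  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    // addBean maps the annotated fields onto a SolrInputDocument; unlike
    // Hibernate Search, commits, updates and deletes are still wired up by hand.
    server.addBean(new Book("1", "Apache Solr 3 Enterprise Search Server"));
    server.commit();
  }
}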


[ANNOUNCEMENT] Second Edition of the First Book on Solr

2011-11-17 Thread Smiley, David W.
Fellow Solr users,

I am proud to announce that the book Apache Solr 3 Enterprise Search Server 
is officially published!  This is the second edition of the first book on Solr 
by me, David Smiley, and my co-author Eric Pugh.  You can find full details 
about the book, download a free chapter, and purchase it here:
  http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
It is also available through other channels like Amazon.  You can feel good 
about the purchase knowing that 5% of each sale goes to support the Apache 
Software Foundation.  If you buy directly from the publisher, then the basis of 
the percentage that goes to the ASF (and to me) is higher than if you buy it 
through other channels.  

This book naturally covers the latest features in Solr as of version 3.4 like 
Result Grouping and Geospatial, but this is not a small update to the first 
book.  We have more experience with Solr and we've listened to reader feedback 
from the first edition.  No chapter was untouched: Faceting gets its own 
chapter, all search relevancy matters are discussed in one chapter, 
auto-complete approaches are all discussed together, much of the chapter on 
integration was rewritten to discuss newer technologies, and the first chapter 
was greatly streamlined.  Furthermore, each chapter has a tip in the 
introduction that advises readers in a hurry on what parts should be read now 
or later.  Finally, we developed a 2-page parameter quick-reference appendix 
that you will surely find useful printed on your desk.  In summary, we improved 
the existing content, and added about 25% more by page count.

Software, errata, and other information about this book and the previous 
edition is on our website:
  http://www.solrenterprisesearchserver.com/
We've been working hard on this book for the last 10 months and we hope it 
really helps save you time and improves your search project!

Apache Solr 3 Enterprise Search Server In Detail:

If you are a developer building an app today then you know how important a good 
search experience is.  Apache Solr, built on Apache Lucene, is a wildly popular 
open source enterprise search server that easily delivers powerful search and 
faceted navigation features that are elusive with databases.  Solr supports 
complex search criteria, faceting, result highlighting, query-completion, query 
spell-check, relevancy tuning, and more.

Apache Solr 3 Enterprise Search Server is a comprehensive reference guide for 
every feature Solr has to offer.  It serves the reader right from initiation to 
development to deployment.  It also comes with complete running examples to 
demonstrate its use and show how to integrate Solr with other languages and 
frameworks.

Through using a large set of metadata about artists, releases, and tracks 
courtesy of the MusicBrainz.org project, you will have a testing ground for 
Solr, and will learn how to import this data in various ways.  You will then 
learn how to search this data in different ways, including Solr's rich query 
syntax and boosting match scores based on record data.  Finally, we'll cover 
various deployment considerations to include indexing strategies and 
performance-oriented configuration that will enable you to scale Solr to meet 
the needs of a high-volume site.

Sincerely,

David Smiley (primary author)   david.w.smi...@gmail.com
Eric Pugh (co-author)   ep...@opensourceconnections.com