Explicitly tell Solr the analyzed value when indexing a document

2011-11-16 Thread Tim Terlegård
Hi,

I have a couple of string fields. For some of them I want my
application to be able to index a lowercased string but store the
original value. Is there some way to do this? Or would I have to come
up with a new field type and implement an analyzer?

/Tim
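[Editor's note: in Solr the stored value is always the raw original input; analysis only affects the indexed terms. So a field type along these lines (type and field names here are only illustrative) should index a lowercased single token while still returning the original string:]

```xml
<!-- schema.xml sketch: keyword-tokenize the whole value, lowercase it for
     the index; stored="true" returns the unanalyzed original -->
<fieldType name="string_lc" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_s" type="string_lc" indexed="true" stored="true"/>
```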


Re: Add copyTo Field without re-indexing?

2011-11-16 Thread Michael Kuhlmann

Am 17.11.2011 08:46, schrieb Kashif Khan:

Please advise how we can reindex Solr when fields have stored="false". We
cannot reindex the data from the beginning; we just want to read and write the
indexes from SolrJ only. Please advise a solution. I know we can do it using
the Lucene classes IndexReader and IndexWriter, but we want to index all
fields


This is not possible. At least not when the index is modified in any way 
(stemmed, lowercased, tokenized, etc.).


The original data is not saved when "stored" is false. You'll need your 
original source data to reindex then.


-Kuli


Re: Add copyTo Field without re-indexing?

2011-11-16 Thread Kashif Khan
Please advise how we can reindex Solr when fields have stored="false". We
cannot reindex the data from the beginning; we just want to read and write the
indexes from SolrJ only. Please advise a solution. I know we can do it using
the Lucene classes IndexReader and IndexWriter, but we want to index all
fields

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Add-copyTo-Field-without-re-indexing-tp3342253p3515020.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Aggregated indexing of updating RSS feeds

2011-11-16 Thread Chris Hostetter

: ..but the request I'm making is..
: /solr/myfeed?command=full-import&rows=5000&clean=false
: 
: ..note the clean=false.

I see it, but i also see this in the logs you provided...

: INFO: [] webapp=/solr path=/myfeed params={command=full-import} status=0
: QTime=8

...which means someone somewhere is executing full-import w/o using 
clean=false.  

are you absolutely certain that you are executing the request you think 
you are?  can you find a request in your logs that includes clean=false?

if it's not you and your code -- it is coming from somewhere, and that's 
what's causing DIH to trigger a deleteAll...

: 10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.DataImporter
: doFullImport
: INFO: Starting Full Import
: 10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.SolrWriter
: readIndexerProperties
: INFO: Read myfeed.properties
: 10-Nov-2011 05:40:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
: INFO: [] REMOVING ALL DOCUMENTS FROM INDEX



-Hoss


Re: to prevent number-of-matching-terms in contributing score

2011-11-16 Thread Chris Hostetter

:  1. "omitTermFreqAndPositions" is very straightforward but if I avoid
: positions I'll refuse to serve phrase queries. I had searched for this in

but do you really need phrase queries on your "cat" field?  i thought the 
point was to have simple matching on those terms?

:  2. Function query seemed nice (though strange because I never used it
: before) and I gave it a few hours but that too did not seem to solve my
: requirement. The "artificial" score we are generating is getting multiplied
: into rest of the score which includes score due to "cat" field as well. (I
: can not remove "cat" from "qf" as I have to search there). It is only that
: I don't want this field's score on the basis of matching "tf".

I don't think i realized you were using dismax ... if you just want a 
match on "cat" to help determine if the document is a match, but not have 
*any* impact on score, you could just set the qf boost to 0 (ie: 
qf=title^10 cat^0) but i'm not sure if that's really what you want.

: After spending some hours on function queries I finally reached on
: following query

Honestly: i'm not really following what you tried there because of the 
formatting applied by your email client ... it seemed to be making tons of 
hyperlinks out of pieces of the URL.

Looking at your query explanation however the problem seems to be that you 
are still using the relevancy score of the matches on the "cat" field, 
instead of *just* using the function boost...

: But debugging the query showed that the boost value ($cat_boost) is being
: multiplied into a value which is generated with the help of "cat" field
: thus resulting in different scores for 1 and 3 (similarly for 2 and 4).
: 
: 1.2942866 = (MATCH) boost(+(title:chair | cat:chair)~0.01
: (),map(query(cat:chair,def=-1.0),0.0,1000.0,1.0)), product of:

...my point before was to take "cat:chair" out of the "main" part of your 
query, and *only* put it in the boost function.  if you are using dismax, 
the "qf=cat^0" suggestion mentioned above *combined* with your boost 
function will probably get you what you want (i think)
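[Editor's note: combined, the request Hoss describes might look roughly like the sketch below. This uses the edismax "boost" parameter; the field names, boost values, and the $catq parameter dereference are illustrative, mirroring the map/query function already quoted in this thread:]

```text
defType=edismax
&q=chair
&qf=title^10 cat^0
&boost=map(query($catq,-1.0),0.0,1000.0,1.0)
&catq=cat:chair
```

Here cat^0 lets "cat" matches count toward whether a document matches without contributing to the score, and the boost function supplies the score contribution instead.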

: I was thinking there should be some hook or plugin (or anything) which
: could just change the score calculation formula *for a particular field*.
: There is a function in DefaultSimilarity class - *public float tf(float
: freq)* but that does not mention the field name. Is there a possibility to
: look into this direction?

on trunk, there is a distinct Similarity object per fieldtype, so you 
could certainly look at that -- but you are correct that in 3.x there is no 
way to override the tf() function on a per field basis.


-Hoss



Re: Solr Score Normalization

2011-11-16 Thread Chris Hostetter

: Perhaps you can solve your usecase by playing with the new eDismax 
: "boost" parameter, which multiplies the functions with the other score 
: instead of adding.

and FWIW: the "boost" param of the edismax parser is really just syntactic 
sugar for the BoostQParser wrapped around an edismax query -- you 
can wrap it around any query produced by any QParser...

  q={!edismax qf=foo}bar&boost=func(asdf)

...is the same as...

  q={!boost b=func(asdf) v=$qq}&qq={!edismax qf=foo}bar



-Hoss


Re: maxFieldLength clarifications

2011-11-16 Thread Chris Hostetter

:1. is the maxFieldLength parameter deprecated?
:2. what is maxFieldLength counting? I understood it's counting tokens
:per document (not per field)
:3. what if I simply remove the maxFieldLength setting from the
:solrconfig?

1. it has been deprecated and will not be used in Solr 4.x, but still 
exists in Solr 3.x

2. It should be terms per field per document, not just per document.

3. if you don't specify it in solrconfig.xml it defaults to "-1" which 
means no limit.
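[Editor's note: in a 3.x solrconfig.xml the setting can appear in either section, with mainIndex taking precedence over indexDefaults; the value shown below is just the old example default, not a recommendation:]

```xml
<indexDefaults>
  <!-- terms kept per field per document; -1 (or absent) means no limit -->
  <maxFieldLength>10000</maxFieldLength>
</indexDefaults>
```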

: From what I see if I remove it from the solrconfig the text values are
: still constrained to some bound since if I query the last term in a long
: document's text I don't get a match.

a) what version of solr are you using?
b) double check both the mainIndex and indexDefaults sections of your 
solrconfig.xml and make sure maxFieldLength isn't in either of them.

-Hoss


Re: size of data replicated

2011-11-16 Thread Chris Hostetter

: query response time. To get a clear picture, I would like to know how
: to get the size of data being replicated for each commit. Through the
: admin UI, you may read a x of y G data is being replicated; however,
: "y" is the total index size, instead of data being copied over. I
: couldn't find the info in the solr logs either. Any idea?

maybe i'm misunderstanding your question, but isn't "x" in your example 
the number that you are looking for? (ie: "how much data was replicated?")

-Hoss


Re: Similar documents and advantages / disadvantages of MLT / Deduplication

2011-11-16 Thread Chris Hostetter

: I index 1000 docs, 5 of them are 95% the same (for example: copy pasted
: blog articles from different sources, with slight changes (author name,
: etc..)).
: But they have differences.
: *Now i like to see 1 doc in my result set and the other 4 should be marked
: as similar.*

Do you actually want all 1000 docs in your index, or do you want to prevent 
4 of the 5 copies of the doc from being indexed?

Either way, if the TextProfileSignature is doing a good job of 
identifying the 5 similar docs, then use that at index time.

If you want to keep 4/5 out of the index, then use the Deduplication 
features to prevent the duplicates from being indexed and you're done.  

If you want all docs in the index, then you have to decide how you want to 
"mark" docs as similar ... do you want to only have one of those docs 
appear in all of your results, or do you want all of them in the results 
but with an indication that there are other similar docs?  If the former: 
then take a look at "Grouping" and group on your signature field.  If the 
latter, use the MLT component, to find similar docs based on the signature 
field (ie: mlt.fl=signature_t)

https://wiki.apache.org/solr/FieldCollapsing
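[Editor's note: for the grouping option, a request along these lines (parameter names per the FieldCollapsing page above; signature_t as in the MLT suggestion) would collapse the near-duplicates to one representative per signature:]

```text
q=*:*&group=true&group.field=signature_t&group.limit=1
```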

-Hoss


Re: Search in multivalued string field does not work

2011-11-16 Thread Erick Erickson
Attach &debugQuery=true to the URL and look at the results, that'll
show you what the query parsed as on the actual server.

Where did shards come from? I'd advise turning all the shard stuff
off until you answer this question and querying the server directly,
shards may be confusing the issue. Let's get to the bottom of your
query problems before introducing that complexity!

By Luke, I mean get a copy of the Luke program, see here:
http://code.google.com/p/luke/

Run that program and point it at the index for your servers. It'll allow
you to examine the contents of the indexes at a fairly low level.
Look at the fields in question and see if the data you expect to match is,
indeed, there.

From what you've said, I'd guess it's some difference between
the two servers, because on the surface of it I don't see why you'd
be seeing the differences you claim. So either what you think is on
the servers isn't there, I don't understand the problem or

Best
Erick

On Wed, Nov 16, 2011 at 9:11 AM, mechravi25  wrote:
> Hi,
>
> Thanks for the suggestions.
>
> The index is the same in both the servers. We index using JDBC drivers.
>
> We have not modified the request handler in solrconfig on either machine and
> also after the latest schema update, we have re-indexed the data.
>
>
> *We even checked the analysis page and there is no difference between both
> the servers and after checking the "highlight matches" option in the field
> value, the result was getting highlighted in the "term text" of Index
> Analyzer. But we are still confused as to why we are not getting the result in
> the search page.*
>
> Actually i forgot to post the dynamic field declaration in my schema file
> and this is how it is declared.
>
>  multiValued="true" />
>  multiValued="true" stored="false"/>
>
> the textgen fieldtype definition is as follows:
>
> 
>  
>   
>
>    words="stopwords.txt" enablePositionIncrements="true" />
>    generateintegerParts="1" catenateWords="1" catenateintegers="1"
> catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
> stemEnglishPossessive="1" />
>   
>         inject="true"/>
>
>  
>  
>
>   
>    ignoreCase="true" expand="true"/>
>              ignoreCase="true"
>           words="stopwords.txt"
>           enablePositionIncrements="true"
>           />
>    generateintegerParts="1" catenateWords="0" catenateintegers="0"
> catenateAll="0" splitOnCaseChange="0"/>
>   
>  
> 
>
>
> We have implemented shards in the DB core, which in turn gets results from
> the shard cores (core1 and core2). The actual data is present in core2. We tried
> all the options in core2 directly as well, but with no success.
>
> The query is passed as follows :
>
> QueryString : idx_ABCFacet:"XXX... ABC DEF"
>
> INFO: [core2] webapp=/solr path=/select
> params={debugQuery=false&fl=uid,score&start=0&q=idx_ABCFacet:"XXX...+ABC+DEF"&isShard=true&wt=javabin&fsv=true&rows=10&version=1}
> hits=0 status=0 QTime=2
> Nov 16, 2011 5:44:17 AM org.apache.solr.core.SolrCore execute
> INFO: [core1] webapp=/solr path=/select
> params={debugQuery=false&fl=uid,score&start=0&q=idx_ABCFacet:"XXX...+ABC+DEF"&isShard=true&wt=javabin&fsv=true&rows=10&version=1}
> hits=0 status=0 QTime=0
> Nov 16, 2011 5:44:17 AM org.apache.solr.core.SolrCore execute
> INFO: [db] webapp=/solr path=/select/
> params={debugQuery=on&indent=on&start=0&q=idx_ABCFacet:"XXX...+ABC+DEF"&version=2.2&rows=10}
> status=0 QTime=64
>
>
>
> Also can you please elaborate on the 3rd point
>
> *3> Try using Luke to examine the indexes on both servers to determine
>     whether they're the same. *
>
>
>
>
> Thanks.
>
>
>
>


Re: Dismax and phrases

2011-11-16 Thread Chris Hostetter

: I am starting to wonder whether the module giving finnish language support
: (lingsoft) might be the cause?

It's extremely possible -- the details really matter when debugging 
things like this.

Since i don't have any access to these custom plugins, i don't know what 
they might be doing, or how they might be affecting the terms produced 
during analysis to explain why you are getting the structure you are -- 
but one explanation might be if every term produced by them gets a 
positionIncrement of "0" ... that would tell the query parser to treat 
them as alternatives -- it's the same thing SynonymFilter does.

you'd have to look at the output from the analysis tool, feeding your 
example input into the query analyzer to see what terms it produces (and 
what attributes those terms have).  if it is a position increment issue, 
then you should see the same "OR" style query structure (instead of a 
phrase query) even if you use the default "lucene" parser and give it a 
quoted phrase...

text_fi:"asuntojen hinnat"


-Hoss




Re: Phrase between quotes with dismax edismax

2011-11-16 Thread Erick Erickson
Ah, ok I was mis-reading some things. So, let's ignore the
category bits for now.

Questions:
1> Can you refine down the problem. That is,
demonstrate this with a single field and leave out
the category stuff. Something like
q=title:"chef de projet" getting no results and
q=title:"chef projet" getting results? The idea
is to cycle through all the fields to see if we can
hone in on the problem. I'd get rid of any pf
parameters of your edismax definition too. I'm after
   the simplest case that can demonstrate the issue.
   For that matter, it'd be even easier if you could
   make this happen with the default searcher (
   solr/select?q=title:"chef de projet"
2> if you can do <1>, please post the field definitions
 from your schema.xml file. One possibility is that
 you are removing stopwords at index time but not
 query time or vice-versa, but that's a wild guess.
3> Once you have a field, use the admin/analysis page
 to see the exact transformations that occur at index
 and query time to see if anything jumps out.

All in all, I suspect you have a field that isn't being parsed
as you expect at either index or query time, but as I said
above, that's a guess.

Best
Erick

On Wed, Nov 16, 2011 at 5:02 AM, Jean-Claude Dauphin
 wrote:
> Thanks Erick for your quick answer.
>
> I am using Solr 3.1
>
> 1) I have set the mm parameter to 0 and removed the categories from the
> search. Thus the query is only for "chef de projet" and nothing else.
> But the problem remains, i.e searching for "chef de projet" gives no
> results while searching for "chef projet" gives the right result.
>
> Here is an excerpt from the test I made:
>
> DISMAX query (q)=("chef de projet")
>
> =The Parameters=
>
> *queryResponse*=[{responseHeader={status=0,QTime=157,
>
> params={facet=true,
>
> f.createDate.facet.date.start=NOW/DAY-6DAYS,tie=0.1,
>
> facet.limit=4,
>
> f.location.facet.limit=3,
>
> *q.alt*=*:*,
>
> facet.date.other=all,
>
> hl=true,version=2,
>
> *bq*=[categoryPayloads:category1071^1,
> categoryPayloads:category10055078^1, categoryPayloads:category10055405^1],
>
> fl=*,score,
>
> debugQuery=true,
>
> facet.field=[soldProvisions, contractTypeText, nafCodeText, createDate,
> wage, keywords, labelLocation, jobCode, organizationName,
> requiredExperienceLevelText],
>
> *qs*=3,
>
> qt=edismax,
>
> facet.date.end=NOW/DAY,
>
> *mm*=0,
>
> facet.mincount=1,
>
> facet.date=createDate,
>
> *qf*= title^4.0 formattedDescription^2.0 nafCodeText^2.0 jobCodeText^3.0
> organizationName^1.0 keywords^3.0 location^1.0 labelLocation^1.0
> categoryPayloads^1.0,
>
> hl.fl=title,
>
> wt=javabin,
>
> rows=20,
>
> start=0,
>
> *q*=("chef de projet"),
>
> facet.date.gap=+1DAY,
>
> *stopwords*=false,
>
> *ps*=3}},
>
> The Solr Response
> response={numFound=0
>
> Debug Info
>
> debug={
>
> *rawquerystring*=("chef de projet"),
>
> *querystring*=("chef de projet"),
>
> *---
> *
>
> *parsedquery*=
>
> +*DisjunctionMaxQuery*((title:"chef de projet"~3^4.0 | keywords:chef de
> projet^3.0 | organizationName:chef de projet | location:chef de projet |
> formattedDescription:"chef de projet"~3^2.0 | nafCodeText:chef de
> projet^2.0 | jobCodeText:chef de projet^3.0 | categoryPayloads:"chef de
> projet"~3 | labelLocation:chef de projet)~0.1)
> *DisjunctionMaxQuery*((title:"(("chef
> chef) de (projet") projet)"~3^4.0)~0.1) categoryPayloads:category1071
> categoryPayloads:category10055078 categoryPayloads:category10055405,
>
> *---*
>
> *parsedquery_toString*=+(title:"chef de projet"~3^4.0 | keywords:chef de
> projet^3.0 | organizationName:chef de projet | location:chef de projet |
> formattedDescription:"chef de projet"~3^2.0 | nafCodeText:chef de
> projet^2.0 | jobCodeText:chef de projet^3.0 | categoryPayloads:"chef de
> projet"~3 | labelLocation:chef de projet)~0.1 (title:"(("chef chef) de
> (projet") projet)"~3^4.0)~0.1 categoryPayloads:category1071
> categoryPayloads:category10055078 categoryPayloads:category10055405,
>
>
>
> explain={},
>
> QParser=ExtendedDismaxQParser,altquerystring=null,
>
> *boost_queries*=[categoryPayloads:category1071^1,
> categoryPayloads:category10055078^1, categoryPayloads:category10055405^1],
>
> *parsed_boost_queries*=[categoryPayloads:category1071,
> categoryPayloads:category10055078, categoryPayloads:category10055405],
> boostfuncs=null,
>
> 2) I tried to remove the bq values but no changes:
>
> *querystring*=("chef de projet"),
>
> *parsedquery*=+*DisjunctionMaxQuery*((title:"chef de projet"~3^4.0 |
> keywords:chef de projet^3.0 | organizationName:chef de projet |
> location:chef de projet | formattedDescription:"chef de projet"~3^2.0 |
> nafCodeText:chef de projet^2.0 | jobCodeText:chef de projet^3.0 |
> categoryPayloads:"chef de projet"~3 | labelLocation:chef de projet)~0.1) *
> DisjunctionMaxQuery*((title:"(("chef chef) de (projet")
> projet)"~3^4.0)~0.1),
> *parsedquery

strange behavior of scores and term proximity use

2011-11-16 Thread Ariel Zerbib
Hi,

For this term proximity query: ab_main_title_l0:"to be or not to be"~1000

http://localhost:/solr/select?q=ab_main_title_l0%3A%22og54ct8n+to+be+or+not+to+be+5w8ojsx2%22~1000&sort=score+desc&start=0&rows=3&fl=ab_main_title_l0%2Cscore%2Cid&debugQuery=true

The first three results are the following:




  0
  5


  
2315190010001021

  og54ct8n To be or not to be a Jew. 5w8ojsx2

3.0814114
  
2313006480001021

  og54ct8n To be or not to be 5w8ojsx2

3.0814114
  
2356410250001021

  og54ct8n Rumspringa : to be or not to be Amish / 5w8ojsx2

3.0814114


  ab_main_title_l0:"og54ct8n to be or not to be
5w8ojsx2"~1000
  ab_main_title_l0:"og54ct8n to be or not to be
5w8ojsx2"~1000
  PhraseQuery(ab_main_title_l0:"og54ct8n to be or
not to be 5w8ojsx2"~1000)
  ab_main_title_l0:"og54ct8n to be or not
to be 5w8ojsx2"~1000
  

5.337161 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be
5w8ojsx2"~1000 in 378403) [DefaultSimilarity], result of:
  5.337161 = fieldWeight in 378403, product of:
0.57735026 = tf(freq=0.3334), with freq of:
  0.3334 = phraseFreq=0.3334
29.581549 = idf(), sum of:
  1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
  3.0405464 = idf(docFreq=429046, maxDocs=3301436)
  5.3583193 = idf(docFreq=42257, maxDocs=3301436)
  4.3826413 = idf(docFreq=112108, maxDocs=3301436)
  6.3982043 = idf(docFreq=14937, maxDocs=3301436)
  3.0405464 = idf(docFreq=429046, maxDocs=3301436)
  5.3583193 = idf(docFreq=42257, maxDocs=3301436)
  1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
0.3125 = fieldNorm(doc=378403)


9.244234 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be
5w8ojsx2"~1000 in 482807) [DefaultSimilarity], result of:
  9.244234 = fieldWeight in 482807, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = phraseFreq=1.0
29.581549 = idf(), sum of:
  1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
  3.0405464 = idf(docFreq=429046, maxDocs=3301436)
  5.3583193 = idf(docFreq=42257, maxDocs=3301436)
  4.3826413 = idf(docFreq=112108, maxDocs=3301436)
  6.3982043 = idf(docFreq=14937, maxDocs=3301436)
  3.0405464 = idf(docFreq=429046, maxDocs=3301436)
  5.3583193 = idf(docFreq=42257, maxDocs=3301436)
  1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
0.3125 = fieldNorm(doc=482807)


5.337161 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be
5w8ojsx2"~1000 in 1317563) [DefaultSimilarity], result of:
  5.337161 = fieldWeight in 1317563, product of:
0.57735026 = tf(freq=0.3334), with freq of:
  0.3334 = phraseFreq=0.3334
29.581549 = idf(), sum of:
  1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
  3.0405464 = idf(docFreq=429046, maxDocs=3301436)
  5.3583193 = idf(docFreq=42257, maxDocs=3301436)
  4.3826413 = idf(docFreq=112108, maxDocs=3301436)
  6.3982043 = idf(docFreq=14937, maxDocs=3301436)
  3.0405464 = idf(docFreq=429046, maxDocs=3301436)
  5.3583193 = idf(docFreq=42257, maxDocs=3301436)
  1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
0.3125 = fieldNorm(doc=1317563)



The version used is an October snapshot of 4.0.

I have 2 questions about the result:
- Why are the scores in the debug output different from the scores in
the results?
- What is the expected behavior of this kind of term proximity query?
  - The debug scores seem to be well ordered, but the result scores
seem to be wrong.


Thanks,
Ariel


RE: Easy way to tell if there are pending documents

2011-11-16 Thread Latter, Antoine
Excellent. It looks like I can drill down into exactly what I want without 
having to load up the rest of the statistics.

-Original Message-
From: Justin Caratzas [mailto:justin.carat...@gmail.com] 
Sent: Wednesday, November 16, 2011 10:41 AM
To: solr-user@lucene.apache.org
Subject: Re: Easy way to tell if there are pending documents


You can enable the stats handler
(https://issues.apache.org/jira/browse/SOLR-1750) and inspect the JSON 
programmatically.

-- Justin

"Latter, Antoine"  writes:

> Thank you, that does help - but I am more looking for a way to get at this 
> programmatically.
>
> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> Sent: Tuesday, November 15, 2011 11:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Easy way to tell if there are pending documents
>
> Antoine,
>
> On Solr Admin Stats page search for "docsPending".  I think this is what you 
> are looking for.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene 
> ecosystem search :: http://search-lucene.com/
>
>
>>
>>From: "Latter, Antoine" 
>>To: "'solr-user@lucene.apache.org'" 
>>Sent: Monday, November 14, 2011 11:39 AM
>>Subject: Easy way to tell if there are pending documents
>>
>>Hi Solr,
>>
>>Does anyone know of an easy way to tell if there are pending documents 
>>waiting for commit?
>>
>>Our application performs operations that are never safe to perform  
>>while commits are pending. We make this work by making sure that all  
>>indexing operations end in a commit, and stop the unsafe operations  
>>from running while a commit is running.
>>
>>This works great most of the time, except when we have enough disk  
>>space to add documents to the pending area, but not enough disk  space 
>>to do a commit - then the indexing operations only error out  after 
>>they've done all of their adds.
>>
>>It would be nice if the unsafe operation could somehow detect that there are 
>>pending documents and abort.
>>
>>In the interim I'll have the unsafe operation perform a commit when it 
>>starts, but I've been weeding out useless commits from my app recently and I 
>>don't like them creeping back in.
>>
>>Thanks,
>>Antoine
>>
>>
>>



Re: Easy way to tell if there are pending documents

2011-11-16 Thread Justin Caratzas

You can enable the stats handler
(https://issues.apache.org/jira/browse/SOLR-1750) and inspect the
JSON programmatically.

-- Justin

"Latter, Antoine"  writes:

> Thank you, that does help - but I am more looking for a way to get at this 
> programmatically.
>
> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
> Sent: Tuesday, November 15, 2011 11:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Easy way to tell if there are pending documents
>
> Antoine,
>
> On Solr Admin Stats page search for "docsPending".  I think this is what you 
> are looking for.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem 
> search :: http://search-lucene.com/
>
>
>>
>>From: "Latter, Antoine" 
>>To: "'solr-user@lucene.apache.org'" 
>>Sent: Monday, November 14, 2011 11:39 AM
>>Subject: Easy way to tell if there are pending documents
>>
>>Hi Solr,
>>
>>Does anyone know of an easy way to tell if there are pending documents 
>>waiting for commit?
>>
>>Our application performs operations that are never safe to perform
>> while commits are pending. We make this work by making sure that all
>> indexing operations end in a commit, and stop the unsafe operations
>> from running while a commit is running.
>>
>>This works great most of the time, except when we have enough disk
>> space to add documents to the pending area, but not enough disk
>> space to do a commit - then the indexing operations only error out
>> after they've done all of their adds.
>>
>>It would be nice if the unsafe operation could somehow detect that there are 
>>pending documents and abort.
>>
>>In the interim I'll have the unsafe operation perform a commit when it 
>>starts, but I've been weeding out useless commits from my app recently and I 
>>don't like them creeping back in.
>>
>>Thanks,
>>Antoine
>>
>>
>>



Re: Problems installing Solr PHP extension

2011-11-16 Thread Travis Low
Ah, excellent, thank you Kuli!  We'll just use that.

On Wed, Nov 16, 2011 at 11:35 AM, Michael Kuhlmann  wrote:

> Am 16.11.2011 17:11, schrieb Travis Low:
>
>
>> If I can't solve this problem then we'll basically have to write our own
>> PHP Solr client, which would royally suck.
>>
>
> Oh, if you really can't get the library to work, no problem - there are
> several PHP clients out there that don't need a PECL installation.
>
> Personally, I have used http://code.google.com/p/solr-php-client/,
> it works well.
>
> -Kuli
>



-- 

Travis Low, Director of Development

Centurion Research Solutions, LLC
14048 ParkEast Circle • Suite 100 • Chantilly, VA 20151
703-956-6276 • 703-378-4474 (fax)
http://www.centurionresearch.com

**The information contained in this email message is confidential and
protected from disclosure.  If you are not the intended recipient, any use
or dissemination of this communication, including attachments, is strictly
prohibited.  If you received this email message in error, please delete it
and immediately notify the sender.

This email message and any attachments have been scanned and are believed
to be free of malicious software and defects that might affect any computer
system in which they are received and opened. No responsibility is accepted
by Centurion Research Solutions, LLC for any loss or damage arising from
the content of this email.


Re: Problems installing Solr PHP extension

2011-11-16 Thread Michael Kuhlmann

Am 16.11.2011 17:11, schrieb Travis Low:


If I can't solve this problem then we'll basically have to write our own
PHP Solr client, which would royally suck.


Oh, if you really can't get the library to work, no problem - there are 
several PHP clients out there that don't need a PECL installation.


Personally, I have used http://code.google.com/p/solr-php-client/, it 
works well.


-Kuli


Re: Problems installing Solr PHP extension

2011-11-16 Thread Travis Low
Thanks so much for responding.  I tried your suggestion and the pecl build
*seems* to go okay, but after restarting Apache, I get this again in the
error_log:

> PHP Warning: PHP Startup: Unable to load dynamic library
> '/usr/lib64/php/modules/solr.so' - /usr/lib64/php/modules/solr.so:
> undefined symbol: curl_easy_getinfo in Unknown on line 0

I'm baffled by this because the undefined symbol is in libcurl.so, and I've
specified the path to that library.

If I can't solve this problem then we'll basically have to write our own
PHP Solr client, which would royally suck.

cheers,

Travis

On Wed, Nov 16, 2011 at 7:11 AM, Adolfo Castro Menna <
adolfo.castrome...@gmail.com> wrote:

> Pecl installation is kinda buggy. I installed it ignoring pecl dependencies
> because I already had them.
>
> Try: pecl install -n solr  (-n ignores dependencies)
> And when it prompts for curl and libxml, point the path to where you have
> installed them, probably in /usr/lib/
>
> Cheers,
> Adolfo.
>
> On Tue, Nov 15, 2011 at 7:27 PM, Travis Low  wrote:
>
> > I know this isn't strictly Solr, but I've been at this for hours and I'm
> at
> > my wits end.  I cannot install the Solr PECL extension (
> > http://pecl.php.net/package/solr), either by command line "pecl install
> > solr" or by downloading and using phpize.  Always the same error, which I
> > see here:
> >
> >
> http://www.lmpx.com/nav/article.php/news.php.net/php.qa.reports/24197/read/index.html
> >
> > It boils down to this:
> > PHP Warning: PHP Startup: Unable to load dynamic library
> > '/root/solr-0.9.11/modules/solr.so' - /root/solr-0.9.11/modules/solr.so:
> > undefined symbol: curl_easy_getinfo in Unknown on line 0
> >
> > I am using the current Solr PECL extension.  PHP 5.3.8.  Curl 7.21.3.
>  Yes,
> > libcurl and libcurl-dev are both installed, also 7.21.3.  Fedora Core 15,
> > patched to current levels.
> >
> > Please help!
> >
> > cheers,
> >
> > Travis
> > --
> >
> >
>





Re: How to mix solr query info into the apache httpd logging (reverseproxy)?

2011-11-16 Thread alex_mass
Thanks for the answer. Mixing it up with params will certainly be the
easiest solution.

Alex

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-mix-solr-query-info-into-the-apache-httpd-logging-reverseproxy-tp3498539p3513097.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Help! - ContentStreamUpdateRequest

2011-11-16 Thread Tod

Erick,

Autocommit is commented out in solrconfig.xml; I have avoided commits 
until after the indexing process is complete.  As an experiment I tried 
committing every n records processed to see if varying n would make a 
difference, but it really didn't change much.


My original use case had the client running from the Solr server and 
streaming the document content over from a web server, based on the URL 
gathered by a query from a backend database.  The locking problem 
appeared there first, so I tried moving the client code to the web server 
to be closer to the documents' origin.  That helped a little but it 
ended up locking again, which is where I am now.


Solr should be able to index way more documents than the 35K I'm trying 
to index.  It seems from others' accounts that they are able to do what 
I'm trying to do successfully.  Therefore I believe I must be doing 
something extraordinarily dumb.  I'll be happy to share any information 
about my environment or configuration if it will help find my error.


Thanks for all of your help.


- Tod
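For reference against Erick's question quoted below: autocommit lives in the updateHandler section of solrconfig.xml. A hedged sketch with deliberately high thresholds (the maxDocs/maxTime values are placeholders to illustrate the shape, not recommendations):

```xml
<!-- solrconfig.xml sketch: autocommit thresholds; the values below are
     illustrative and should be tuned to the actual indexing load -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs>  <!-- commit after this many added docs -->
    <maxTime>300000</maxTime>  <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>
```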





On 11/15/2011 8:08 PM, Erick Erickson wrote:

That's odd. What are your autocommit parameters? And are you either
committing or optimizing as part of your program? I'd bump the
autocommit parameters up and NOT commit (or optimize) from your
client if you are

Best
Erick

On Tue, Nov 15, 2011 at 2:17 PM, Tod  wrote:

Otis,

The files are only part of the payload.  The supporting metadata exists in a
database.  I'm pulling that information, as well as the name and location of
the file, from the database and then sending it to a remote Solr instance to
be indexed.

I've heard Solr would prefer to get documents it needs to index in chunks
rather than one at a time as I'm doing now.  The one at a time approach is
locking up the Solr server at around 700 entries.  My thought was if I could
chunk them in a batch at a time the lockup will stop and indexing
performance would improve.


Thanks - Tod

On 11/15/2011 12:13 PM, Otis Gospodnetic wrote:


Hi,

How about just concatenating your files into one? Would that work for
you?

Otis


Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




From: Tod
To: solr-user@lucene.apache.org
Sent: Monday, November 14, 2011 4:24 PM
Subject: Help! - ContentStreamUpdateRequest

Could someone take a look at this page:

http://wiki.apache.org/solr/ContentStreamUpdateRequestExample

... and tell me what code changes I would need to make to be able to
stream a LOT of files at once rather than just one? It has to be something
simple like a collection of some sort but I just can't get it figured out.
Maybe I'm using the wrong class altogether?


TIA












Problems with AutoSuggest feature(Terms Components)

2011-11-16 Thread mechravi25
Hi,

When I search for data I noticed two things:

1.) I noticed *terms.regex=.** in the logs, which performs a blank
(match-all) search on terms, so the query time is high. Is there any way to
overcome this? My actual query should look like the first one in bold, but
instead it ends up like the second case (the 2nd text highlighted in bold).

2.) I also noticed *terms.limit=-1*, which is very expensive as it asks
Solr to return all the terms. It should be set to 10 or 20 at most.
Please suggest how to set this.
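One hedged way to enforce such limits server-side is to pin them in the /terms handler definition in solrconfig.xml, so that distributed (shards.qt=/terms) requests cannot override them. A sketch loosely based on the stock example config; the handler name and the cap of 20 are illustrative:

```xml
<!-- solrconfig.xml sketch: cap terms.limit so no request can pull every term -->
<searchComponent name="terms" class="solr.TermsComponent"/>
<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="invariants">
    <bool name="terms">true</bool>
    <int name="terms.limit">20</int>  <!-- illustrative cap -->
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
```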



Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
INFO: [db] webapp=/solr path=/terms
params={*terms.regex=ABC\+CCC\+lll*\+data.*&terms.regex.flag=case_insensitive&terms.fl=nameFacet}
status=0 QTime=935 
Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
INFO: [core2] webapp=/solr path=/terms
params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=ABC\+CCC\+lll\+data.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1}
status=0 QTime=842 
Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
INFO: [db] webapp=/solr path=/terms
params={terms.regex=ABC\+CCC\+lll\+data.*&terms.regex.flag=case_insensitive&terms.fl=nameFacet}
status=0 QTime=927 
Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
INFO: [core3] webapp=/solr path=/terms
params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1}
status=0 QTime=115 

Nov 14, 2011 2:05:55 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/solr path=/terms
params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&*terms.regex=.**&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1}
status=0 QTime=106767 
Nov 14, 2011 2:05:55 PM org.apache.solr.core.SolrCore execute
INFO: [core4] webapp=/solr path=/terms
params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1}
status=0 QTime=106766 
Nov 14, 2011 2:05:55 PM org.apache.solr.core.SolrCore execute

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problems-with-AutoSuggest-feature-Terms-Components-tp3512734p3512734.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search in multivalued string field does not work

2011-11-16 Thread mechravi25
Hi,

Thanks for the suggestions.

The index is the same in both the servers. We index using JDBC drivers.

We have not modified the request handler in solrconfig on either machine and
also after the latest schema update, we have re-indexed the data.


*We even checked the analysis page and there is no difference between the
two servers; after checking the "highlight matches" option, the field
value was getting highlighted in the "term text" of the Index Analyzer.
But we are still confused as to why we are not getting the result on
the search page.*

Actually I forgot to post the dynamic field declaration in my schema file;
this is how it is declared.

 
 

the textgen fieldtype definition is as follows:


   
   

   
   
   


 
 

   
   
   
   
   
 



We have implemented shards in core DB which is in turn gets a result from
shards core(core1 and core2). This actual data is present in core2. We tried
all the options in core2 directly as well but with no success.

The query is passed as follows :

QueryString : idx_ABCFacet:"XXX... ABC DEF"

INFO: [core2] webapp=/solr path=/select
params={debugQuery=false&fl=uid,score&start=0&q=idx_ABCFacet:"XXX...+ABC+DEF"&isShard=true&wt=javabin&fsv=true&rows=10&version=1}
hits=0 status=0 QTime=2 
Nov 16, 2011 5:44:17 AM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/solr path=/select
params={debugQuery=false&fl=uid,score&start=0&q=idx_ABCFacet:"XXX...+ABC+DEF"&isShard=true&wt=javabin&fsv=true&rows=10&version=1}
hits=0 status=0 QTime=0 
Nov 16, 2011 5:44:17 AM org.apache.solr.core.SolrCore execute
INFO: [db] webapp=/solr path=/select/
params={debugQuery=on&indent=on&start=0&q=idx_ABCFacet:"XXX...+ABC+DEF"&version=2.2&rows=10}
status=0 QTime=64 



Also can you please elaborate on the 3rd point

*3> Try using Luke to examine the indexes on both servers to determine 
 whether they're the same. *




Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-in-multivalued-string-field-does-not-work-tp3509458p3512710.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Score Normalization

2011-11-16 Thread Jan Høydahl
Perhaps you can solve your usecase by playing with the new eDismax "boost" 
parameter, which multiplies the functions with the other score instead of 
adding.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
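For concreteness, a hedged sketch of such a request, reusing the field and function names from Sid's mail quoted below (everything else, including the query text, is illustrative):

```
q=some query
&defType=edismax
&qf=field1^2 field2^3
&boost=func1(field3)
&boost=func2(field4)
```

Each boost function multiplies the query score, so functions returning values in [0,1] scale the text score instead of being added on top of it as bf does.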

On 5. nov. 2011, at 01:26, sangrish wrote:

> 
> Hi,
> 
> 
>I have a (dismax) request handler which has the following 3 scoring
> components (1 qf & 2 bf) :
> 
>qf = "field1^2 field2^3"
>bf = func1(field3)^2 func2(field4)^3
> 
>  Both func1 & func2 return scores between 0 & 1. The score returned by
> textual match (qf) ranges from 0 to 
> 
>   To allow better combination of text match & my functions, I want the text
> score to be normalized between 0 & 1. Is there any way I can achieve that
> here?
> 
> Thanks
> Sid
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Score-Normalization-tp3481627p3481627.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Join and faceting by children's attributes

2011-11-16 Thread Tobias
Hello,

I currently have a demand for faceting on the children of a join query.

My index is set up in a way that there are parent and child documents.
The child documents do have the facet information in a (precisely: some)
multivalue field(s). The parent documents themselves do not have any of it.

As the join query support allows me to do a simple search within the child
documents and return documents from the parent document space I thought
there
probably is a way to figure out the available facet values from the child
document space and present both in the result set, but this seems more
difficult
than I thought it would be.

The join query support would allow me to filter on specific
child-document-space
facet fields, for example:

but I can not really find a way to present *which faceting options are
available*
in the result set in the first place.

Denormalizing my index in a way that the parent documents would contain
the faceting information is not an option at the moment, because I wanted
to keep the index more generic, so that there's not one field per attribute
but two generic attribute fields (multi-value), that keep the Key/Value
pairs,
like the following table shows. I need this setup because at index setup
time
I do not know which attributes for the various products/items will be
available.



If I now denormalized a bunch of shoe child items into the parent product,
it would always contain all possible size/color combinations, even if some of
the child products do not meet the initial search term's criteria, e.g.
searching above for (title:Sneakers AND desc:cool) should return just facets
for "size" (2), "color" (2), "red" (1), "blue" (1), "40" (1)  and "42" (1),
which I do postprocess in my client application, so that I know that
"red" and "blue" are "color"s and "40" and "42" are "size"s.

I thought that you experts might have an idea on how to continue from there.

Best,
Tobias
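For concreteness, the child-space filtering that already works can be sketched roughly like this (join query parser from trunk/4.0-era Solr; the field names parent_id and attr_value are made up for illustration) — the open question above is how to also compute facets over the child-document space rather than over the returned parents:

```
q={!join from=parent_id to=id}(title:Sneakers AND desc:cool AND attr_value:red)
```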


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Join-and-faceting-by-children-s-attributes-tp3512629p3512629.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problems installing Solr PHP extension

2011-11-16 Thread Adolfo Castro Menna
Pecl installation is kinda buggy. I installed it ignoring pecl dependencies
because I already had them.

Try: pecl install -n solr  (-n ignores dependencies)
And when it prompts for curl and libxml, point the path to where you have
installed them, probably in /usr/lib/

Cheers,
Adolfo.

On Tue, Nov 15, 2011 at 7:27 PM, Travis Low  wrote:

> I know this isn't strictly Solr, but I've been at this for hours and I'm at
> my wits end.  I cannot install the Solr PECL extension (
> http://pecl.php.net/package/solr), either by command line "pecl install
> solr" or by downloading and using phpize.  Always the same error, which I
> see here:
>
> http://www.lmpx.com/nav/article.php/news.php.net/php.qa.reports/24197/read/index.html
>
> It boils down to this:
> PHP Warning: PHP Startup: Unable to load dynamic library
> '/root/solr-0.9.11/modules/solr.so' - /root/solr-0.9.11/modules/solr.so:
> undefined symbol: curl_easy_getinfo in Unknown on line 0
>
> I am using the current Solr PECL extension.  PHP 5.3.8.  Curl 7.21.3.  Yes,
> libcurl and libcurl-dev are both installed, also 7.21.3.  Fedora Core 15,
> patched to current levels.
>
> Please help!
>
> cheers,
>
> Travis
> --
>
>


Re: Different maxAnalyzedChars value in solrconfig.xml

2011-11-16 Thread Koji Sekiguchi

(11/11/16 13:12), Shyam Bhaskaran wrote:

Hi,

Wanted to know whether we can set different maxAnalyzedChars values in the 
solrconfig.xml based on different fields.

Can someone tell me if this is possible at all; my requirement needs me to set 
different values for the maxAnalyzedChars parameter based on two different 
field types.

For example, if the field type has the value "xxx" then maxAnalyzedChars needs to be set to 
1MB, and if the value is "yyy" it needs to be set to 3MB. Let me know if 
this can be done and how to set it.


I don't think it is possible.

koji
--
Check out "Query Log Visualizer" for Apache Solr
http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
http://www.rondhuit.com/en/


Re: OutOfMemoryError when using query with sort

2011-11-16 Thread Benson Ba
Hi Hamid,

I also encountered the same OOM issue on a Windows 2003 (32-bit) server... but
with only 3 million articles stored in Solr. I would like to know your
configuration for handling so many records. 
Many thanks. 


Best Regards
Benson


--
View this message in context: 
http://lucene.472066.n3.nabble.com/OutOfMemoryError-when-using-query-with-sort-tp729437p3512224.html
Sent from the Solr - User mailing list archive at Nabble.com.


Rich document indexing

2011-11-16 Thread kumar8anuj
I am using Solr 3.4 and have configured my DataImportHandler to get some data
from MySQL as well as index some rich documents from disk. 

This is the part of db-data-config file where i am indexing Rich text
documents.



http://localhost/resumes-new/resumes${resume.dir}/${js_logins.id}/${resume.name}";
dataSource="ds-file" format="text">





But after some time I get the following error in my error log. It looks like
a missing-class error. Can anyone tell me which POI jar version would work
with Tika 0.6? Currently I have poi-3.7.jar. 

Error which i am getting is this  

SEVERE: Exception while processing: js_logins document :
SolrInputDocument[{id=id(1.0)={100984},
complete_mobile_number=complete_mobile_number(1.0)={+91 9600067575},
emailid=emailid(1.0)={vkry...@gmail.com}, full_name=full_name(1.0)={Venkat
Ryali}}]:org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.NoSuchMethodError:
org.apache.poi.xwpf.usermodel.XWPFParagraph.&lt;init&gt;(Lorg/openxmlformats/schemas/wordprocessingml/x2006/main/CTP;Lorg/apache/poi/xwpf/usermodel/XWPFDocument;)V
 
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
 
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
 
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
 
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268) 
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187) 
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
 
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427) 
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408) 
Caused by: java.lang.NoSuchMethodError:
org.apache.poi.xwpf.usermodel.XWPFParagraph.&lt;init&gt;(Lorg/openxmlformats/schemas/wordprocessingml/x2006/main/CTP;Lorg/apache/poi/xwpf/usermodel/XWPFDocument;)V
 
at
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator$MyXWPFParagraph.&lt;init&gt;(XWPFWordExtractorDecorator.java:163)
 
at
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator$MyXWPFParagraph.&lt;init&gt;(XWPFWordExtractorDecorator.java:161)
 
at
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTableContent(XWPFWordExtractorDecorator.java:140)
 
at
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:91)
 
at
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:69)
 
at
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:51) 
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120) 
at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101) 
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
 
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
 
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
 
... 7 more 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Rich-document-indexing-tp3512276p3512276.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can we have lucene regular and fastVectorHiglighter together in solr

2011-11-16 Thread Koji Sekiguchi

(11/11/16 18:58), Shyam Bhaskaran wrote:

Hi,

Can we use Lucene regular highlighter along with fastVectorHighlighter together 
in solrconfig.xml (solr) ?

-Shyam



Yes, you can. See the highlighting section in 
solr/example/solr/conf/solrconfig.xml for example.

koji
--
Check out "Query Log Visualizer" for Apache Solr
http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
http://www.rondhuit.com/en/
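As a hedged supplement to Koji's pointer above: the FastVectorHighlighter additionally requires term vectors on the highlighted field, and can be selected per request or per field while other fields fall back to the regular highlighter. A sketch, where the field name `content` is illustrative:

```xml
<!-- schema.xml sketch: FVH needs term vectors with positions and offsets -->
<field name="content" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

At query time, hl.useFastVectorHighlighter=true (or per field, f.content.hl.useFastVectorHighlighter=true) should then pick the FVH for that field.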


Re: Aggregated indexing of updating RSS feeds

2011-11-16 Thread sbarriba
All,
Can anyone advise how to stop the "deleteAll" event during a full import? 

As discussed above, using clean=false with Solr 3.4 still seems to trigger a
delete of all previously imported data. I want to aggregate the results of
multiple imports.

Thanks in advance.
S

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3512260.html
Sent from the Solr - User mailing list archive at Nabble.com.
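For reference, clean defaults to true for full-import, so it has to be passed explicitly on every request; a hedged sketch (the host and handler path are placeholders):

```
http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true
```

If earlier documents still disappear, it is worth checking the logs for a full-import request issued without clean=false (e.g. from a scheduler or browser bookmark), since that single request is enough to wipe the previously imported data.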


Re: Phrase between quotes with dismax edismax

2011-11-16 Thread Jean-Claude Dauphin
Thanks Erick for your quick answer.

I am using Solr 3.1

1) I have set the mm parameter to 0 and removed the categories from the
search, so the query is only for "chef de projet" and nothing else.
But the problem remains, i.e. searching for "chef de projet" gives no
results while searching for "chef projet" gives the right result.

Here is an excerpt from the test I made:

DISMAX query (q)=("chef de projet")

=The Parameters=

*queryResponse*=[{responseHeader={status=0,QTime=157,

params={facet=true,

f.createDate.facet.date.start=NOW/DAY-6DAYS,tie=0.1,

facet.limit=4,

f.location.facet.limit=3,

*q.alt*=*:*,

facet.date.other=all,

hl=true,version=2,

*bq*=[categoryPayloads:category1071^1,
categoryPayloads:category10055078^1, categoryPayloads:category10055405^1],

fl=*,score,

debugQuery=true,

facet.field=[soldProvisions, contractTypeText, nafCodeText, createDate,
wage, keywords, labelLocation, jobCode, organizationName,
requiredExperienceLevelText],

*qs*=3,

qt=edismax,

facet.date.end=NOW/DAY,

*mm*=0,

facet.mincount=1,

facet.date=createDate,

*qf*= title^4.0 formattedDescription^2.0 nafCodeText^2.0 jobCodeText^3.0
organizationName^1.0 keywords^3.0 location^1.0 labelLocation^1.0
categoryPayloads^1.0,

hl.fl=title,

wt=javabin,

rows=20,

start=0,

*q*=("chef de projet"),

facet.date.gap=+1DAY,

*stopwords*=false,

*ps*=3}},

The Solr Response
response={numFound=0

Debug Info

debug={

*rawquerystring*=("chef de projet"),

*querystring*=("chef de projet"),

*---
*

*parsedquery*=

+*DisjunctionMaxQuery*((title:"chef de projet"~3^4.0 | keywords:chef de
projet^3.0 | organizationName:chef de projet | location:chef de projet |
formattedDescription:"chef de projet"~3^2.0 | nafCodeText:chef de
projet^2.0 | jobCodeText:chef de projet^3.0 | categoryPayloads:"chef de
projet"~3 | labelLocation:chef de projet)~0.1)
*DisjunctionMaxQuery*((title:"(("chef
chef) de (projet") projet)"~3^4.0)~0.1) categoryPayloads:category1071
categoryPayloads:category10055078 categoryPayloads:category10055405,

*---*

*parsedquery_toString*=+(title:"chef de projet"~3^4.0 | keywords:chef de
projet^3.0 | organizationName:chef de projet | location:chef de projet |
formattedDescription:"chef de projet"~3^2.0 | nafCodeText:chef de
projet^2.0 | jobCodeText:chef de projet^3.0 | categoryPayloads:"chef de
projet"~3 | labelLocation:chef de projet)~0.1 (title:"(("chef chef) de
(projet") projet)"~3^4.0)~0.1 categoryPayloads:category1071
categoryPayloads:category10055078 categoryPayloads:category10055405,



explain={},

QParser=ExtendedDismaxQParser,altquerystring=null,

*boost_queries*=[categoryPayloads:category1071^1,
categoryPayloads:category10055078^1, categoryPayloads:category10055405^1],

*parsed_boost_queries*=[categoryPayloads:category1071,
categoryPayloads:category10055078, categoryPayloads:category10055405],
boostfuncs=null,

2) I tried to remove the bq values but no changes:

*querystring*=("chef de projet"),

*parsedquery*=+*DisjunctionMaxQuery*((title:"chef de projet"~3^4.0 |
keywords:chef de projet^3.0 | organizationName:chef de projet |
location:chef de projet | formattedDescription:"chef de projet"~3^2.0 |
nafCodeText:chef de projet^2.0 | jobCodeText:chef de projet^3.0 |
categoryPayloads:"chef de projet"~3 | labelLocation:chef de projet)~0.1) *
DisjunctionMaxQuery*((title:"(("chef chef) de (projet")
projet)"~3^4.0)~0.1),
*parsedquery_toString*=+(title:"chef de projet"~3^4.0 | keywords:chef de
projet^3.0 | organizationName:chef de projet | location:chef de projet |
formattedDescription:"chef de projet"~3^2.0 | nafCodeText:chef de
projet^2.0 | jobCodeText:chef de projet^3.0 | categoryPayloads:"chef de
projet"~3 | labelLocation:chef de projet)~0.1 (title:"(("chef chef) de
(projet") projet)"~3^4.0)~0.1,

3) and the query which works

debug={

*rawquerystring*=("chef  projet"),

*querystring*=("chef  projet"),

*parsedquery*=+*DisjunctionMaxQuery*((title:"chef projet"~3^4.0 |
keywords:chef  projet^3.0 | organizationName:chef  projet |
location:chef  projet
| formattedDescription:"chef projet"~3^2.0 | nafCodeText:chef  projet^2.0 |
jobCodeText:chef  projet^3.0 | categoryPayloads:"chef projet"~3 |
labelLocation:chef  projet)~0.1) *DisjunctionMaxQuery*((title:"(("chef
chef) (projet") projet)"~3^4.0)~0.1),

*parsedquery_toString*=+(title:"chef projet"~3^4.0 | keywords:chef  projet^3.0
| organizationName:chef  projet | location:chef  projet |
formattedDescription:"chef projet"~3^2.0 | nafCodeText:chef  projet^2.0 |
jobCodeText:chef  projet^3.0 | categoryPayloads:"chef projet"~3 |
labelLocation:chef  projet)~0.1 (title:"(("chef chef) (projet")
projet)"~3^4.0)~0.1,

explain={23715081=

14.832518 = (MATCH) sum of:

I really don't know how to solve this issue and would appreciate any help

Best wishes

Jean-Claude
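One avenue frequently suggested for phrase queries that fail only when a stopword ("de") is present is to make stopword handling identical across every field listed in qf, including the position-increment setting; a hedged schema sketch (the stopword file name is illustrative):

```xml
<!-- schema.xml sketch: use the same stop filter settings in the index and
     query analyzers of every qf field, so "de" produces the same position
     gaps everywhere -->
<filter class="solr.StopFilterFactory" words="stopwords_fr.txt"
        ignoreCase="true" enablePositionIncrements="true"/>
```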


On Tue, Nov 15, 2011 at 9:28 PM, Erick Erickson wrote:

> The query re-writing is...er...interesting, and I'll skip th