Dismax, Sharding and Elevation

2011-01-13 Thread Oliver Marahrens
Hi all,

I have discovered a strange thing with Dismax and Elevation and hope
someone can enlighten me on what to do.

Whenever I search for something using the elevation request handler, the
hits come from a normal Lucene query (with elevated results if the search
term is defined in elevation.xml). Elevation works, but only with dismax.
Whenever I search using the dismax handler with elevated terms,
elevation only works if I turn off sharding. Using shards results in
an exception (IndexOutOfBoundsException); the complete message is listed below.

Is this a bug, or did I miss something to switch in the configuration? I also
tried to add
<str name="defType">dismax</str>
to the elevation request handler in solrconfig.xml, but that didn't help.
The elevator component is integrated into the dismax search handler in
<arr name="last-components">.

Any hints appreciated!

Thank you in advance
Oliver


My Solr configuration for the elevation request handler and elevation search
component looks like this:

  <searchComponent name="elevator" class="solr.QueryElevationComponent">
    <!-- pick a fieldType to analyze queries -->
    <str name="queryFieldType">text</str>
    <str name="config-file">elevate.xml</str>
  </searchComponent>

  <requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="last-components">
      <str>elevator</str>
      <str>debug</str>
    </arr>
  </requestHandler>

The complete exception message I get from searching with dismax,
elevation and sharding:

java.lang.IndexOutOfBoundsException: Index: 1, Size: 0
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.solr.common.util.NamedList.getVal(NamedList.java:137)
at 
org.apache.solr.handler.component.ShardFieldSortedHitQueue$ShardComparator.sortVal(ShardDoc.java:195)
at 
org.apache.solr.handler.component.ShardFieldSortedHitQueue$2.compare(ShardDoc.java:233)
at 
org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:134)
at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:270)
at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:129)
at 
org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:171)
at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:156)
at 
org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:445)
at 
org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:298)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1088)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:360)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:729)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:206)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:505)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:829)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:380)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:395)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:488)


-- 
Oliver Marahrens
TU Hamburg-Harburg / Universitätsbibliothek / Digitale Dienste
Denickestr. 22
21071 Hamburg - Harburg
Tel.+49 (0)40 / 428 78 - 32 91
eMail   o.marahr...@tu-harburg.de
--
GPG/PGP-Schlüssel: 
http://www.tub.tu-harburg.de/keys/Oliver_Marahrens_pub.asc
--
Projekt DISCUS http://discus.tu-harburg.de
Projekt TUBdok http://doku.b.tu-harburg.de



Re: spell suggest response

2011-01-13 Thread Grijesh.singh

I have done similar work before by combining the spellcheck component with
auto-suggest.

Autosuggest provides the words starting with the query term, and spellcheck
returns words similar to it.
I combined both sets of suggestions into a single list for display.
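
A minimal sketch of the merge step, in plain Java, assuming the two lists have
already been pulled out of the autosuggest and spellcheck responses:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class SuggestionMerger {
    // Autosuggest hits first, then spellcheck hits, with duplicates dropped
    // and the original order preserved.
    public static List<String> merge(List<String> autosuggest, List<String> spellcheck) {
        LinkedHashSet<String> combined = new LinkedHashSet<String>(autosuggest);
        combined.addAll(spellcheck);
        return new ArrayList<String>(combined);
    }
}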

-
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/spell-suggest-response-tp2233409p2247479.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Dismax, Sharding and Elevation

2011-01-13 Thread Grijesh.singh

As far as I have seen in the code for QueryElevationComponent, there is no support for
Distributed Search, i.e. query elevation does not work with shards.

-
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-Sharding-and-Elevation-tp2247156p2247522.html
Sent from the Solr - User mailing list archive at Nabble.com.


range queries in solr

2011-01-13 Thread ur lops
Hi,
 I am sorry to ask this silly question, but I could not find the
documentation about this and I am very new to Lucene/Solr. I want to run a
range query on one of the multivalued fields, e.g. I have a point, say [10,20],
which is the point of intersection of the diagonals of a rectangle. Now I
want to run a Solr query which gives me all the points within
the rectangle whose vertices are at { [8,20], [12,20], [10,18], [10,22] }.
Any help would be highly appreciated.
Thanks
Urlop


Solr + Hadoop

2011-01-13 Thread Joan
Hi,

I'm trying to build a Solr index with MapReduce (Hadoop) and I'm using
https://issues.apache.org/jira/browse/SOLR-1301, but I have a problem with
the Hadoop version and this patch.

When I compile this patch against Hadoop 0.21.0 I don't have any
problem, but when I try to run my job in Hadoop (0.21.0) I get an
error like this:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found
interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.solr.hadoop.SolrOutputFormat.checkOutputSpecs(SolrOutputFormat.java:147)
    at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:373)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:334)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:960)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:976)



I tried to override the following method:

  public void checkOutputSpecs(JobContext job) throws IOException {
    super.checkOutputSpecs(job);
    if (job.getConfiguration().get(SETUP_OK) == null) {
      throw new IOException("Solr home cache not set up!");
    }
  }

with

  public void checkOutputSpecs(Job job) throws IOException {
    super.checkOutputSpecs(job);
    if (job.getConfiguration().get(SETUP_OK) == null) {
      throw new IOException("Solr home cache not set up!");
    }
  }

but I continue to receive errors:

java.lang.RuntimeException: java.lang.ClassNotFoundException:
org.apache.solr.hadoop.SolrOutputFormat
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1128)
    at org.apache.hadoop.mapreduce.task.JobContextImpl.getOutputFormatClass(JobContextImpl.java:203)
    at org.apache.hadoop.mapred.Task.initialize(Task.java:487)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:311)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)



Is anyone using this patch with Hadoop version 0.21.0? Can someone
help me?

Thanks,

Joan


Re: Question on deleting all rows for an index

2011-01-13 Thread kenf_nc

If this is a one-time cleanup, not something you need to do programmatically,
you could delete the index directory ( solrDir/data/index ). In my case I
have to stop Tomcat, delete .\index and restart Tomcat. It is very fast and
starts me out with a fresh, empty index. I noticed you are multi-core and I'm
not, so this could be bogus information for you... but I thought I'd toss it
out just in case.
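
If you do end up needing a programmatic route, a delete-by-query posted to the
/update handler should also empty the index (a sketch, assuming the standard
XML update handler; not something I've tested on a multi-core setup):

<delete><query>*:*</query></delete>
<commit/>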
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-on-deleting-all-rows-for-an-index-tp2246726p2248332.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: basic document crud in an index

2011-01-13 Thread kenf_nc

A/ You have to update all the fields, if you leave one off, it won't be in
the document anymore. I have my 'persisted' data stored outside of Solr, so
on update I get the stored data, modify it and update Solr with every field
(even if only one changed). You could also do a Query/Modify/Update directly in
Solr, just remember to send all fields in the update. There isn't (in 1.4
anyway) a way to update specific fields only.

B/ When you update, it is my understanding that, yes, the old doc is still there,
marked deleted, and a new doc is in its place. You can't get to the old one, however,
and it will go away at the next Optimize. I've never used it, but when you
Commit you can send an optional parameter 'expungeDeletes' that should
remove deleted docs as well.

C/ Not that I'm aware of

D/ don't know

E/ That is my understanding, but I'm admittedly a little weak on that part.
I just have a job that runs in the middle of the night and runs Optimize
once each night, I don't dig deeper than that into what goes on.
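
As an aside on the 'expungeDeletes' parameter mentioned in B/, it goes on the
commit itself; something like this posted to /update should do it (a sketch I
haven't verified myself):

<commit expungeDeletes="true"/>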
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/basic-document-crud-in-an-index-tp2246793p2248422.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr boolean operators

2011-01-13 Thread Xavier Schepler

Hi,

with the Lucene query syntax, is :

a AND (a OR b)

equivalent to :

a

(absorption)

?


Re: basic document crud in an index

2011-01-13 Thread Markus Jelsma
To fill the gaps:

b. the old version remains on disk but is flagged for deletion
d. optimize equals merging, the difference is how many segments come out
e. yes

On Thursday 13 January 2011 15:21:54 kenf_nc wrote:
 A/ You have to update all the fields, if you leave one off, it won't be in
 the document anymore. I have my 'persisted' data stored outside of Solr, so
 on update I get the stored data, modify it and update Solr with every field
 (even if one changed). You could also do a Query/Modify/Update directly in
 Solr, just remember to send all fields in the update. There isn't (in 1.4
 anyway) a way to update specific fields only.
 
 B/ When you update, it is my understanding that, yes, the old doc is there
 deleted and a new doc is in place. You can't get to the old one however and
 it will go away at the next Optimize. I've never used it, but when you
 Commit you can send an optional parameter 'expungeDeletes' that should
 remove deleted docs as well.
 
 C/ Not that I'm aware of
 
 D/ don't know
 
 E/ That is my understanding, but I'm admittedly a little weak on that part.
 I just have a job that runs in the middle of the night and runs Optimize
 once each night, I don't dig deeper than that into what goes on.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Solr boolean operators

2011-01-13 Thread dante stroe
To my understanding: in terms of the results that will be matched by your
query, it's the same. In terms of the scores of the results, no,
since, if you are using the first query, the documents that match both
the "a" and the "b" terms will score higher than the ones matching just the
"a" term.

On Thu, Jan 13, 2011 at 3:29 PM, Xavier Schepler 
xavier.schep...@sciences-po.fr wrote:

 Hi,

 with the Lucene query syntax, is :

 a AND (a OR b)

 equivalent to :

 a

 (absorption)

 ?



Re: Solr boolean operators

2011-01-13 Thread Xavier SCHEPLER
Ok, thanks.
That's what I expected :D

 
 From: dante stroe dante.st...@gmail.com
 Sent: Thu Jan 13 15:56:33 CET 2011
 To: solr-user@lucene.apache.org
 Subject: Re: Solr boolean operators
 
 
 To my understanding: in terms of the results that will be matched by your
 query ... it's the same. In terms of the score of the results  no,
 since, if you are using the first query, the documents that will match both
 the a and the b terms, will match higher then the ones matching just the
 a term.
 
 On Thu, Jan 13, 2011 at 3:29 PM, Xavier Schepler 
 xavier.schep...@sciences-po.fr wrote:
 
  Hi,
 
  with the Lucene query syntax, is :
 
  a AND (a OR b)
 
  equivalent to :
 
  a
 
  (absorption)
 
  ?
 


--
All emails sent from the Sciences Po mail system must comply
with its conditions of use.
To consult them, go to
http://www.ressources-numeriques.sciences-po.fr/confidentialite_courriel.htm


Get nearby words?

2011-01-13 Thread darren
Hi,
  Is there a way to get the relevant nearby words in the entire index
given a single word?

I want to know all the relevance ranked words before and after the queried
word.

thanks for any tips.
Darren


Re: Multi-word exact keyword case-insensitive search suggestions

2011-01-13 Thread Adam Estrada
Hi,

the following seems to work pretty well.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="4" outputUnigrams="true"
            outputUnigramIfNoNgram="false" />
  </analyzer>
</fieldType>

<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of
     words on case-change, alpha numeric boundaries, and non-alphanumeric chars,
     so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
     Synonyms and stopwords are customized by external files, and stemming is enabled.
     The attribute autoGeneratePhraseQueries="true" (the default) causes words that
     get split to form phrase queries. For example, WordDelimiterFilter splitting
     text:pdp-11 will cause the parser to generate text:"pdp 11" rather than
     (text:PDP OR text:11).
     NOTE: autoGeneratePhraseQueries="true" tends to not work well for
     non whitespace delimited languages.
-->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<copyField source="cat" dest="text"/>
<copyField source="subject" dest="text"/>
<copyField source="summary" dest="text"/>
<copyField source="cause" dest="text"/>
<copyField source="status" dest="text"/>
<copyField source="urgency" dest="text"/>

I ingest the source fields as text_ws (I know I've changed it a bit) and
then copy the field to text. This seems to do what you are asking for.

Adam

On Thu, Jan 13, 2011 at 12:05 AM, Chamnap Chhorn chamnapchh...@gmail.com wrote:

 Hi all,

 I'm just stuck with exact keyword for several days. Hope you guys could
 help
 me. Here is the scenario:

   1. It need to be matched with multi-word keyword and case insensitive
   2. Partial word or single word matching with this field is not allowed

 I want to know the field type definition for this field and sample solr
 query. I need to combine this search with my full text search which uses
 dismax query.

 Thanks
 --
 Chhorn Chamnap
 http://chamnapchhorn.blogspot.com/



Re: segment gets corrupted (after background merge ?)

2011-01-13 Thread Stéphane Delprat

I understand less and less what is happening to my Solr.

I did a checkIndex (without -fix) and there was an error...

So I did another checkIndex with -fix, and then the error was gone. The
segment was alright.



During checkIndex I do not shut down the Solr server, I just make sure
no client connects to the server.


Should I shut down the Solr server during checkIndex?
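
(For reference, "checkIndex" here means the stock Lucene tool, run roughly like
this against the Solr data directory; the jar name and index path below are
placeholders:)

java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index
java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index -fix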



first checkIndex :

  4 of 17: name=_phe docCount=264148
compound=false
hasProx=true
numFiles=9
size (MB)=928.977
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}

has deletions [delFileName=_phe_p3.del]
test: open reader.OK [44824 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num 
docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 
0 + num docs deleted 0
at 
org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
test: stored fields...OK [7206878 total field count; avg 32.86 
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]

FAILED
WARNING: fixIndex() would remove reference to this segment; full 
exception:

java.lang.RuntimeException: Term Index test failed
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)


a few minutes later:

  4 of 18: name=_phe docCount=264148
compound=false
hasProx=true
numFiles=9
size (MB)=928.977
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_phe_p4.del]
test: open reader.OK [44828 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs 
pairs; 28919124 tokens]
test: stored fields...OK [7206764 total field count; avg 32.86 
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]



On 12/01/2011 16:50, Michael McCandless wrote:

Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted 0?

It looks like new deletions were flushed against the segment (del file
changed from _ncc_22s.del to _ncc_24f.del).

Are you hitting any exceptions during indexing?

Mike

On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
stephane.delp...@blogspirit.com  wrote:

I got another corruption.

It sure looks like it's the same type of error. (on a different field)

It's also not linked to a merge, since the segment size did not change.


*** good segment :

  1 of 9: name=_ncc docCount=1841685
compound=false
hasProx=true
numFiles=9
size (MB)=6,683.447
diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
01:30:55, source=merge, os.arch=amd64, java.version=1.6.0
_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_ncc_22s.del]
test: open reader.OK [275881 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs;
204561440 tokens]
test: stored fields...OK [45511958 total field count; avg 29.066
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq
vector fields per doc]


a few hours later:

*** broken segment :

  1 of 17: name=_ncc docCount=1841685
compound=false
hasProx=true
numFiles=9
size (MB)=6,683.447
diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
01:30:55, source=merge, os.arch=amd64, java.version=1.6.0
_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_ncc_24f.del]
test: open reader.OK [278167 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num
docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs seen
0 + num docs deleted 0
at

Re: StopFilterFactory and qf containing some fields that use it and some that do not

2011-01-13 Thread Jonathan Rochkind
It's a known 'issue' in dismax (really an inherent part of dismax's 
design with no clear way to do anything about it) that qf over fields 
with different stop word definitions will produce odd results for a 
query with a stopword.


Here's my understanding of what's going on: 
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/


On 1/12/2011 6:48 PM, Markus Jelsma wrote:

Here's another thread on the subject:
http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-
td493483.html

And slightly off topic: you'd also might want to look at using common grams,
they are really useful for phrase queries that contain stopwords.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory



Here is what debug says each of these queries parse to:

1. q=life&defType=edismax&qf=Title  ... returns 277,635 results
2. q=the life&defType=edismax&qf=Title ... returns 277,635 results
3. q=life&defType=edismax&qf=Title Contributor  ... returns 277,635
4. q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

1. +DisjunctionMaxQuery((Title:life))
2. +((DisjunctionMaxQuery((Title:life)))~1)
3. +DisjunctionMaxQuery((CTBR_SEARCH:life | Title:life))
4. +((DisjunctionMaxQuery((Contributor:the))
DisjunctionMaxQuery((Contributor:life | Title:life)))~2)

I see what's going on here.  Because "the" is a stop word for Title, it
gets removed from the first part of the expression.  This means that
Contributor is required to contain "the".  dismax does the same thing
too.  I guess I should have run debug before asking the mail list!

It looks like the only workarounds I have is to either filter out the
stopwords in the client when this happens, or enable stop words for all
the fields that are used in qf with stopword-enabled fields.
Unless...someone has a better idea??

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, January 12, 2011 4:44 PM
To: solr-user@lucene.apache.org
Cc: Jayendra Patil
Subject: Re: StopFilterFactory and qf containing some fields that use it
and some that do not


Have used edismax and Stopword filters as well. But usually use the fq
parameter e.g. fq=title:"the life" and never had any issues.

That is because filter queries are not relevant for the mm parameter which
is being used for the main query.


Can you turn on the debugQuery and check whats the Query formed for all
the combinations you mentioned.

Regards,
Jayendra

On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James

james.d...@ingrambook.com wrote:

I'm running into a problem with StopFilterFactory in conjunction with
(e)dismax queries that have a mix of fields, only some of which use
StopFilterFactory.  It seems that if even 1 field on the qf parameter
does not use StopFilterFactory, then stop words are not removed when
searching any fields.  Here's an example of what I mean:

- I have 2 fields indexed:
Title is textStemmed, which includes StopFilterFactory (see
below). Contributor is textSimple, which does not include
StopFilterFactory

(see below).
- "The" is a stop word in stopwords.txt
- q=life&defType=edismax&qf=Title  ... returns 277,635 results
- q=the life&defType=edismax&qf=Title ... returns 277,635 results
- q=life&defType=edismax&qf=Title Contributor  ... returns 277,635
results - q=the life&defType=edismax&qf=Title Contributor ... returns 0
results

It seems as if the stop words are not being stripped from the query
because qf contains a field that doesn't use StopFilterFactory.  I
did testing with combining Stemmed fields with not Stemmed fields in
qf and it seems as if stemming gets applied regardless.  But stop
words do not.

Does anyone have ideas on what is going on?  Is this a feature or
possibly a bug?  Any known workarounds?  Any advice is appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

<fieldType name="textSimple" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="textStemmed" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="0" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>

Re: Tuning StatsComponent

2011-01-13 Thread Johannes Goll
What field type do you recommend for a float stats.field for optimal Solr
1.4.1 StatsComponent performance?

float, pfloat or tfloat?

Do you recommend indexing the field?


2011/1/12 stockii st...@shopgate.com


 my field Type  is double maybe sint is better ? but i need double ...
 =(
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Tuning-StatsComponent-tp2225809p2241903.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: StopFilterFactory and qf containing some fields that use it and some that do not

2011-01-13 Thread Dyer, James
I appreciate the reply and blog posting.  For now, I just enabled stopwords for 
all the fields on qf.  We have a very short list anyhow, and our legacy search 
engine didn't even allow field-by-field configuration (stopwords are global on 
that system).

I do wonder... what if (e)dismax had a flag you could set that would tell it 
that if any analyzer removed a term, then that term would become optional for 
any fields in which it remained?  I'm not sure what the development effort would 
be, but perhaps it would be a nice way to circumvent this problem in a future 
release...
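
(On the common-grams suggestion from elsewhere in the thread, I gather the
filter line in the analyzer chain would look something like this; untested on
my end:)

<filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>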

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Thursday, January 13, 2011 9:54 AM
To: solr-user@lucene.apache.org; markus.jel...@openindex.io
Cc: Dyer, James
Subject: Re: StopFilterFactory and qf containing some fields that use it and 
some that do not

It's a known 'issue' in dismax, (really an inherent part of dismax's 
design with no clear way to do anything about it), that qf over fields 
with different stop word definitions will produce odd results for a 
query with a stopword.

Here's my understanding of what's going on: 
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/

On 1/12/2011 6:48 PM, Markus Jelsma wrote:
 Here's another thread on the subject:
 http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-
 td493483.html

 And slightly off topic: you'd also might want to look at using common grams,
 they are really useful for phrase queries that contain stopwords.

 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory


 Here is what debug says each of these queries parse to:

 1. q=life&defType=edismax&qf=Title  ... returns 277,635 results
 2. q=the life&defType=edismax&qf=Title ... returns 277,635 results
 3. q=life&defType=edismax&qf=Title Contributor  ... returns 277,635
 4. q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

 1. +DisjunctionMaxQuery((Title:life))
 2. +((DisjunctionMaxQuery((Title:life)))~1)
 3. +DisjunctionMaxQuery((CTBR_SEARCH:life | Title:life))
 4. +((DisjunctionMaxQuery((Contributor:the))
 DisjunctionMaxQuery((Contributor:life | Title:life)))~2)

 I see what's going on here.  Because "the" is a stop word for Title, it
 gets removed from the first part of the expression.  This means that
 Contributor is required to contain "the".  dismax does the same thing
 too.  I guess I should have run debug before asking the mail list!

 It looks like the only workarounds I have is to either filter out the
 stopwords in the client when this happens, or enable stop words for all
 the fields that are used in qf with stopword-enabled fields.
 Unless...someone has a better idea??

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311

 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: Wednesday, January 12, 2011 4:44 PM
 To: solr-user@lucene.apache.org
 Cc: Jayendra Patil
 Subject: Re: StopFilterFactory and qf containing some fields that use it
 and some that do not

 Have used edismax and Stopword filters as well. But usually use the fq
 parameter e.g. fq=title:"the life" and never had any issues.
 That is because filter queries are not relevant for the mm parameter which
 is being used for the main query.

 Can you turn on the debugQuery and check whats the Query formed for all
 the combinations you mentioned.

 Regards,
 Jayendra

 On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James
 james.d...@ingrambook.com wrote:
 I'm running into a problem with StopFilterFactory in conjunction with
 (e)dismax queries that have a mix of fields, only some of which use
 StopFilterFactory.  It seems that if even 1 field on the qf parameter
 does not use StopFilterFactory, then stop words are not removed when
 searching any fields.  Here's an example of what I mean:

 - I have 2 fields indexed:
 Title is textStemmed, which includes StopFilterFactory (see
 below). Contributor is textSimple, which does not include
 StopFilterFactory

 (see below).
 - "The" is a stop word in stopwords.txt
 - q=life&defType=edismax&qf=Title  ... returns 277,635 results
 - q=the life&defType=edismax&qf=Title ... returns 277,635 results
 - q=life&defType=edismax&qf=Title Contributor  ... returns 277,635
 results - q=the life&defType=edismax&qf=Title Contributor ... returns 0
 results

 It seems as if the stop words are not being stripped from the query
 because qf contains a field that doesn't use StopFilterFactory.  I
 did testing with combining Stemmed fields with not Stemmed fields in
 qf and it seems as if stemming gets applied regardless.  But stop
 words do not.

 Does anyone have ideas on what is going on?  Is this a feature or
 possibly a bug?  Any known workarounds?  Any advice is appreciated.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 

Re: Improving Solr performance

2011-01-13 Thread supersoft

On the one hand, I found those comments about the reasons for sharding really
interesting. The documentation agrees with you about why to split an index into
several shards (big-index problems), but I don't find any explanation of the
drawbacks, such as an Access Control List. I guess there should be some,
and they can be critical in this design. Any example?

On the other hand, the performance problems. I have configured big caches
and I launch a test of simultaneous requests (with the same query) without
committing during the test. The caches are initially empty, and after the
test:

name: queryResultCache
stats:
lookups 1129
hits1120
hitratio0.99
inserts 16
evictions   0
size9
warmupTime  0
cumulative_lookups  1129
cumulative_hits 1120
cumulative_hitratio 0.99
cumulative_inserts  16
cumulative_evictions0

name: documentCache
stats:
lookups 6750
hits6440
hitratio0.95
inserts 310
evictions   0
size310
warmupTime  0
cumulative_lookups  6750
cumulative_hits 6440
cumulative_hitratio 0.95
cumulative_inserts  310
cumulative_evictions0

Although most of the queries are cache hits, the performance is still
dependent of the number of simultaneous queries:

1 simultaneous query: 3437 ms (cache fails)

2 simultaneous queries: 594, 954 ms

10 simultaneous queries: 1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500,
2938, 3000 ms

50 simultaneous queries: 1203, 1453, 1453, 1437, 1625, 1953, 5688, 12938,
14953, 16281, 15984, 16453, 15812, 16469, 16563, 16844, 17703, 16843, 17359,
16828, 18235, 18219, 18172, 18203, 17672, 17344, 17453, 18484, 18157, 18531,
18297, 18359, 18063, 18516, 18125, 17516, 18562, 18016, 18187, 18610, 18703,
18672, 17829, 18344, 18797, 18781, 18265, 18875, 18250, 18812

100 simultaneous queries: 1297, 1531, 1969, 2203, 2375, 2891, 3937, 4109,
4703, 4890, 5047, 5312, 5563, 6422, 6437, 7063, 7093, 7391, 7594, 7672,
8172, 8547, 8750, 8984, 9265, 9609, 9907, 10344, 11406, 11484, 11484, 11500,
11547, 11703, 11797, 11875, 11922, 12328, 12375, 12875, 12922, 13187, 13219,
13407, 13500, 13562, 13719, 13828, 13875, 14016, 14078, 14672, 15922, 16328,
16625, 16953, 17282, 18172, 18484, 18985, 20594, 20625, 20860, 21281, 21469,
21625, 21875, 21875, 22141, 22157, 22172, 23125, 23125, 23141, 23203, 23203,
23328, 24625, 24641, 24672, 24797, 24985, 25031, 25188, 25844, 25937, 26016,
26437, 26453, 26437, 26485, 28297, 28687, 31782, 31985, 31969, 32016, 32031,
32016, 32281 ms

Is this an expected situation? Is there any technique for being less
dependent on the number of simultaneous queries? (Due to economic reasons,
replication to more servers is not an option.)

Thanks in advance (and also thanks for previous comments)
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2249108.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Term frequency across multiple documents

2011-01-13 Thread Ahmet Arslan
So you are interested in collection frequency of words.

TermsComponent gives you document frequency of terms. You can modify it to give 
collection frequency info. http://search-lucene.com/m/of5Fn1PUOHU/
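
(For reference, a stock TermsComponent request, which reports document
frequencies rather than collection frequencies, looks roughly like this,
assuming the example /terms handler and a field named "text":)

http://localhost:8983/solr/terms?terms.fl=text&terms.limit=10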

--- On Wed, 1/12/11, Juan Grande juan.gra...@gmail.com wrote:

 From: Juan Grande juan.gra...@gmail.com
 Subject: Re: Term frequency across multiple documents
 To: solr-user@lucene.apache.org
 Date: Wednesday, January 12, 2011, 6:56 PM
 Maybe there is a better solution, but
 I think that you can solve this
 problem using facets. You will get the number of documents
 where each term
 appears. Also, you can filter a specific set of terms by
 entering a query
 like +field:term1 OR +field:term2 OR ..., or using the
 facet.query
 parameter.
 
 Regards,
 
 Juan Grande
 
 On Wed, Jan 12, 2011 at 11:08 AM, Aaron Bycoffe 
 abyco...@sunlightfoundation.com
 wrote:
 
  I'm attempting to calculate term frequency across
 multiple documents
  in Solr. I've been able to use TermVectorComponent to
 get this data on
  a per-document basis but have been unable to find a
 way to do it for
  multiple documents -- that is, get a list of terms
 appearing in the
  documents and how many times each one appears. I'd
 also like to be
  able to filter the list of terms to be able to see how
 many times a
  specific term appears, though this is less important.
 
  Is there a way to do this in Solr?
 
 
  Aaron
 
 


  


Adding a new site to existing solr configuration

2011-01-13 Thread PeterKerk

I still have the default Solr example config running on Jetty. I use Cygwin
to start my current site.

Now I already have fully configured one solr instance with these files:
\example\example-DIH\solr\db\conf\my-data-config.xml
\example\example-DIH\solr\db\conf\schema.xml
\example\example-DIH\solr\db\conf\solrconfig.xml

Now, I wish to add ANOTHER site to my already running sites. This site
of course has a different data-config, but the question is: what files
can/should I add to the already existing directories?

What I have now is that I just added the data-config:
\example\example-DIH\solr\db\conf\data-config-site2.xml

But should I change anything in the schema.xml/solrconfig.xml for this to work
and to be able to run both sites simultaneously with the same web server
instance?
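
(For what it's worth, if the second site can share the same schema, I believe
the usual route is a second DataImportHandler instance in solrconfig.xml
pointing at the new config file; the handler name here is just an example:)

<requestHandler name="/dataimport-site2"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config-site2.xml</str>
  </lst>
</requestHandler>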
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Adding-a-new-site-to-existing-solr-configuration-tp2249223p2249223.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Peter Karich
 take a look also into icu4j which is one of the contrib projects ...

 converting on the fly is not supported by Solr but should be relatively
 easy in Java.
 Also, scanning is relatively simple (accept only a range). Detection too:
 http://www.mozilla.org/projects/intl/chardet.html

 We've created an index from a number of different documents that are
 supplied by third parties. We want the index to only contain UTF-8
 encoded characters. I have a couple questions about this:

 1) Is there any way to be sure during indexing (by setting something
 in the solr configuration?) that the documents that we index will
 always be stored in utf-8? Can solr convert documents that need
 converting on the fly, or can solr reject documents containing illegal
 characters?

 2) Is there a way to scan the existing index to find any string
 containing non-utf8 characters? Or is there another way that I can
 discover if any crept into my index?




-- 
http://jetwick.com open twitter search



DataimportHandler development issue

2011-01-13 Thread Derek Werthmuller
We're just getting started with Solr and are very interested in using Solr
for search applications.

I've got the rss example working (1.4.1 didn't work out of the box, but we
figured it out and then found fixes in the svn). Anyway, we are learning how
to load rss and atom feeds into the Solr index. We are trying to
modify the rss-data-config.xml file so that we can import atom feeds also,
but for some reason they don't load. Here is what we have for the
configuration.

We've been using the DataImportHandler development console
(http://localhost:8983/solr/rss/admin/dataimport.jsp?handler=/rssimport) to
look at the status and the DocsNum, but only the rss feed works.
If we remove the whole slashdot rss entity, the atom example still doesn't
work. We've also tried creating a separate atom-data-config.xml file and adding
the proper entry to solrconfig.xml to support the extra dataimport. That
gave us the same results.
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">atom-data-config.xml</str>
    </lst>
  </lst>
  <str name="command">status</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2011-01-13 08:42:53</str>
    <str name="Total Documents Processed">0</str>
    <str name="Time taken ">0:0:0.519</str>
  </lst>
  <str name="WARNING">
    This response format is experimental. It is likely to change in the future.
  </str>
</response>


It's not clear why it's not working.  Advice?
Also, is this the best way to load data?  We intend to load several
thousand DocBook documents once we understand how this all works.  We stuck
with the rss/atom example since we didn't want to deal with schema changes
yet.
Thanks
Derek

example-DIH/solr/rss/conf/rss-data-config.xml (modified) source:

<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
    <entity name="slashdot"
            pk="link"
            url="http://twitter.com/statuses/user_timeline/existdb.rss"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            transformer="DateFormatTransformer">

      <field column="source" xpath="/rss/channel/title" commonField="true" />
      <field column="source-link" xpath="/rss/channel/link" commonField="true" />
      <field column="subject" xpath="/rss/channel/subject" commonField="true" />

      <field column="title" xpath="/rss/channel/item/title" />
      <field column="link" xpath="/rss/channel/item/link" />
      <field column="description" xpath="/rss/channel/item/description" />
      <field column="creator" xpath="/rss/channel/item/creator" />
      <field column="item-subject" xpath="/rss/channel/item/subject" />
      <field column="date" xpath="/rss/channel/item/date"
             dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
      <field column="slash-department" xpath="/rss/channel/item/department" />
      <field column="slash-section" xpath="/rss/channel/item/section" />
      <field column="slash-comments" xpath="/rss/channel/item/comments" />
    </entity>

    <entity name="twitter"
            pk="link"
            url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
            processor="XPathEntityProcessor"
            forEach="/feed | /feed/entry"
            transformer="DateFormatTransformer">

      <field column="source" xpath="/feed/title" commonField="true" />
      <field column="source-link" xpath="/feed/link" commonField="true" />
      <field column="subject" xpath="/feed/subtitle" commonField="true" />

      <field column="title" xpath="/feed/entry/title" />
      <field column="link" xpath="/feed/entry/link" />
      <field column="description" xpath="/feed/entry/description" />
      <field column="creator" xpath="/feed/entry/creator" />
      <field column="item-subject" xpath="/feed/entry/subject" />
      <field column="date" xpath="/rss/channel/item/date"
             dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
      <field column="slash-department" xpath="/feed/entry/department" />
      <field column="slash-section" xpath="/feed/entry/section" />
      <field column="slash-comments" xpath="/feed/entry/comments" />
    </entity>
  </document>
</dataConfig>



Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Jonathan Rochkind
Scanning for only 'valid' utf-8 is definitely not simple.  You can 
eliminate some obviously not valid utf-8 things by byte ranges, but you 
can't confirm valid utf-8 alone by byte ranges. There are some bytes 
that can only come after or before other certain bytes to be valid utf-8.


There is no good way to do what you're doing, once you've lost track of 
what encoding something is in, you are reduced to applying heuristics to 
text strings to guess what encoding it is meant to be.


There is no cheap way to do this to an entire Solr index; you're just 
going to have to fetch every single stored field (indexed fields are 
pretty much lost to you) and apply heuristic algorithms to it.  Keep in 
mind that Solr really probably shouldn't ever be used as your canonical 
_store_ of data; Solr isn't a 'store', it's an index.  So you really 
ought to have this stuff stored somewhere else if you want to be able to 
examine it or modify it like this, and just deal with that somewhere 
else.  This isn't really a Solr question at all, really, even if you are 
querying Solr on stored fields to try and guess their char encodings.


There are various packages of such heuristic algorithms to guess char 
encoding, I wouldn't try to write my own. icu4j might include such an 
algorithm, not sure.


On 1/13/2011 1:12 PM, Peter Karich wrote:

  take a look also into icu4j which is one of the contrib projects ...


converting on the fly is not supported by Solr but should be relative
easy in Java.
Also scanning is relative simple (accept only a range). Detection too:
http://www.mozilla.org/projects/intl/chardet.html


We've created an index from a number of different documents that are
supplied by third parties. We want the index to only contain UTF-8
encoded characters. I have a couple questions about this:

1) Is there any way to be sure during indexing (by setting something
in the solr configuration?) that the documents that we index will
always be stored in utf-8? Can solr convert documents that need
converting on the fly, or can solr reject documents containing illegal
characters?

2) Is there a way to scan the existing index to find any string
containing non-utf8 characters? Or is there another way that I can
discover if any crept into my index?





Re: segment gets corrupted (after background merge ?)

2011-01-13 Thread Michael McCandless
Generally it's not safe to run CheckIndex if a writer is also open on the index.

It's not safe because CheckIndex could hit FNFE's on opening files,
or, if you use -fix, CheckIndex will change the index out from under
your other IndexWriter (which will then cause other kinds of
corruption).

That said, I don't think the corruption that CheckIndex is detecting
in your index would be caused by having a writer open on the index.
Your first CheckIndex has a different deletes file (_phe_p3.del, with
44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
44828 deleted docs), so it must somehow have to do with that change.

One question: if you have a corrupt index, and run CheckIndex on it
several times in a row, does it always fail in the same way?  (Ie the
same term hits the below exception).

Is there any way I could get a copy of one of your corrupt cases?  I
can then dig...

Mike

On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
stephane.delp...@blogspirit.com wrote:
 I understand less and less what is happening to my solr.

 I did a checkIndex (without -fix) and there was an error...

 So a did another checkIndex with -fix and then the error was gone. The
 segment was alright


 During checkIndex I do not shut down the solr server, I just make sure no
 client connect to the server.

 Should I shut down the solr server during checkIndex ?



 first checkIndex :

  4 of 17: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
 java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p3.del]
    test: open reader.OK [44824 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs
 seen 0 + num docs deleted 0]
 java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 +
 num docs deleted 0
        at
 org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields...OK [7206878 total field count; avg 32.86 fields
 per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]
 FAILED
    WARNING: fixIndex() would remove reference to this segment; full
 exception:
 java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)


 a few minutes latter :

  4 of 18: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0
 _20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p4.del]
    test: open reader.OK [44828 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs;
 28919124 tokens]
    test: stored fields...OK [7206764 total field count; avg 32.86 fields
 per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]


 On 12/01/2011 16:50, Michael McCandless wrote:

 Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted
 0?

 It looks like new deletions were flushed against the segment (del file
 changed from _ncc_22s.del to _ncc_24f.del).

 Are you hitting any exceptions during indexing?

 Mike

 On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
 stephane.delp...@blogspirit.com  wrote:

 I got another corruption.

 It sure looks like it's the same type of error. (on a different field)

 It's also not linked to a merge, since the segment size did not change.


 *** good segment :

  1 of 9: name=_ncc docCount=1841685
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=6,683.447
    diagnostics = {optimize=false, mergeFactor=10,
 os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0
 _20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ncc_22s.del]
    test: open reader.OK [275881 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs
 pairs;
 204561440 tokens]
    test: stored fields...OK 

Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Michael McCandless
The tokens that Lucene sees (pre-4.0) are char[] based (ie, UTF16), so
the first place where invalid UTF8 is detected/corrected/etc. is
during your analysis process, which takes your raw content and
produces char[] based tokens.

Second, during indexing, Lucene ensures that the incoming char[]
tokens are valid UTF16.

If an invalid char sequence is hit, eg naked (unpaired) surrogate, or
invalid surrogate pair, the behavior is undefined, but, today, Lucene
will replace such invalid char/s with the unicode character U+FFFD, so
you could iterate all terms looking for that replacement char.
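
(A rough sketch of that term scan against a Lucene 2.9 index; the index path is
passed on the command line and everything else is just illustration:)

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class FindReplacementChars {
  public static void main(String[] args) throws Exception {
    // Open the index read-only and walk every indexed term,
    // printing any term that contains U+FFFD.
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])), true);
    TermEnum terms = reader.terms();
    while (terms.next()) {
      Term t = terms.term();
      if (t.text().indexOf('\uFFFD') >= 0) {
        System.out.println("suspect term: " + t.field() + ":" + t.text());
      }
    }
    terms.close();
    reader.close();
  }
}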

Mike

On Wed, Jan 12, 2011 at 5:16 PM, Paul p...@nines.org wrote:
 We've created an index from a number of different documents that are
 supplied by third parties. We want the index to only contain UTF-8
 encoded characters. I have a couple questions about this:

 1) Is there any way to be sure during indexing (by setting something
 in the solr configuration?) that the documents that we index will
 always be stored in utf-8? Can solr convert documents that need
 converting on the fly, or can solr reject documents containing illegal
 characters?

 2) Is there a way to scan the existing index to find any string
 containing non-utf8 characters? Or is there another way that I can
 discover if any crept into my index?



Variable datasources

2011-01-13 Thread tjpoe

I have several similar databases that I'd like to import from, 14 to be exact.
There is also a 15th database where I can get a listing of the 14 databases.

I'm trying to use a variable datasource, such as:

<datasource url="jdbc:mysql://localhost/${local.code}" name="content" />
<datasource url="jdbc:mysql://localhost/master" name="master" />

Then my import configuration looks like this:

<document name="items">
  <entity datasource="master" name="local" query="select code from locals"
          rootEntity="false">
    <entity datasource="content" name="item"
            query="select *, ${local.code} as code from item" />
  </entity>
</document>

The above configuration works, but the ${local.code} variable in the datasource
URL is ONLY resolved the first time. The outer entity loops through the correct
number of times, and I can see ${local.code} being resolved in each of the item
queries, but the data source never changes.

I also tried creating a datasource for each local and then using a variable
datasource name in the entity, such as:

<datasource url="jdbc:mysql://localhost/aaa" name="content_aaa" />
<datasource url="jdbc:mysql://localhost/bbb" name="content_bbb" />
<datasource url="jdbc:mysql://localhost/ccc" name="content_ccc" />
<datasource url="jdbc:mysql://localhost/master" name="master" />

and then the document as:

<document name="items">
  <entity datasource="master" name="local" query="select code from locals"
          rootEntity="false">
    <entity datasource="content_${local.code}" name="item"
            query="select *, ${local.code} as code from item" />
  </entity>
</document>

but the ${local.code} variable is not resolved and it attempts to connect to
the literal source "content_${local.code}".

Any ideas how I can get all of the items imported for all of the locals at
once?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Variable-datasources-tp2249568p2249568.html
Sent from the Solr - User mailing list archive at Nabble.com.


start value in queries zero or one based?

2011-01-13 Thread Dennis Gearon
Do I even need a body for this message? ;-)

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Robert Muir
On Thu, Jan 13, 2011 at 2:05 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 There are various packages of such heuristic algorithms to guess char
 encoding, I wouldn't try to write my own. icu4j might include such an
 algorithm, not sure.


it does: 
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
this takes a sample of the file and makes a guess.
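
(A minimal sketch of that ICU4J detection call, assuming the raw bytes of the
document are already in hand; the variable names are made up:)

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

// Best-effort guess at the charset of a byte buffer.
CharsetDetector detector = new CharsetDetector();
detector.setText(documentBytes);   // byte[] with the raw document
CharsetMatch match = detector.detect();
System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");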

also, in general keep in mind that java CharsetDecoders tend to
silently replace or skip illegal chars, rather than throw exceptions.

If you want to instead be paranoid about these things, instead of
opening InputStreamReader with Charset,
open it with something like
charset.newDecoder().onMalformedInput(CodingErrorAction.REPORT).onUnmappableCharacter(CodingErrorAction.REPORT)

Then if the decoder ends up in some illegal state/byte sequence,
instead of silently replacing with U+FFFD, it will throw an exception.
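
(Put together, a strict reader looks roughly like this; "in" is a hypothetical
InputStream for the document being checked:)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Reading from this reader throws MalformedInputException /
// UnmappableCharacterException on the first invalid byte sequence,
// instead of silently substituting U+FFFD.
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT);
Reader reader = new BufferedReader(new InputStreamReader(in, decoder));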
Of course as Jonathan says, you cannot confirm that something is UTF-8.

But many times, you can confirm its definitely not: see
https://issues.apache.org/jira/browse/SOLR-2003 for an example
practical use of this, we throw
an exception if we can detect that your stopwords or synonyms file is
definitely wrongly-encoded.


Re: start value in queries zero or one based?

2011-01-13 Thread Walter Underwood
On Jan 13, 2011, at 1:28 PM, Dennis Gearon wrote:

 Do I even need a body for this message? ;-)
 
 Dennis Gearon

Are you asking is it or should it be? If the latter, we can also discuss 
Emacs and vi.

wunder
--
Walter Underwood
K6WRU



Re: Solr + Hadoop

2011-01-13 Thread Em

Hi Joan,

I am not sure whether it applies, but are you really using Solr 1.4 (not
1.4.1), and were you also using the Hadoop jars provided by this patch (0.20.1,
not 0.21.0)?
I ask because I had some other issues with other classes that were related
to different package definitions etc. In short: some import organization
failed and my IDE did not notice that when I built the files.

However, this is just a guess. 

Regards
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Hadoop-tp2247856p2249935.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: start value in queries zero or one based?

2011-01-13 Thread Markus Jelsma
Perhaps it would be more useful to RTFM instead of messing around on the 
mailing list: http://wiki.apache.org/solr/CommonQueryParameters#start

Please, read every wiki page you can find and write notes.
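
(For the record, start is zero-based, so paging through results looks like:

http://localhost:8983/solr/select?q=*:*&start=0&rows=10
http://localhost:8983/solr/select?q=*:*&start=10&rows=10

for the first and second page of ten.)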

 Do I even need a body for this message? ;-)
 
  Dennis Gearon
 
 
 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
 better idea to learn from others’ mistakes, so you do not have to make
 them yourself. from
 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
 
 
 EARTH has a Right To Life,
 otherwise we all die.


RE: start value in queries zero or one based?

2011-01-13 Thread Steven A Rowe
 Please, read every wiki page you can find and write notes.

NO!!!  Once you start down this road, there is no turning back!  Soon you will 
feel the need to turn your notes into a new wiki page or a blog post, and 
people will read those and write notes, and the process will repeat, ad 
infinitum: a Vicious Circle of Writing (VCoW).   Please, please, please: Don't 
have a VCoW, man!


Re: start value in queries zero or one based?

2011-01-13 Thread Dennis Gearon
I'm migrating to CTO/CEO status in life due to building a small company. I find
I don't have much time for theory; I work with what is.

So: what is it, not what should it be?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Walter Underwood wun...@wunderwood.org
To: solr-user@lucene.apache.org
Sent: Thu, January 13, 2011 1:38:26 PM
Subject: Re: start value in queries zero or one based?

On Jan 13, 2011, at 1:28 PM, Dennis Gearon wrote:

 Do I even need a body for this message? ;-)
 
 Dennis Gearon

Are you asking is it or should it be? If the latter, we can also discuss 
Emacs and vi.

wunder
--
Walter Underwood
K6WRU


Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Paul
Thanks for all the responses.

CharsetDetector does look promising. Unfortunately, we aren't allowed
to keep the original of much of our data, so the Solr index is the
only place it exists (for us). I do have a Java app that reindexes,
i.e., reads all documents out of one index, does some transform on
them, then writes them to a second index. So I already have a place
where all the data in the index streams by. I wanted to make sure
there wasn't some built-in way of doing what I need.

I know that it is possible to fool the algorithm, but I'll first check
whether the string is valid UTF-8 and, if so, leave it unchanged. That
way I won't introduce more errors, and maybe I can detect a large
percentage of the non-UTF-8 strings.
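
A rough sketch of the two-step check Paul describes - accept anything that already decodes as strict UTF-8, and only ask ICU4J's CharsetDetector to guess at the rest - assuming the raw bytes of each stored value are available (class and method names here are made up):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class EncodingGuess {

    // Strict decoder: throws instead of silently replacing bad bytes with U+FFFD.
    private static CharsetDecoder strictUtf8() {
        return Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    /** Returns null if the bytes already decode as UTF-8, otherwise a best-guess charset name. */
    public static String guessIfNotUtf8(byte[] raw) {
        try {
            strictUtf8().decode(ByteBuffer.wrap(raw));  // throws on invalid UTF-8
            return null;                                // plausible UTF-8: leave it alone
        } catch (CharacterCodingException e) {
            CharsetDetector detector = new CharsetDetector();
            detector.setText(raw);
            CharsetMatch match = detector.detect();     // heuristic guess, can be fooled
            return match == null ? "unknown" : match.getName();
        }
    }
}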

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir rcm...@gmail.com wrote:
 it does: 
 http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
 this takes a sample of the file and makes a guess.


RE: verifying that an index contains ONLY utf-8

2011-01-13 Thread Jonathan Rochkind
So you're allowed to put the entire original document in a stored field in
Solr, but you aren't allowed to stick it in, say, a redis or couchdb too? Ah,
bureaucracy. But there's no reason what you are doing won't work, as you of course
already know from doing it.

If you actually know the charset of a document when indexing it, you might want
to consider putting THAT in a stored field; it's easier to keep track of the
encoding you know than to try and guess it again later.




RE: start value in queries zero or one based?

2011-01-13 Thread Jonathan Rochkind
You could have tried it and seen for yourself on any Solr server in your
possession in less time than it took to have this thread. And if you don't have
a Solr server, then why do you care?

But the answer is 0.

http://wiki.apache.org/solr/CommonQueryParameters#start
The default value is 0.

Since the default start is 0, and leaving start out does not skip
the first item of your result set, it follows that if you DO want to skip the first
item of your result set, start=1 will do it.
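
A quick SolrJ illustration of the zero-based start parameter (the server URL and query are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class StartParamDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        q.setStart(0);                       // the default: results begin at the first hit
        q.setRows(10);
        QueryResponse page1 = solr.query(q);

        q.setStart(1);                       // skips exactly one document, i.e. the first hit
        QueryResponse skipFirst = solr.query(q);

        System.out.println(page1.getResults().getNumFound() + " hits total; start=1 returned "
                + skipFirst.getResults().size() + " rows beginning at the second hit");
    }
}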




Searchers and Warmups

2011-01-13 Thread David Cramer
I'm trying to understand the mechanics behind warming up: when new searchers
are registered, and what the costs are. A quick Google didn't point me in the right
direction, so I'm hoping for some pointers here.


-- 
David Cramer




Re: Solr + Hadoop

2011-01-13 Thread Alexander Kanarsky
Joan,

make sure that you are running the job on a Hadoop 0.21 cluster. (It
looks like you compiled the apache-solr-hadoop jar with Hadoop
0.21 but are using it on a 0.20 cluster.)

-Alexander


[sfield] Missing in Spatial Search

2011-01-13 Thread Adam Estrada
According to the documentation here:
http://wiki.apache.org/solr/SpatialSearch the field that identifies the
spatial point data is sfield. See the console output below.

Jan 13, 2011 6:49:40 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={spellcheck=true&f.jtype.facet.mincount=1&facet=true&f.cat.facet.mincount=1&f.cause.facet.mincount=1&f.urgency.facet.mincount=1&rows=10&start=0&q=*:*&f.status.facet.mincount=1&facet.field=cat&facet.field=jtype&facet.field=status&facet.field=cause&facet.field=urgency?=fq={!type%3Dgeofilt+pt%3D39.0914154052734,-84.517822265625+sfield%3Dcoords+d%3D300}text:} hits=113 status=0 QTime=1
Jan 13, 2011 6:51:51 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException:  missing sfield for spatial request

Any ideas on this one?

Thanks in advance,
Adam
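
For comparison, a minimal SolrJ sketch of how that geofilt filter is normally passed as its own fq parameter. The field name coords and the point/distance values are copied from the log above; everything else is assumed, and one possible cause of the error is the stray "?=" in front of fq in the log, which would keep the spatial parser from ever seeing sfield:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class GeofiltDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        // pt, sfield and d all live inside the fq's local params; if the fq itself is
        // mangled, the spatial parser has no sfield and fails with "missing sfield".
        q.addFilterQuery("{!geofilt pt=39.0914154052734,-84.517822265625 sfield=coords d=300}");
        System.out.println(solr.query(q).getResults().getNumFound() + " docs within d=300 of the point");
    }
}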


Re: Multi-word exact keyword case-insensitive search suggestions

2011-01-13 Thread Chamnap Chhorn
Thanks for your reply. However, it doesn't work for my case at all. I think
it's a problem with the query parser or something else. It forces me to put
double quotes around the search query in order to get any results.
<str name="rawquerystring">sim 010</str>
<str name="querystring">sim 010</str>
<str name="parsedquery">+DisjunctionMaxQuery((keyphrase:sim 010)) ()</str>
<str name="parsedquery_toString">+(keyphrase:sim 010) ()</str>

<str name="rawquerystring">smart mobile</str>
<str name="querystring">smart mobile</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((keyphrase:smart))
DisjunctionMaxQuery((keyphrase:mobile)))~2) ()
</str>
<str name="parsedquery_toString">+(((keyphrase:smart) (keyphrase:mobile))~2)
()</str>

The intent here is to do a full-text search, part of which searches the
keyword field, so I can't put quotes around it.

On Thu, Jan 13, 2011 at 10:30 PM, Adam Estrada 
estrada.adam.gro...@gmail.com wrote:

 Hi,

 the following seems to work pretty well.

    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.ShingleFilterFactory"
                maxShingleSize="4" outputUnigrams="true"
                outputUnigramIfNoNgram="false" />
      </analyzer>
    </fieldType>

    <!-- A text field that uses WordDelimiterFilter to enable splitting and matching of
      words on case-change, alpha numeric boundaries, and non-alphanumeric chars,
      so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
      Synonyms and stopwords are customized by external files, and stemming is enabled.
      The attribute autoGeneratePhraseQueries="true" (the default) causes words that get split to
      form phrase queries. For example, WordDelimiterFilter splitting text:pdp-11 will cause the parser
      to generate text:"pdp 11" rather than (text:PDP OR text:11).
      NOTE: autoGeneratePhraseQueries="true" tends to not work well for non whitespace delimited languages.
    -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
               autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

    <copyField source="cat" dest="text"/>
    <copyField source="subject" dest="text"/>
    <copyField source="summary" dest="text"/>
    <copyField source="cause" dest="text"/>
    <copyField source="status" dest="text"/>
    <copyField source="urgency" dest="text"/>

 I ingest the source fields as text_ws (I know I've changed it a bit) and
 then copy the field to text. This seems to do what you are asking for.

 Adam

 On Thu, Jan 13, 2011 at 12:05 AM, Chamnap Chhorn chamnapchh...@gmail.com
 wrote:

  Hi all,
 
  I'm just stuck with exact keyword matching and have been for several days. Hope you guys
  could help me. Here is the scenario:

    1. It needs to match multi-word keywords, case-insensitively
    2. Partial-word or single-word matching against this field is not allowed
 
  I want to know the field type definition for this field and sample solr
  query. I need to combine this search with my full text search which uses
  dismax query.
 
  Thanks
  --
  Chhorn Chamnap
  http://chamnapchhorn.blogspot.com/
 




-- 
Chhorn Chamnap
http://chamnapchhorn.blogspot.com/


Re: segment gets corrupted (after background merge ?)

2011-01-13 Thread Lance Norskog
1) CheckIndex is not supposed to change a corrupt segment, only remove it.
2) Are you using local hard disks, or do you run on a common SAN or remote
file server? I have seen corruption errors on SANs, where existing
files have random changes.
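
For reference, a hedged sketch of running CheckIndex offline (Lucene 2.9-era API; the index path is a placeholder, and per Mike's note below it should only run when no IndexWriter has the index open):

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class OfflineCheck {
    public static void main(String[] args) throws Exception {
        // Point this at the Solr data/index directory while nothing is writing to it.
        CheckIndex checker = new CheckIndex(FSDirectory.open(new File("/path/to/solr/data/index")));
        CheckIndex.Status status = checker.checkIndex();
        System.out.println("index clean? " + status.clean);
        // checker.fixIndex(status);  // the -fix equivalent: DROPS any segment that failed the check
    }
}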

On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 Generally it's not safe to run CheckIndex if a writer is also open on the 
 index.

 It's not safe because CheckIndex could hit FNFE's on opening files,
 or, if you use -fix, CheckIndex will change the index out from under
 your other IndexWriter (which will then cause other kinds of
 corruption).

 That said, I don't think the corruption that CheckIndex is detecting
 in your index would be caused by having a writer open on the index.
 Your first CheckIndex has a different deletes file (_phe_p3.del, with
 44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
 44828 deleted docs), so it must somehow have to do with that change.

 One question: if you have a corrupt index, and run CheckIndex on it
 several times in a row, does it always fail in the same way?  (Ie the
 same term hits the below exception).

 Is there any way I could get a copy of one of your corrupt cases?  I
 can then dig...

 Mike

 On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
 stephane.delp...@blogspirit.com wrote:
 I understand less and less what is happening to my Solr.

 I did a checkIndex (without -fix) and there was an error...

 So I did another checkIndex with -fix, and then the error was gone. The
 segment was alright.


 During checkIndex I do not shut down the Solr server, I just make sure no
 clients connect to the server.

 Should I shut down the Solr server during checkIndex?



 first checkIndex :

  4 of 17: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
 java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p3.del]
    test: open reader.OK [44824 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs
 seen 0 + num docs deleted 0]
 java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 +
 num docs deleted 0
        at
 org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields...OK [7206878 total field count; avg 32.86 fields
 per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]
 FAILED
    WARNING: fixIndex() would remove reference to this segment; full
 exception:
 java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)


 a few minutes latter :

  4 of 18: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0
 _20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p4.del]
    test: open reader.OK [44828 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs;
 28919124 tokens]
    test: stored fields...OK [7206764 total field count; avg 32.86 fields
 per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]


 Le 12/01/2011 16:50, Michael McCandless a écrit :

 Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted
 0?

 It looks like new deletions were flushed against the segment (del file
 changed from _ncc_22s.del to _ncc_24f.del).

 Are you hitting any exceptions during indexing?

 Mike

 On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
 stephane.delp...@blogspirit.com  wrote:

 I got another corruption.

 It sure looks like it's the same type of error. (on a different field)

 It's also not linked to a merge, since the segment size did not change.


 *** good segment :

  1 of 9: name=_ncc docCount=1841685
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=6,683.447
    diagnostics = {optimize=false, mergeFactor=10,
 os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0
 _20, 

Re: Multi-word exact keyword case-insensitive search suggestions

2011-01-13 Thread Estrada Groups
Ahhh... the fun of open source software ;-). It requires a ton of trial and error!
I found what worked for me and figured it was worth passing along. If you
don't mind, when you sort everything out on your end, please post the results for
the rest of us to take a gander at.

Cheers,
Adam


use of schema.xml

2011-01-13 Thread Dennis Gearon
I'm going to buy the book for Solr, since it looks like I need to do more of 
the 
work than I thought I would.

But, from looking at it, the schema file only says:

A/ What types of data can be in the 'fields' of the documents
B/ If there are any dynamically assigned fields.
C/ What parsers are available
D/ other stuff.

And what it DOESN'T do is set the 'schema' for the index, right?
(like DDL for a database does)

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: Solr 4.0 = Spatial Search - How to

2011-01-13 Thread Lance Norskog
Spatial does not support separate fields: you don't need
lat/long, only 'coord'.

To get latitude/longitude into the coord field from the DIH, you need to
use a transformer in the DIH script.
It would populate the field 'coord' with a text string made from the lat
and lng fields:

http://wiki.apache.org/solr/DataImportHandler?#TemplateTransformer
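
Whichever way the value is produced (a DIH transformer or client code), the location-type field ends up holding a single "lat,lng" string. A minimal SolrJ sketch, with field names assumed to match the schema quoted below and the coordinate values made up:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CoordIndexDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "loc-1");
        // A location (LatLonType) field takes one "lat,lng" string, not two separate values.
        doc.addField("coord", "37.7752,-122.4232");
        solr.add(doc);
        solr.commit();
    }
}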



On Wed, Jan 12, 2011 at 5:47 PM, Adam Estrada
estrada.adam.gro...@gmail.com wrote:
 In my case, I am getting data from a database and am able to concatenate the
 lat/long as a coordinate pair to store in my coords field. To test this, I
 randomized the lat/long values and generated about 6000 documents.

 Adam

 On Wed, Jan 12, 2011 at 8:29 PM, caman aboxfortheotherst...@gmail.comwrote:


 Adam,

 thanks. Yes that helps
 but how does coords fields get populated? All I have is

 <field name="lat" type="tdouble" indexed="true" stored="true" />
 <field name="lng" type="tdouble" indexed="true" stored="true" />

 <field name="coord" type="location" indexed="true" stored="true" />

 fields 'lat' and 'lng' get populated by the dataimporthandler, but for coord, I am
 not
 sure?

 Thanks
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245709.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Lance Norskog
goks...@gmail.com


Re: use of schema.xml

2011-01-13 Thread Lance Norskog
Correct. Solr and Lucene do not store or enforce the schema. You're on
your own :)

On Thu, Jan 13, 2011 at 8:09 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 I'm going to buy the book for Solr, since it looks like I need to do more of 
 the
 work than I thought I would.

 But, from looking at it, the schema file only says:

 A/ What types of data can be in the 'fields' of the documents
 B/ If there are any dynamically assigned fields.
 C/ What parsers are available
 D/ other stuff.

 And what it DOESN'T do is set the 'schema' for the index, right?
 (like DDL for a database does)

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
 better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.





-- 
Lance Norskog
goks...@gmail.com


Re: use of schema.xml

2011-01-13 Thread Lance Norskog
Wait- it does enforce the schema names. What it does not enforce is
field contents when you change the schema. Since Lucene does not have
field replacement, it is not practical to remove or add a field to all
existing documents when you change the schema.
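
A small illustration of what is and is not enforced, assuming the stock example schema with its *_s dynamic string field (SolrJ; the id and field names are made up):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SchemaEnforcementDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("author_s", "gearon");     // accepted: matches the *_s dynamicField pattern
        // doc.addField("no_such_field", "x");  // would be rejected with an "unknown field" error
        solr.add(doc);
        solr.commit();
    }
}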

On Thu, Jan 13, 2011 at 8:15 PM, Lance Norskog goks...@gmail.com wrote:
 Correct. Solr and Lucene do not store or enforce the schema. You're on
 your own :)

 On Thu, Jan 13, 2011 at 8:09 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 I'm going to buy the book for Solr, since it looks like I need to do more of 
 the
 work than I thought I would.

 But, from looking at it, the schema file only says:

 A/ What types of data can be in the 'fields' of the documents
 B/ If there are any dynamically assigned fields.
 C/ What parsers are available
 D/ other stuff.

 And what it DOESN'T do is set the 'schema' for the index, right?
 (like DDL for a database does)

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
 better
 idea to learn from others’ mistakes, so you do not have to make them 
 yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.





 --
 Lance Norskog
 goks...@gmail.com




-- 
Lance Norskog
goks...@gmail.com


Re: use of schema.xml

2011-01-13 Thread Dennis Gearon
I could put 1-10,000 fields in any one document, as long as they are told what
type they are, or they are dynamically matched by dynamic fields relative to what's in
the schema.xml file?


It's very much like Google 'big tables' or 'elastic search' that way, right?

It's up to me to enforce any field names or quantities and assign field types
during insert/update?


 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Lance Norskog goks...@gmail.com
To: solr-user@lucene.apache.org
Sent: Thu, January 13, 2011 8:16:54 PM
Subject: Re: use of schema.xml

Wait- it does enforce the schema names. What it does not enforce is
field contents when you change the schema. Since Lucene does not have
field replacement, it is not practical to remove or add a field to all
existing documents when you change the schema.

On Thu, Jan 13, 2011 at 8:15 PM, Lance Norskog goks...@gmail.com wrote:
 Correct. Solr and Lucene do not store or enforce the schema. You're on
 your own :)

 On Thu, Jan 13, 2011 at 8:09 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 I'm going to buy the book for Solr, since it looks like I need to do more of 
the
 work than I thought I would.

 But, from looking at it, the schema file only says:

 A/ What types of data can be in the 'fields' of the documents
 B/ If there are any dynamically assigned fields.
 C/ What parsers are available
 D/ other stuff.

 And what it DOESN'T do is set the 'schema' for the index, right?
 (like DDL for a database does)

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
better
 idea to learn from others’ mistakes, so you do not have to make them 
yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.





 --
 Lance Norskog
 goks...@gmail.com




-- 
Lance Norskog
goks...@gmail.com



Re: Improving Solr performance

2011-01-13 Thread Gora Mohanty
On Thu, Jan 13, 2011 at 10:10 PM, supersoft elarab...@gmail.com wrote:

 On the one hand, I found really interesting those comments about the reasons
 for sharding. Documentation agrees you about why to split an index in
 several shards (big sizes problems) but I don't find any explanation about
 the inconvenients as an Access Control List. I guess there should be some
 and they can be critical in this design. Any example?
[...]

Can I ask what might be a stupid question? How are you measuring
the numbers below, and what do they mean?

As your hit ratio is close to 1 (i.e., everything after the first query is
coming from the cache), these numbers seem a little strange. Are
these really the time for each of the N simultaneous queries? They
seem to be monotonically increasing (though with a couple of
strange exceptions), which leads me to suspect that they are some
kind of cumulative times, e.g., by this interpretation, for the case of
the 10 simultaneous queries, the first one takes 1047ms, the second
268ms, the third 125ms, and so on.

We have run performance tests with pg_bench on a index of size
40GB on a single Solr server with about 6GB of RAM allocated
to Solr, and see what I would think of as expected behaviour, i.e.,
for every fresh query term, the first query takes the longest, and
the time for subsequent queries with the same term goes down
dramatically, as the result is coming out of the cache. This is at
odds with what you describe here, so I will have to go back and check
that we did not miss something important.

 1 simultaneous query: 3437 ms (cache fails)

 2 simultaneous queries: 594, 954 ms

 10 simultaneous queries: 1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500,
 2938, 3000 ms

 50 simultaneous queries: 1203, 1453, 1453, 1437, 1625, 1953, 5688, 12938,
 14953, 16281, 15984, 16453, 15812, 16469, 16563, 16844, 17703, 16843, 17359,
 16828, 18235, 18219, 18172, 18203, 17672, 17344, 17453, 18484, 18157, 18531,
 18297, 18359, 18063, 18516, 18125, 17516, 18562, 18016, 18187, 18610, 18703,
 18672, 17829, 18344, 18797, 18781, 18265, 18875, 18250, 18812

 100 simultaneous queries: 1297, 1531, 1969, 2203, 2375, 2891, 3937, 4109,
 4703, 4890, 5047, 5312, 5563, 6422, 6437, 7063, 7093, 7391, 7594, 7672,
 8172, 8547, 8750, 8984, 9265, 9609, 9907, 10344, 11406, 11484, 11484, 11500,
 11547, 11703, 11797, 11875, 11922, 12328, 12375, 12875, 12922, 13187, 13219,
 13407, 13500, 13562, 13719, 13828, 13875, 14016, 14078, 14672, 15922, 16328,
 16625, 16953, 17282, 18172, 18484, 18985, 20594, 20625, 20860, 21281, 21469,
 21625, 21875, 21875, 22141, 22157, 22172, 23125, 23125, 23141, 23203, 23203,
 23328, 24625, 24641, 24672, 24797, 24985, 25031, 25188, 25844, 25937, 26016,
 26437, 26453, 26437, 26485, 28297, 28687, 31782, 31985, 31969, 32016, 32031,
 32016, 32281 ms
[...]

Regards,
Gora


Re: Variable datasources

2011-01-13 Thread Gora Mohanty
On Fri, Jan 14, 2011 at 1:02 AM, tjpoe tanner.post...@gmail.com wrote:
[...]
 I also tried creating datasources for each local and then using a variable
 datasource in the entity such as:

 <datasource url="jdbc:mysql://localhost/aaa" name="content_aaa" />
 <datasource url="jdbc:mysql://localhost/bbb" name="content_bbb" />
 <datasource url="jdbc:mysql://localhost/ccc" name="content_ccc" />
 <datasource url="jdbc:mysql://localhost/master" name="master" />

 and then the document as:

 <document name="items">
   <entity datasource="master" name="local" query="select code from locals"
           rootEntity="false">
     <entity datasource="content_${local.code}" name="item"
             query="select *, ${local.code} as code from item" />
   </entity>
 </document>

 but the ${local.code} variable is not resolved and it attempts to connect to
 the literal source content_${local.code}.
[...]

As you have discovered, variables are not resolved in the datasource attribute.
There was a thread on this subject a couple of days ago, and apparently
Alexei has resolved the issue. Please see:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg45407.html

Regards,
Gora


Re: Adding a new site to existing solr configuration

2011-01-13 Thread Gora Mohanty
On Thu, Jan 13, 2011 at 10:47 PM, PeterKerk vettepa...@hotmail.com wrote:

 I still have the default Solr example config running on Jetty. I use Cygwin
 to start my current site.

 Now I already have fully configured one solr instance with these files:
 \example\example-DIH\solr\db\conf\my-data-config.xml
 \example\example-DIH\solr\db\conf\schema.xml
 \example\example-DIH\solr\db\conf\solrconfig.xml

 Now, I wish to add ANOTHER site to my already running sites. This site
 ofcourse has a different data-config, but the question is: what files
 can/should I add to the already existing directories?
[...]

If I understand your requirements correctly, the easiest way would be
to do the following:
* Copy the entire directory example/example-DIH/solr/db to a new one,
  say example/example-DIH/solr/test
* As this is running a multi-core setup, add the new site as a different
  core instance in example/example-DIH/solr/solr.xml. Thus, just before
  the </cores> line, add:
    <core default="false" instanceDir="test" name="test"></core>
* example/example-DIH/solr/test/conf/solrconfig.xml is already set up to
  use db-data-config.xml as the DIH configuration file, so you can make
  any changes there. Else, change the name of db-data-config.xml, and
  modify the config attribute of the /dataimport RequestHandler in
  solrconfig.xml.
* Make any desired changes to schema.xml, e.g., if you have different
  fields, or if they are of different types.
* Start Solr, and run it as usual, as per example/example-DIH/README.
  E.g., a dataimport would be initiated by loading
http://localhost:8983/solr/test/dataimport?command=full-import

Regards,
Gora


Re: Solr 4.0 = Spatial Search - How to

2011-01-13 Thread Grijesh.singh

I have used that type of location searching, but I have not used spatial
search; I wrote my logic at the application end.
I cached the location ids and their lat/lng. When a query comes in
for a location, say New Delhi, my location-search logic at the
application end calculates the distance from New Delhi to the other locations
in my cache and shortlists only the locations that are within my radius, and
then I go to Solr to search on all the locations I got from my logic.

It works faster because it works on only a little data, about 500
locations. But in spatial search that calculation is done for all the documents
we have.

So this workaround does not impact performance when my index size
grows, but spatial search does.

-
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2253682.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.0 = Spatial Search - How to

2011-01-13 Thread caman

Thanks
Here was the issue: concatenating the two floats (lat, lng) at the MySQL end converted
the value to a BLOB, and indexing would fail when storing a BLOB in a 'location' type field.
After the BLOB issue was resolved, all worked OK.

Thank you all for your help



-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2253691.html
Sent from the Solr - User mailing list archive at Nabble.com.