Re: Couple issues with edismax in 3.5

2012-03-05 Thread William Bell
Actually the results are great with lucene. The issue is with edismax.
I did figure out the issue...

The scoring was producing different results based on distance, when I
really need the scoring to be:

score = tf(user_query,"smith"), adding geodist() only if tf > 0. This is
pretty difficult to do in SOLR 3.5, but trivial in 4.0.

When are we getting tf() in 3.5 ?

Bill


On Mon, Mar 5, 2012 at 9:31 AM, Ahmet Arslan  wrote:
>> I also get an issue with "." with
>> edismax.
>>
>> For example: Dr. Smith gives me different results than "dr
>> Smith"
>
> I believe this is related to analysis (rather than the query parser). You can 
> inspect the output at admin/analysis.jsp.
>
> What happens when you switch to &defType=lucene? Does Dr. Smith yield the same 
> results as dr Smith?



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


How to rank an exact match higher?

2012-03-05 Thread Tommy Chheng
I'm using solr 3.5 for a type ahead search system. I want to rank
exact matches(lowercased) higher than non-exact matches.

For example, if i have two docs:
Doc One: title="New York"
Doc Two: title="New York City"

I would expect a query of "new york" to rank "New York" over "New York City"

It looks like I need to take into account the # of matches vs the
total # of tokens in a field. I'm not sure how to do this.
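
One common remedy (a sketch only, assuming reindexing is possible and using
hypothetical field/type names): add a lowercased, untokenized copy of the
title and boost exact matches on it, e.g.

<fieldType name="string_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="title_exact" type="string_lc" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>

and then query with qf=title_exact^200 title^50, so a whole-field match on
title_exact outranks partial matches on title.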

My debug query shows the two docs with exactly the same scores:

rawquerystring: "new york"
querystring: "new york"
parsedquery: +DisjunctionMaxQuery((title:"new york"^50.0 | textng:"new york"^40.0))
parsedquery_toString: +(title:"new york"^50.0 | textng:"new york"^40.0)

1.1890696 = (MATCH) max of:
  1.1890696 = (MATCH) weight(title:"new york"^50.0 in 0), product of:
    0.9994 = queryWeight(title:"new york"^50.0), product of:
      50.0 = boost
      1.1890697 = idf(title: new=2 york=2)
      0.01681987 = queryNorm
    1.1890697 = fieldWeight(title:"new york" in 0), product of:
      1.0 = tf(phraseFreq=1.0)
      1.1890697 = idf(title: new=2 york=2)
      1.0 = fieldNorm(field=title, doc=0)
1.1890696 = (MATCH) max of:
  1.1890696 = (MATCH) weight(title:"new york"^50.0 in 1), product of:
    0.9994 = queryWeight(title:"new york"^50.0), product of:
      50.0 = boost
      1.1890697 = idf(title: new=2 york=2)
      0.01681987 = queryNorm
    1.1890697 = fieldWeight(title:"new york" in 1), product of:
      1.0 = tf(phraseFreq=1.0)
      1.1890697 = idf(title: new=2 york=2)
      1.0 = fieldNorm(field=title, doc=1)


I posted my solrconfig/schema here:
https://gist.github.com/1984052

-- 
Tommy Chheng


Creating a query-able dictionary using Solr

2012-03-05 Thread Beach, Joel
Hi there,

Am looking at using Solr to perform the following tasks:

1. Push a lot of PDF documents into SOLR.
2. Build a database of all the words encountered in those documents.
3. Be able to query for a list of words matching a string like "a*"

For example, if the collection contains the words aardvark, apple, doctor and 
zebra,
I would expect a query of "a*" to return the list:

[ aardvark, apple ]

I have googled around for this in Solr and found similar things involving
spell-checkers, but nothing that seems to be exactly the same.

Has anyone already done this, or something similar, in Solr and is willing to
point me in the right direction?
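
For what it's worth, one approach is Solr's TermsComponent, which walks the
term dictionary directly. A minimal sketch, assuming the indexed full-text
field is named "text" and a /terms handler is registered in solrconfig.xml:

http://localhost:8983/solr/terms?terms=true&terms.fl=text&terms.prefix=a&terms.limit=1000

The field name and limit here are assumptions; the response is the list of
indexed terms starting with "a".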

Cheers,

Joel

Re: Building a resilient cluster

2012-03-05 Thread Ranjan Bagchi
Hi Mark,

So I tried this: I started up one instance with ZooKeeper, and started a second
instance defining a shard name in solr.xml -- it worked: searches covered both
indices, and looking at the ZooKeeper UI, I'd see the second shard.  However,
when I brought the second server down, the first one stopped working: it didn't
kick the second shard out of the cluster.

Any way to do this?

Thanks,

Ranjan


> From: Mark Miller 
> To: solr-user@lucene.apache.org
> Cc:
> Date: Wed, 29 Feb 2012 22:57:26 -0500
> Subject: Re: Building a resilient cluster
> Doh! Sorry - this was broken - I need to fix the doc or add it back.
>
> The shard id is actually set in solr.xml since it's per core - the sys prop
> was a sugar option we had set up. So either add 'shard' to the core in
> solr.xml, or to make it work like it does in the doc, do:
>
>   <core ... shard="${shard:}"/>
>
> That sets shard to the 'shard' system property if it's set, or, as a
> default, acts as if it wasn't set.
>
> I've been working with custom shard ids mainly through solrj, so I hadn't
> noticed this.
>
> - Mark
>
> On Wed, Feb 29, 2012 at 10:36 AM, Ranjan Bagchi  >wrote:
>
> > Hi,
> >
> > At this point I'm ok with one zk instance being a point of failure, I
> just
> > want to create sharded solr instances, bring them into the cluster, and
> be
> > able to shut them down without bringing down the whole cluster.
> >
> > According to the wiki page, I should be able to bring up a new shard by
> > using shardId [-DshardId], but when I did that, the logs showed it
> > replicating an existing shard.
> >
> > Ranjan
> > Andre Bois-Crettez wrote:
> >
> > > You have to run ZK on at least 3 different machines for fault
> > > tolerance (a ZK ensemble).
> > >
> > >
> >
> > > http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble
> > >
> > > Ranjan Bagchi wrote:
> > > > Hi,
> > > >
> > > > I'm interested in setting up a solr cluster where each machine [at least
> > > > initially] hosts a separate shard of a big index [too big to sit on the
> > > > machine].  I'm able to put a cloud together by telling it that I have (to
> > > > start out with) 4 nodes, and then starting up nodes on 3 machines pointing
> > > > at the zkInstance.  I'm able to load my sharded data onto each machine
> > > > individually and it seems to work.
> > > >
> > > > My concern is that it's not fault tolerant: if one of the non-zookeeper
> > > > machines falls over, the whole cluster won't work.  Also, I can't create
> > > > a shard with more data, and have it work within the existing cloud.
> > > >
> > > > I tried using -DshardId=shard5 [on an existing 4-shard cluster], but it
> > > > just started replicating, which doesn't seem right.
> > > >
> > > > Are there ways around this?
> > > >
> > > > Thanks,
> > > > Ranjan Bagchi
> > > >
> > > >
> >
>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>


Re: Help with Synonyms

2012-03-05 Thread Donald Organ
Excellent thank you, it is now working!

On Mon, Mar 5, 2012 at 9:37 PM, Koji Sekiguchi  wrote:

> (12/03/06 11:23), Donald Organ wrote:
>
>> Ok so do I need to use a different format in my synonyms.txt file in order
>> to do this at index time?
>>
>>
> Right, if you want to apply synonym rules at index time only.
> Use "," like this:
>
> floor locker, storage locker
>
> And don't forget to set expand="true" in your index time synonym
> definition.
> This means that if you have "floor locker" in your document, it will be
> expanded in the index not only to "floor locker" but also to "storage
> locker", so you can then find the document with either q=floor locker or
> q=storage locker.
>
> koji
> --
> Query Log Visualizer for Apache Solr
> http://soleami.com/
>


Re: Help with Synonyms

2012-03-05 Thread Koji Sekiguchi

(12/03/06 11:23), Donald Organ wrote:

Ok so do I need to use a different format in my synonyms.txt file in order
to do this at index time?



Right, if you want to apply synonym rules at index time only.
Use "," like this:

floor locker, storage locker

And don't forget to set expand="true" in your index time synonym definition.
This means that if you have "floor locker" in your document, it will be expanded
in the index not only to "floor locker" but also to "storage locker", so you can
then find the document with either q=floor locker or q=storage locker.
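
A minimal sketch of such an index-time-only setup (type and file names are
just examples):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>

with synonyms.txt containing the comma form:

floor locker, storage locker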

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: Help with Synonyms

2012-03-05 Thread Donald Organ
Ok so do I need to use a different format in my synonyms.txt file in order
to do this at index time?

On Monday, March 5, 2012, Koji Sekiguchi  wrote:
> (12/03/06 11:07), Donald Organ wrote:
>>
>> No I do synonyms at index time.
>>
> :

 I am still getting results for storage locker  and no results for floor
 locker

 synonyms.txt still looks like this:

 floor locker=>storage locker
>
> So that's the cause of the problem. Due to the definition
> "floor locker=>storage locker" at index time analysis, you get
> "storage" / "locker" in your index, and no "floor" terms in your index
> at all. In general, if you use the "=>" form in your synonyms.txt,
> you should apply the same rule at both index and query time.
>
> koji
> --
> Query Log Visualizer for Apache Solr
> http://soleami.com/
>


Re: Help with Synonyms

2012-03-05 Thread Koji Sekiguchi

(12/03/06 11:07), Donald Organ wrote:

No I do synonyms at index time.


:

I am still getting results for storage locker  and no results for floor
locker

synonyms.txt still looks like this:

floor locker=>storage locker


So that's the cause of the problem. Due to the definition "floor locker=>storage
locker" at index time analysis, you get "storage" / "locker" in your index, and
no "floor" terms in your index at all. In general, if you use the "=>" form in
your synonyms.txt, you should apply the same rule at both index and query time.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Re: XSLT Response Writer and content transformation

2012-03-05 Thread Matthew Parker
You can embed custom Java functions in XSLT:

http://cafeconleche.org/books/xmljava/chapters/ch17s03.html
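
A minimal sketch of the extension-function route, assuming the default JAXP
transformer is Xalan and a hypothetical class com.example.ContentFunctions
with a public static String rewrite(String) method:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:cf="xalan://com.example.ContentFunctions"
    exclude-result-prefixes="cf">
  <!-- call the Java method on each content value -->
  <xsl:template match="str[@name='content']">
    <description><xsl:value-of select="cf:rewrite(string(.))"/></description>
  </xsl:template>
</xsl:stylesheet>

Class, method, and element names are assumptions; the mechanism is Xalan's
java extension namespace binding.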


On Mon, Mar 5, 2012 at 4:27 AM, darul  wrote:

> Hello,
>
> Using native XSLT Response Writer, we may need to alter content before
> processing xml solr output as a RSS Feed.
>
> Example (trivial one...):
>
> <doc>
>   <str name="content">bla bla bla</str>
> </doc>
>
> After processing content:
>
> <doc>
>   <str name="content">bla bla bla bla bla bla bla bla bla bla bla bla</str>
> </doc>
>
> Have you any ideas on how to implement a custom function in xslt or before
> in XsltResponseWriter.
>
> I would like get this code in a java class and call it for content
> processing
>
> Thanks,
>
> Jul
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/XSLT-Response-Writer-and-content-transformation-tp3800251p3800251.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



Re: Help with Synonyms

2012-03-05 Thread Donald Organ
No I do synonyms at index time.

On Monday, March 5, 2012, Koji Sekiguchi  wrote:
> (12/03/06 0:11), Donald Organ wrote:
>>>
>>> Try to remove tokenizerFactory="KeywordTokenizerFactory" in your
>>> synonym filter definition, because I think you would want to tokenize
>>> the synonym settings in synonyms.txt as "floor" / "locker" =>
>>> "storage" / "locker". But if you set it to KeywordTokenizer, it will
>>> be a map of "floor locker" => "storage locker", and as you are using
>>> WhitespaceTokenizer for your <tokenizer> in <analyzer>, then if you
>>> try to index "floor locker", it will be "floor"/"locker" (not "floor
>>> locker"), and as a result it will not match your synonym map.
>>>
>>> Aside, I recommend that you set the <charFilter> - <tokenizer> -
>>> <filter> chain in the natural order in <fieldType>, though if those
>>> are wrong it won't be the cause of the problem at all.
>>>
>>>
>>>
>> OK so I have updated my schema.xml to the following:
>>
>> <fieldType name="..." class="solr.TextField" omitNorms="false">
>>   <analyzer>
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory"
>>             synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>             catenateNumbers="1" catenateAll="0" />
>>     <filter class="solr.StopFilterFactory" words="stopwords.txt" />
>>     <filter class="..." protected="protwords.txt" />
>>   </analyzer>
>> </fieldType>
>>   .
>>
>> I am still getting results for storage locker  and no results for floor
>> locker
>>
>> synonyms.txt still looks like this:
>>
>> floor locker=>storage locker
>
> Hi Donald,
>
> Do you use the same SynonymFilter setting in the query analyzer part
> (<analyzer type="query">)?
>
> koji
> --
> Query Log Visualizer for Apache Solr
> http://soleami.com/
>


Re: Help with Synonyms

2012-03-05 Thread Koji Sekiguchi

(12/03/06 0:11), Donald Organ wrote:

Try to remove tokenizerFactory="KeywordTokenizerFactory" in your synonym filter
definition, because I think you would want to tokenize the synonym settings in
synonyms.txt as "floor" / "locker" => "storage" / "locker". But if you set it
to KeywordTokenizer, it will be a map of "floor locker" => "storage locker",
and as you are using WhitespaceTokenizer for your <tokenizer> in <analyzer>,
then if you try to index "floor locker", it will be "floor"/"locker" (not
"floor locker"), and as a result it will not match your synonym map.

Aside, I recommend that you set the <charFilter> - <tokenizer> - <filter>
chain in the natural order in <fieldType>, though if those are wrong it won't
be the cause of the problem at all.




OK so I have updated my schema.xml to the following:

<fieldType name="..." class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="..." protected="protwords.txt"/>
  </analyzer>
</fieldType>
   .

I am still getting results for storage locker  and no results for floor
locker

synonyms.txt still looks like this:

floor locker=>storage locker


Hi Donald,

Do you use the same SynonymFilter setting in the query analyzer part
(<analyzer type="query">)?

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Highlighting Multivalued Field question

2012-03-05 Thread Jamie Johnson
If I have a multivalued field with values as follows

<arr name="clothing">
  <str>black pants</str>
  <str>white shirt</str>
</arr>

and I do a query against that field with highlighting enabled as follows

/select?hl.fl=clothing&rows=5&q=clothing:black clothing:shirt&hl=on&indent=true

I thought I would see the following in the highlights

<em>black</em> pants
white <em>shirt</em>

but instead I'm seeing the following

<em>black</em> pants

is this expected?

Also I'm using a custom highlighter which extends SolrHighlighter but
99.9% of it is a straight copy of DefaultSolrHighlighter with support
for pulling unstored fields from an external database, so I expect
that this works the same way as DefaultSolrHighlighter, but if this is
not the expected case I will try with DefaultSolrHighlighter.


RE: disabling QueryElevationComponent

2012-03-05 Thread Welty, Richard
Walter Underwood [mailto:wun...@wunderwood.org] writes:
 
>On Mar 5, 2012, at 1:16 PM, Welty, Richard wrote:

>> Walter Underwood [mailto:wun...@wunderwood.org] writes:
 
>>> You may be able to have unique keys. At Netflix, I found that there were
>>> collisions between the movie IDs and the person IDs. So, I put an 'm' at
>>> the beginning of each movie ID and a 'p' at the beginning of each person
>>> ID. Like magic, I had unique IDs.

>> did you do this with a transformer at index time, or in some other manner?

>SQL should be able to do this, though it might not be portable. For MySQL it
>is something like:

>select
>   concat('m', movie_id) as id,

ok, thanks. i know how to do that in postgresql...
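
for the index-time route, a sketch using DIH's TemplateTransformer (entity and
column names here are hypothetical):

<entity name="movie" transformer="TemplateTransformer"
        query="select movie_id, title from movies">
  <field column="id" template="m${movie.movie_id}"/>
</entity>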

richard



Re: disabling QueryElevationComponent

2012-03-05 Thread Walter Underwood
On Mar 5, 2012, at 1:16 PM, Welty, Richard wrote:

> Walter Underwood [mailto:wun...@wunderwood.org] writes:
> 
>> You may be able to have unique keys. At Netflix, I found that there were 
>> collisions between >the movie IDs and the person IDs. So, I put an 'm' at 
>> the beginning of each movie ID and a >'p' at the beginning of each person 
>> ID. Like magic, I had unique IDs.
> 
> did you do this with a transformer at index time, or in some other manner?

I had custom code for indexing; this was in Solr 1.3.

SQL should be able to do this, though it might not be portable. For MySQL is it 
something like:

select
   concat('m', movie_id) as id,
   ...

wunder
--
Walter Underwood
wun...@wunderwood.org





RE: How can Solr do parallel query warming with <listener> and <firstSearcher>?

2012-03-05 Thread Michael Ryan
https://issues.apache.org/jira/browse/SOLR-2548 may be of interest to you.

-Michael


wildcard queries with edismax and lucene query parsers

2012-03-05 Thread Robert Stewart
How is scoring affected by wildcard queries?  Seems when I use a
wildcard query I get all constant scores in response (all scores =
1.0).  That occurs with both edismax as well as lucene query parser.
I am trying to implement auto-suggest feature so I need to use wild
card to return all results that match the prefix entered by a user.
But I want the results sorted according to score defined by the "qf"
parameter in my search handler.

?defType=edismax&q=grow*&fl=title,score

<doc>
  <float name="score">1.0</float>
  <str name="title">S&P 1000 Growth</str>
</doc>
<doc>
  <float name="score">1.0</float>
  <str name="title">S&P 1000 Pure Growth</str>
</doc>

?defType=lucene&q=grow*&fl=title,score

<doc>
  <float name="score">1.0</float>
  <str name="title">S&P 1000 Growth</str>
</doc>
<doc>
  <float name="score">1.0</float>
  <str name="title">S&P 1000 Pure Growth</str>
</doc>

If I use a query with no wildcard, scoring appears correct:

?defType=edismax&q=growth&fl=title,score

<doc>
  <float name="score">0.7500377</float>
  <str name="title">S&P 1000 Growth</str>
</doc>
<doc>
  <float name="score">0.7500377</float>
  <str name="title">S&P 500 Growth</str>
</doc>
<doc>
  <float name="score">0.656283</float>
  <str name="title">S&P 1000 Pure Growth</str>
</doc>



I am using SOLR version 3.2 and using a request handler defined like this:

<requestHandler name="..." class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">edismax</str>
    <str name="q.alt">*:*</str>
    <str name="qf">
      ticker^10.0 indexCode^10.0 indexKey^10.0 title^5.0 indexName^5.0
    </str>
    <str name="fl">indexId,indexName,indexCode,indexKey,title,ticker,urlTitle</str>
    <str name="fq">+contentType:IndexProfile</str>
  </lst>
</requestHandler>


RE: disabling QueryElevationComponent

2012-03-05 Thread Welty, Richard



Walter Underwood [mailto:wun...@wunderwood.org] writes:

>You may be able to have unique keys. At Netflix, I found that there were 
>collisions between >the movie IDs and the person IDs. So, I put an 'm' at the 
>beginning of each movie ID and a >'p' at the beginning of each person ID. Like 
>magic, I had unique IDs.

did you do this with a transformer at index time, or in some other manner?

>You should be able to disable the query elevation stuff by removing it from 
>your >solrconfig.xml.

the documentation certainly implies this, which is why i'm baffled. i see no 
reason
why removing the config should trigger the multiple core error when i only have 
the
default setup with one core.

richard



Re: disabling QueryElevationComponent

2012-03-05 Thread Walter Underwood
You may be able to have unique keys. At Netflix, I found that there were 
collisions between the movie IDs and the person IDs. So, I put an 'm' at the 
beginning of each movie ID and a 'p' at the beginning of each person ID. Like 
magic, I had unique IDs.

You should be able to disable the query elevation stuff by removing it from 
your solrconfig.xml.

wunder
Walter Underwood
wun...@wunderwood.org

On Mar 5, 2012, at 1:09 PM, Welty, Richard wrote:

> i googled and found numerous references to this, but no answers that went to 
> my specific issues.
> 
> i have a solr 3.5.0 server set up that needs to index several different 
> document types, there is no common unique key field. so i can't use the 
> uniqueKey declaration and need to disable the QueryElevationComponent. 
> 
> when i set this up with the uniqueKey, i mapped ids from the various database 
> tables to the id key temporarily just to get things working, but the results 
> from queries are showing me i really have to get rid of that hack. so i 
> commented out uniqueKey in schema.xml and commented out the 
> QueryElevationComponent searchComponent and the associated requestHandler in 
> solrconfig.xml
> 
> when i restart solr and go to the solr/admin/dataimport.jsp page to test, i 
> get the 'missing core name in path' error.
> 
> so what further configuration changes are required to disable 
> QueryElevationComponent? 
> 
> thanks,
>   richard

--





disabling QueryElevationComponent

2012-03-05 Thread Welty, Richard
i googled and found numerous references to this, but no answers that went to my 
specific issues.

i have a solr 3.5.0 server set up that needs to index several different 
document types, there is no common unique key field. so i can't use the 
uniqueKey declaration and need to disable the QueryElevationComponent. 

when i set this up with the uniqueKey, i mapped ids from the various database 
tables to the id key temporarily just to get things working, but the results 
from queries are showing me i really have to get rid of that hack. so i 
commented out uniqueKey in schema.xml and commented out the 
QueryElevationComponent searchComponent and the associated requestHandler in 
solrconfig.xml

when i restart solr and go to the solr/admin/dataimport.jsp page to test, i get 
the 'missing core name in path' error.

so what further configuration changes are required to disable 
QueryElevationComponent? 

thanks,
   richard


Re: How can Solr do parallel query warming with <listener> and <firstSearcher>?

2012-03-05 Thread Mikhail Khludnev
Neil,

It's still not clear whether these are multi- or single-valued fields, which
determines the use of FieldCache or UnInvertedField, and of a per-segment
reader vs. a top-level reader.
The only concern I have about your approach is the waste of CPU for
calculating facets for huge *:* docsets. I guess you can try to find a
narrower way to trigger field cache initialization. Perhaps
http://wiki.apache.org/solr/TermsComponent can be useful for it.

Regards

On Sat, Mar 3, 2012 at 11:58 PM, Neil Hooey  wrote:

> I need to have those queries trigger the generation of facet counts, which
> can take up to 5 minutes for all of them combined.
>
> If the facet counts aren't warmed, then the first query to ask for facet
> counts on a particular field will take several minutes to return results.
>
> On Sat, Mar 3, 2012 at 5:40 AM, Mikhail Khludnev <
> mkhlud...@griddynamics.com>
> wrote:
> > Neil,
> >
> > Would you mind if I ask what particularly do you want to warm by these
> > queries?
> >
> > Regards
> >
> > On Sat, Mar 3, 2012 at 12:37 AM, Neil Hooey  wrote:
> >
> >> I'm trying to get Solr to run warming queries in parallel with
> >> listener events, but it always does them in sequence, pegging one CPU
> >> while calculating facet counts.
> >>
> >> Someone at Lucid Imagination suggested using multiple <listener
> >> event="firstSearcher"> tags, each with a single facet query in them,
> >> but those are still done in sequence.
> >>
> >> Is it possible to run warming queries in parallel, and if so, how?
> >>
> >> I'm aware that you could run an external script that forks, but I'd
> >> like to use Solr's native support for this if it exists.
> >>
> >> Examples that don't work:
> >>
> >> <listener event="firstSearcher" class="solr.QuerySenderListener">
> >>   <arr name="queries">
> >>     <lst><str name="q">*:*</str><str name="facet.field">field1</str></lst>
> >>     <lst><str name="q">*:*</str><str name="facet.field">field2</str></lst>
> >>     <lst><str name="q">*:*</str><str name="facet.field">field3</str></lst>
> >>     <lst><str name="q">*:*</str><str name="facet.field">field4</str></lst>
> >>   </arr>
> >> </listener>
> >>
> >> <listener event="firstSearcher" class="solr.QuerySenderListener">
> >>   <arr name="queries">
> >>     <lst><str name="q">*:*</str><str name="facet.field">field1</str></lst>
> >>   </arr>
> >> </listener>
> >> <listener event="firstSearcher" class="solr.QuerySenderListener">
> >>   <arr name="queries">
> >>     <lst><str name="q">*:*</str><str name="facet.field">field2</str></lst>
> >>   </arr>
> >> </listener>
> >> <listener event="firstSearcher" class="solr.QuerySenderListener">
> >>   <arr name="queries">
> >>     <lst><str name="q">*:*</str><str name="facet.field">field3</str></lst>
> >>   </arr>
> >> </listener>
> >> <listener event="firstSearcher" class="solr.QuerySenderListener">
> >>   <arr name="queries">
> >>     <lst><str name="q">*:*</str><str name="facet.field">field4</str></lst>
> >>   </arr>
> >> </listener>
> >>
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Lucid Certified
> > Apache Lucene/Solr Developer
> > Grid Dynamics
> >
> > 
> >  
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


Re: Java6 End of Life, upgrading to 7

2012-03-05 Thread Shawn Heisey

On 2/28/2012 8:16 AM, Shawn Heisey wrote:
Due to the End of Life announcement for Java6, I am going to need to 
upgrade to Java 7 in the very near future.  I'm running Solr 3.5.0 
modified with a couple of JIRA patches.


https://blogs.oracle.com/henrik/entry/updated_java_6_eol_date

I saw the announcement that Java 7u1 had fixed all the known bugs 
relating to Solr.  Is there anything I need to be aware of when 
upgrading?  These are the commandline switches I am using that apply 
to Java itself:


-Xms8192M
-Xmx8192M
-XX:NewSize=6144M
-XX:SurvivorRatio=4
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled


Does anyone have any information on upgrading to Java7 with Solr?  Any 
gotchas, recommendations, or should I just go for it without changing 
anything?  Java6 goes EOL in November, after which public (free) support 
and bugfixes go away.  By then running Solr with Java 7 will be 
important to the entire community, not just me.


Thanks,
Shawn



Re: Solr Design question on spatial search

2012-03-05 Thread Lance Norskog
The Lucene geo searching code is very fast. Geosearch queries
calculate the distance from the city to all 20k stores and sort on
this.
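
A sketch of such a query (field name and coordinates are assumptions; 10
miles is roughly 16.1 km):

q=*:*&sfield=store&pt=37.7749,-122.4194&fq={!geofilt d=16.1}&sort=geodist() asc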

If this is not fast enough, you can pre-calculate the city/store lists
by doing all of this searching in advance. You can store these in a DB
and do incremental updates to your index. As to re-indexing all the
data, you should assume you will do this regularly.

Lance

On Fri, Mar 2, 2012 at 2:06 PM, Venu Gmail Dev  wrote:
> Sorry for not being clear enough.
>
> I don't know the point of origin. All I know is that there are 20K retail 
> stores. Only the cities within a 10-mile radius of these stores should be 
> searchable. Any city which is outside these small 10-mile circles around 
> these 20K stores should be ignored.
>
> So when somebody searches for a city, I need to query the cities which are in 
> these 20K 10-mile circles, but I don't know which 10-mile circle I should 
> query.
>
> So the approach that I was thinking were :-
>
> a) Have 2 separate indexes. First one to store the information about all 
> the cities and second one to store the retail stores information. 
> Whenever user searches for a city then I return all the matching cities ( 
> and hence the lat-long) from first index and then do a spatial search on 
> each of the matched city in the second index. But this is too costly.
>
> b) Index only the cities which have a nearby store. Do all the 
> calculation(s) before indexing the data so that the search is fast. The 
> problem that I see with this approach is that if a new retail store or a 
> city is added then I would have to re-index all the data again.
>
> Does this answers the problem that you posed ?
>
> Thanks,
> Venu.
>
> On Mar 2, 2012, at 9:52 PM, Erick Erickson wrote:
>
>> But again, that doesn't answer the problem I posed. Where is your
>> point of origin?
>> There's nothing in what you've written that indicates how you would know
>> that 10 miles is relative to San Francisco. All you've said is that
>> you're searching
>> on "San". Which would presumably return San Francisco, San Mateo, San Jose.
>>
>> Then, also presumably, you're looking for all the cities with stores
>> within 10 miles
>> of one of these cities. But nothing in your criteria so far says that
>> that city is
>> San Francisco.
>>
>> If you already know that San Francisco is the locus, simple distance
>> will work just
>> fine. You can index both city and store info in the same index and
>> restrict, say, facets
>> (or, indeed search results) by fq clause (e.g. fq=type:city or 
>> fq=type:store).
>>
>> Or I'm completely missing the boat here.
>>
>> Best
>> Erick
>>
>>
>> On Fri, Mar 2, 2012 at 11:50 AM, Venu Dev  wrote:
>>> So let's say x=10 miles. Now if I search for San then San Francisco, San 
>>> Mateo should be returned because there is a retail store in San Francisco. 
>>> But San Jose should not be returned because it is more than 10 miles away 
>>> from San
>>> Francisco. Had there been a retail store in San Jose then it should be also 
>>> returned when you search for San. I can restrict the queries to a country.
>>>
>>> Thanks,
>>> ~Venu
>>>
>>> On Mar 2, 2012, at 5:57 AM, Erick Erickson  wrote:
>>>
 I don't see how this works, since your search for San could also return
 San Marino, Italy. Would you then return all retail stores in
 X miles of that city? What about San Salvador de Jujuy, Argentina?

 And even in your example, San would match San Mateo. But should
 the search then return any stores within X miles of San Mateo?
 You have to stop somewhere

 Is there any other information you have that restricts how far to expand 
 the
 search?

 Best
 Erick

 On Thu, Mar 1, 2012 at 4:57 PM, Venu Gmail Dev  
 wrote:
> I don't think Spatial search will fully fit into this. I have 2 
> approaches in mind but I am not satisfied with either one of them.
>
> a) Have 2 separate indexes. First one to store the information about all 
> the cities and second one to store the retail stores information. 
> Whenever user searches for a city then I return all the matching cities 
> from first index and then do a spatial search on each of the matched city 
> in the second index. But this is too costly.
>
> b) Index only the cities which have a nearby store. Do all the 
> calculation(s) before indexing the data so that the search is fast. The 
> problem that I see with this approach is that if a new retail store or a 
> city is added then I would have to re-index all the data again.
>
>
> On Mar 1, 2012, at 7:59 AM, Dirceu Vieira wrote:
>
>> I believe that what you need is spatial search...
>>
>> Have a look a the documention:  http://wiki.apache.org/solr/SpatialSearch
>>
>> On Wed, Feb 29, 2012 at 10:54 PM, Venu Shankar 
>> wrote:
>>
>>> Hello,
>>>
>>> I have a d

How to limit the number of open searchers?

2012-03-05 Thread Michael Ryan
Is there a way to limit the number of searchers that can be open at a given 
time?  I know there is a maxWarmingSearchers configuration that limits the 
number of warming searchers, but that's not quite what I'm looking for...

Ideally, when I commit, I want there to only be one searcher open before the 
commit, so that during the commit and warming, there is a max of two searchers 
open.  I'd be okay with delaying the commit until there is only one searcher 
open.  Is there a way to programmatically determine how many searchers are 
currently open?
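
For reference, the existing knob in solrconfig.xml only caps warming
searchers, not open ones:

<maxWarmingSearchers>2</maxWarmingSearchers>

and it rejects commits that would exceed the limit rather than delaying them,
so it is only part of what I'm after.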

-Michael


Re: [SoldCloud] Slow indexing

2012-03-05 Thread Markus Jelsma
On Mon, 5 Mar 2012 11:26:20 -0500, Mark Miller  
wrote:

On Mar 5, 2012, at 10:01 AM, dar...@ontrenet.com wrote:

If one of those 10 indexing nodes goes down or falls out of sync and comes
back, does ZK block the state of indexing until that single node catches
back up?


No - if a node falls out of sync or comes back, the rest of the
cluster continues as normal and the node goes into recovery.

In recovery, the node tries two things to catch up: first it tries to
peer sync - if it's off by less than 100 updates, it will simply
exchange updates with the leader and come back into sync. If it's off
by more than that, it will start buffering updates from the leader,
replicate the full index from the leader, and then apply its buffered
updates to come back in sync.

The only time indexing is stopped for a node is if that node loses
its connection to zookeeper. All other nodes that can still talk to
zookeeper will continue indexing. How soon we consider that we can't
talk to zookeeper depends on the zk session timeout - I have to look,
but for an embedded ensemble, we may be defaulting this a little low
currently.


That would suggest that in our case at some point Solr drops the 
connection to ZK and is unable to restore the connection, even after 
restarting Tomcat many times.


I know ZK is running fine and responds with imok when i ask ruok. When 
i restart Tomcat i'll see these bad things in ZK's log:


2012-03-05 17:55:07,084 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - 
Accepted socket connection from /141.105.120.152:52328
2012-03-05 17:55:07,090 [myid:] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@792] - 
Connection request from old client /141.105.120.152:52328; will be 
dropped if server is in r-o mode
2012-03-05 17:55:07,091 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client 
attempting to establish new session at /141.105.120.152:52328
2012-03-05 17:55:07,094 [myid:] - INFO  [SyncThread:0:FileTxnLog@199] - 
Creating new log file: log.1
2012-03-05 17:55:07,107 [myid:] - INFO  
[SyncThread:0:ZooKeeperServer@604] - Established session 
0x135e3ffdb54 with negotiated timeout 1 for client 
/141.105.120.152:52328
2012-03-05 17:55:07,206 [myid:] - INFO  [ProcessThread(sid:0 
cport:-1)::PrepRequestProcessor@617] - Got user-level KeeperException 
when processing sessionid:0x135e3ffdb54 type:delete cxid:0xb 
zxid:0x5 txntype:-1 reqpath:n/a Error 
Path:/live_nodes/cn003.openindex.io:80_solr Error:KeeperErrorCode = 
NoNode for /live_nodes/cn003.openindex.io:80_solr


Solr will not come back up, even with a clean ZK data dir. I'll clear 
the dataDir of one of the stubborn Solr nodes and retry. ... The Solr 
node comes back up, finally. Here's the ZK log:


2012-03-05 17:56:55,939 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - 
Accepted socket connection from /141.105.120.152:36311
2012-03-05 17:56:55,944 [myid:] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@792] - 
Connection request from old client /141.105.120.152:36311; will be 
dropped if server is in r-o mode
2012-03-05 17:56:55,944 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@838] - Client 
attempting to establish new session at /141.105.120.152:36311
2012-03-05 17:56:55,967 [myid:] - INFO  
[SyncThread:0:ZooKeeperServer@604] - Established session 
0x135e3ffdb540001 with negotiated timeout 1 for client 
/141.105.120.152:36311
2012-03-05 17:56:56,058 [myid:] - INFO  [ProcessThread(sid:0 
cport:-1)::PrepRequestProcessor@617] - Got user-level KeeperException 
when processing sessionid:0x135e3ffdb540001 type:delete cxid:0x3 
zxid:0x6b txntype:-1 reqpath:n/a Error 
Path:/live_nodes/cn003.openindex.io:80_solr Error:KeeperErrorCode = 
NoNode for /live_nodes/cn003.openindex.io:80_solr


I'm not sure about the problem but it looks like Solr won't start fine 
if there's an issue after listing all segment files. It may not be a ZK 
or cloud problem at all. Any suggestions?


Thanks



- Mark Miller
lucidimagination.com




need input - lessons learned or best practices for data imports

2012-03-05 Thread geeky2
hello all,

we are approaching the time when we will move our first solr core into a
more "production-like" environment.  as a precursor to this, i am attempting
to write some documents on impact assessment and batch load / data import
strategies.

does anyone have processes or lessons learned - that they can share?

maybe a good place to start - but not limited to - would be how do people
monitor data imports (we are using a very simple DIH hooked to an informix
schema) and send out appropriate notifications?

thank you for any help or suggestions,
mark


--
View this message in context: 
http://lucene.472066.n3.nabble.com/need-input-lessons-learned-or-best-practices-for-data-imports-tp3801327p3801327.html
Sent from the Solr - User mailing list archive at Nabble.com.


ngram synonyms & dismax together

2012-03-05 Thread Husain, Yavar


I have ngram-indexed 2 fields (columns in the database) and the third one is my 
full-text field, which is also my default text field. While querying I use the 
dismax handler and specify in it both the ngrammed fields with certain boost 
values and the full-text field with a certain boost value. 

The problem: if I don't use dismax and just search the full-text field (i.e. 
the default field specified in the schema), synonyms work correctly, i.e. "ca" 
returns all results where "california" occurs. Whereas if I use dismax, "ca" is 
also searched in the ngrammed fields, returns partial matches of the word "ca", 
and never reaches the synonym part at all.

I want synonyms to apply in every case, so how should I go about it?



Re: Couple issues with edismax in 3.5

2012-03-05 Thread Ahmet Arslan
> I also get an issue with "." with
> edismax.
> 
> For example: Dr. Smith gives me different results than "dr
> Smith"

I believe this is related to analysis (rather than the query parser). You can 
inspect the output at admin/analysis.jsp. 

What happens when you switch to &defType=lucene? Does Dr. Smith yield the same 
results as dr Smith?


Re: [SoldCloud] Slow indexing

2012-03-05 Thread Mark Miller

On Mar 5, 2012, at 10:01 AM, dar...@ontrenet.com wrote:

> If one of those 10 indexing nodes goes down or falls out of sync and comes
> back, does ZK block the state of indexing until that single node catches
> back up?

No - if a node falls out of sync or comes back, the rest of the cluster 
continues as normal and the node goes into recovery.

In recovery, the node tries two things to catch up: first it tries to peer sync 
- if it's off by less than 100 updates, it will simply exchange updates with the 
leader and come back into sync. If it's off by more than that, it will start 
buffering updates from the leader, replicate the full index from the leader, 
and then apply its buffered updates to come back in sync.

The only time indexing is stopped for a node is if that node loses its 
connection to zookeeper. All other nodes that can still talk to zookeeper will 
continue indexing. How soon we consider that we can't talk to zookeeper depends 
on the zk session timeout - I have to look, but for an embedded ensemble, we 
may be defaulting this a little low currently.

- Mark Miller
lucidimagination.com



Re: Help with Synonyms

2012-03-05 Thread Donald Organ
>
> Hi Donald,
>
> Try to remove tokenizerFactory="KeywordTokenizerFactory" in your synonym
> filter definition, because I think you would want to tokenize the synonym
> settings in synonyms.txt as "floor" / "locker" => "storage" / "locker". But
> if you set it to KeywordTokenizer, it will be a map of "floor locker" =>
> "storage locker", and as you are using WhitespaceTokenizer for your
> <tokenizer> in <analyzer>, then if you try to index "floor locker", it will
> be "floor"/"locker" (not "floor locker"), and as a result it will not match
> your synonym map.
>
> Aside, I recommend that you set the <charFilter> - <tokenizer> - <filter>
> chain in the natural order in <fieldType>, though if those are wrong it
> won't be the cause of the problem at all.
>
>
>
OK so I have updated my schema.xml to the following:

<fieldType name="..." class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="..." protected="protwords.txt"/>
  </analyzer>
</fieldType>
  .

I am still getting results for storage locker  and no results for floor
locker

synonyms.txt still looks like this:

floor locker=>storage locker


Re: [SoldCloud] Slow indexing

2012-03-05 Thread darren
A question relating to this.

If you are running a single ZK node, but say 10 other nodes, and then
index in parallel on each of those nodes, will ZK be hit by all 10
indexing nodes constantly, i.e. be very chatty?

If one of those 10 indexing nodes goes down or falls out of sync and comes
back, does ZK block the state of indexing until that single node catches
back up?

>
> On Mar 4, 2012, at 5:43 PM, Markus Jelsma wrote:
>
>> everything stalls after it lists all segment files and that a ZK state
>> change has occured.
>
> Can you get a stack trace here? I'll try to respond to more tomorrow. What
> version of trunk are you using? We have been making fixes and improvements
> all the time, so need to get a frame of reference.
>
> When a client node cannot talk to zookeeper, because it may not know
> certain things it should (what if a leader changes?), it must reject
> updates (searches will still work). Why can't the node talk to zookeeper?
> Perhaps the load is so high on the server, it cannot respond to zk within
> the session timeout? I really don't know yet. When this happens though, it
> forces a recovery when/if the node can reconnect to zookeeper.
>
> We have not yet started on optimizing bulk indexing - currently an update
> is added locally *before* sending updates in parallel to each replica.
> Then we wait for each response before responding to the client. We plan to
> offer more optimizations and options around this.
>
> Feed back will be useful in making some of these improvements.
>
>
> - Mark Miller
> lucidimagination.com
>



Re: Retrieving multiple levels with hierarchical faceting in Solr

2012-03-05 Thread Erick Erickson
I should have read more carefully. Why not just use facet.query? They are
treated completely independently, so you can specify something like:
facet.query=field:0*
facet.query=field:1_foovalue*
and you can even specify facet.field as well, they all just come back
as separate sections in the facets list of the response.

Best
Erick

On Sun, Mar 4, 2012 at 7:52 AM, adrian.strin...@holidaylettings.co.uk
 wrote:
> At the moment, I'm just using a multi-valued string field.  I was previously 
> using a text field that was defined as follows:
>
>        <fieldType name="..." class="solr.TextField">
>                <analyzer>
>                        <tokenizer class="..."/>
>                        <filter class="solr.PatternReplaceFilterFactory"
>                         pattern="\+" replacement=" " />
>                </analyzer>
>        </fieldType>
>
> I've tried to have a look on the net, but I can't seem to find any 
> documentation on the difference between specifying a facet prefix using 
> f.facetname.facet.prefix and using local params - if anyone could point me in 
> the right direction it'd be much appreciated.  From what I can see, it 
> appears that my Solr instance just ignores the prefix when supplied via local 
> params.  I suspect that, as I don't know much about Solr, I'm probably just 
> searching for the wrong phrases :(
>
> Regards,
> Ade
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 03 March 2012 00:36
> To: solr-user@lucene.apache.org
> Subject: Re: Retrieving multiple levels with hierarchical faceting in Solr
>
> A lot depends on the analysis chain your field is actually using, that is the 
> tokens that are in the index. Can you supply the schema.xml file for the 
> field in question?
>
> Best
> Erick
>
> On Fri, Mar 2, 2012 at 7:21 AM, adrian.strin...@holidaylettings.co.uk
>  wrote:
>> I've got a hierarchical facet in my Solr collection; root level values are 
>> prefixed with 0;, and the next level is prefixed 1_foovalue;.  I can get the 
>> root level easily enough, but when foovalue is selected I need to retrieve 
>> the next level in the hierarchy while still displaying all of the options in 
>> the root level.  I can't work out how to request either two different 
>> prefixes for the facet, or the same facet twice using different prefixes.
>>
>> I've found a couple of discussions online that suggest I ought to be able to 
>> set the prefix using local params:
>>
>>    facet.field={!prefix=0;}foo
>>    facet.field={!prefix=1_foovalue; key=bar}foo
>>
>> but the prefix seems to be ignored, as the facet returned contains all 
>> values.  Should I just copy the field (e.g. <copyField source="foo"
>> dest="bar"/>) so I can query
>> using f.foo.facet.prefix=0;&f.bar.facet.prefix=1_foovalue;, or is there 
>> another way I can request the two different levels of my facet hierarchy at 
>> once?
>>
>> I'm using Solr 3.5.
>>
>> Thanks,
>> Ade
>>


JoinQuery and document score problem

2012-03-05 Thread Stefan Moises

Hi list,

we are using the kinda new JoinQuery feature in Solr 4.x trunk and are 
facing a problem (also in Solr 3.5 with the JoinQuery patch applied) ...
We have documents with a parent-child relationship where a parent can 
have any number of children, parents being identified by the field "parentid".


Now after a join (from the field "parentid" to "id") to get the parent 
documents only (and to filter out the "variants"/childs of the parent 
documents), the document score gets "lost" - all the returned documents 
have a score of "1.0" - if we remove the join from the query, the scores 
are fine again. Here is an example call:


http://localhost:8983/solr4/select?qt=dismax&q={!join%20from=parentid%20to=id}foo&fl=id,title,score

All the results now have a score of "1.0", which makes the order of 
results pretty much random and the scoring therefore useless... :(
(the same applies for the "standard" query type, so it's not the dismax 
parser)


I can't imagine this is "expected" behaviour...? Is there an easy way to 
get the "right" scores for the joined documents (e.g. using the max. 
score of the childs)? Can the scoring of "joined" documents be 
configured somewhere / somehow?


Thanks a lot in advance,
best regards,
Stefan

--
Best regards from Nürnberg,
Stefan Moises

***
Stefan Moises
Senior Software Developer
Head of Module Development

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax:  0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***




Re: Remove underscore char when indexing and query problem

2012-03-05 Thread Erick Erickson
Look at the admin/analysis page and be sure to check the "verbose"
checkboxes. that'll show you what each filter does to the input. My
guess is that WordDelimiterFilterFactory has different parameters
and that's what you're seeing. WDFF can be tricky to understand...

If that's not helpful, you need to provide your field definition.

Best
Erick

On Fri, Mar 2, 2012 at 10:52 PM, Floyd Wu  wrote:
> Hi there,
>
> I have a document and its title is "20111213_solr_apache conference report".
>
> When I use the analysis web interface to see what tokens Solr actually
> produces, the following is the result:
>
> term text: 20111213_solr | apache | conference | report
>
> Why is 20111213_solr kept as one token, and why isn't the "_" char removed?
> (I've added "_" as a stop word in stopwords.txt)
>
> I did another test with "20111213_solr_apache conference_report".
> As you can see, the difference is that I added an underscore char between
> conference and report. Analyzing this string gives:
>
> term text: 20111213_solr | apache | conference | report
>
> This time the underscore char between conference and report is removed!
>
> Why? How can I make Solr remove the underscore char and behave consistently?
> Please help with this.
>
> Thanks in advance.
>
> Floyd


Re: errata for solr tutorial

2012-03-05 Thread Jan Høydahl
Hi,

Thanks for reporting. This is fixed now on the staging site, will be set live 
soon.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 1. mars 2012, at 16:50, Nicolai Scheer wrote:

> Hi!
> 
> Having just worked through the solr tutorial
> (http://lucene.apache.org/solr/tutorial.html) I think I found two minor
> "bugs":
> 
> 1.
> The "delete by query" example
> 
> java -Ddata=args -jar post.jar ""
> 
> should read
> 
> java -Ddata=args -jar post.jar "name:DDR"
> 
> 2.
> The link to the mailing lists at the end of the tutorial seems to be
> dead as of now:
> 
> http://lucene.apache.org/solr/mailing_lists.html
> 
> Greetings,
> 
> Nico



Re: How to define a multivalued string type "langid.langsField" in solrconfig.xml

2012-03-05 Thread Jan Høydahl
Hi,

The documentation for this features says:
> langid.langsField
> 
> Specifies the field to output a list of detected languages into. This must be 
> a multiValued String field. If you use langid.map.individual, each detected 
> language will be added to this field.
> 
Your langid.langsField field must be defined in the schema as multiValued, not 
in the requesthandler config.
As far as I remember, the langid.map.individual will only take effect if you 
have langid.map=true, i.e. you cannot detech langs from individual fields and 
have them be added to the langsField without also doing mapping.
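
A sketch of the schema side (using the field name from your config below):

<field name="language_s" type="string" indexed="true" stored="true" multiValued="true"/>

together with langid.map=true alongside langid.map.individual=true in the
processor's defaults.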

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 27. feb. 2012, at 05:09, bing wrote:

> Hi, all, 
> 
> I am using tika language detection. It is said that, if "langid.langsField"
> is set as multivalued string, and then a list of languages can be stored for
> the fields specified in "langid.fl". 
> 
> Following is how I configure the processor in solrconfig.xml. I tried using
> "text" only, and the detected result is language_s="zh_tw"; for
> "attr_stream_name", the result is language_s="en". I was expecting, when
> adding both "text" and  "attr_stream_name", the result would look like
> language_s="en,zh_tw". However, I failed to see the result. 
> 
> <updateRequestProcessorChain name="...">
>   <processor
>    class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
>     <lst name="defaults">
>       <str name="langid.fl">text,attr_stream_name</str>
>       <str name="langid.langsField">language_s</str>
>       <bool name="langid.map.individual">true</bool>
>     </lst>
>   </processor>
> </updateRequestProcessorChain>
> 
> I will be grateful if anyone can point out my mistake or give some hints on
> how to do this correctly. Thank you. 
> 
> Best Regards, 
> Bing 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-define-a-multivalued-string-type-langid-langsField-in-solrconfig-xml-tp3779602p3779602.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Date search by specific month and day

2012-03-05 Thread Jan Høydahl
I've seen this question several times on the list.
Perhaps it could be beneficial to create a new Date field that also supports 
year-only, year-month, year-month-day etc. queries? It could be called 
ExtendedDateField or something, and when indexing a date "YYYY-MM-DDTHH:mm:ssZ" 
it would individually store multiple versions in the index, perhaps using 
poly-field? It could work exactly like DateField for full date input, but also 
allow queries like myDate:2012, myDate:2012-03, myDate:2012-03-05, myDate:[1991 
TO 2012] etc. 
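
In the meantime, the parse-it-out approach Erick describes below is just a
couple of extra int fields, e.g. (a sketch with hypothetical names):

<field name="pub_month" type="int" indexed="true" stored="false"/>
<field name="pub_day" type="int" indexed="true" stored="false"/>

q=*:*&fq=pub_month:3&fq=pub_day:5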

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 24. feb. 2012, at 03:52, Erick Erickson wrote:

> I think your best bet is to parse out the relevant units and index
> them independently. But this is probably only a few ints
> per record, so it shouldn't be much of a resource hog
> 
> Best
> Erick
> 
> On Thu, Feb 23, 2012 at 5:24 PM, Kurt Nordstrom  
> wrote:
>> Hello all!
>> 
>> We have a situation involving date searching that I could use some seasoned
>> opinions on. What we have is a collection of records, each containing a Solr
>> date field by which we want search on.
>> 
>> The catch is that we want to be able to search for items that match a
>> specific day/month. Essentially, we're trying to implement a "this day in
>> history" feature for our dataset, so that users would be able to put in a
>> date and we'd return all matching records from the past 100 years or so.
>> 
>> Is there a way to perform this kind of search with only the basic Solr date
>> field? Or would I have parse out the month and day and store them in
>> separate fields at indexing time?
>> 
>> Thanks for the help!
>> 
>> -Kurt



Polish language in Solr

2012-03-05 Thread Agnieszka Kukałowicz
Hi,

I have question about Polish language in Solr.

There are 2 options: StempelPolishStemFilterFactory or
HunspellStemFilterFactory with a Polish dictionary. I've made some tests, but
the results do not satisfy me. StempelPolishStemFilterFactory is very
fast during indexing, but the search quality is not exactly what I
expect. In turn, HunspellStemFilterFactory is better for searching, but
indexing Polish text with it is very slow.

For example indexing 100k documents with StempelPolishStemFilterFactory
takes only 10 min (150 doc/sec), with HunspellStemFilterFactory - 1h 20
min, so it is only 18-20 doc/sec. (server with 8 cores, 24GB RAM, index on
SSD disk).
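
For reference, the two analyzer setups I compared look roughly like this
(type names and dictionary paths are illustrative):

<fieldType name="text_pl_stempel" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StempelPolishStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_pl_hunspell" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="pl_PL.dic" affix="pl_PL.aff" ignoreCase="true"/>
  </analyzer>
</fieldType>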

Is it possible to speed up indexing with hunspell? What should I optimize?

Have you any experience with Hunspell?

I use Solr 4.0.

Best regards
Agnieszka


Re: A sorting question.

2012-03-05 Thread Luis Cappa Banda
Sometimes the solution is so easy that I can't see it right in front of me.

Thanks, Mikhail!

2012/3/3 Mikhail Khludnev 

> Hi Luis,
>
> Do you mean
>
> q=id:(A^10+OR+B^9+OR+C^8+OR...)
> I'm not sure whether it woks but
>
> q=id:A^10+OR+id:B^9+OR+id:C^8+OR...)
>
> definitely does
>
> On Fri, Mar 2, 2012 at 1:13 PM, Luis Cappa Banda  >wrote:
>
> > Hello!
> >
> > Just a brief question. I'm querying by my docs ids to retrieve the whole
> > document data from them, and I would like to retrieve them in the same
> > order as I queried. Example:
> >
> > q=id:(A+OR+B+OR+C+OR...)
> >
> > And I would like to get a response with a default order like:
> >
> > response:
> >
> >    docA: {
> >
> > }
> >
> >
> >    docB: {
> >
> > }
> >
> >
> >    docC: {
> >
> > }
> >
> >    Etc.
> >
> >
> > The default response returns the documents in a different order, I suppose
> > that is due to Solr's internal scoring algorithm. The ids are not numeric,
> > so there is no option to order them with numeric logic. Any suggestion?
> >
> > Thanks a lot!
> >
> >
> >
> > Luis Cappa.
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Lucid Certified
> Apache Lucene/Solr Developer
> Grid Dynamics
>
> 
>  
>


Re: [SoldCloud] Slow indexing

2012-03-05 Thread Markus Jelsma
On Sun, 4 Mar 2012 21:09:30 -0500, Mark Miller  
wrote:

On Mar 4, 2012, at 5:43 PM, Markus Jelsma wrote:

everything stalls after it lists all segment files and that a ZK 
state change has occured.


Can you get a stack trace here? I'll try to respond to more tomorrow.
What version of trunk are you using? We have been making fixes and
improvements all the time, so need to get a frame of reference.


I updated trunk this Saturday, March 3rd. The stack traces I provided 
are all I got. This point of restart and stalling does not produce a 
stack trace at all. This is the final part of the info log:


[lots of segment files]...fdt, _135.fdt, _199_nrm.cfs, _18s.tvd, _zm.fdx,
_18s.tvf, _196_0.frq, _135.tvf, _195.fdt, _135.tvd, _18n.tvf, _18n.tvd,
_18y_0.tim, _18s.tvx, _zm.fnm, _187.tvx, _10g.fnm, _13t.per, _195.fdx]
2012-03-04 22:39:15,061 INFO [solr.core.SolrCore] - [recoveryExecutor-2-thread-1] - : newest commit = 31
2012-03-04 22:39:16,052 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred
2012-03-04 22:39:16,585 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred
2012-03-04 22:39:36,652 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred
2012-03-04 22:39:52,220 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred




When a client node cannot talk to zookeeper, because it may not know
certain things it should (what if a leader changes?), it must reject
updates (searches will still work). Why can't the node talk to
zookeeper? Perhaps the load is so high on the server, it cannot
respond to zk within the session timeout? I really don't know yet.
When this happens though, it forces a recovery when/if the node can
reconnect to zookeeper.


Sounds likely. There are a lot of time outs in ZK's log such as :

EndOfStreamException: Unable to read additional data from client 
sessionid 0x135dfcfda12, likely client has closed socket
at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)

at java.lang.Thread.run(Thread.java:662)
2012-03-04 22:37:38,956 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed 
socket connection for client /141.105.120.151:49833 which had sessionid 0x135dfcfda12
2012-03-04 22:37:39,077 [myid:] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught 
end of stream exception
EndOfStreamException: Unable to read additional data from client 
sessionid 0x135dfcfda10, likely client has closed socket
at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)

at java.lang.Thread.run(Thread.java:662)
2012-03-04 22:37:39,077 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed 
socket connection for client /141.105.120.153:36794 which had sessionid 
0x135dfcfda10
2012-03-04 22:37:48,000 [myid:] - INFO  
[SessionTracker:ZooKeeperServer@334] - Expiring session 
0x135dfcfda11, timeout of 1ms exceeded


The problems seem to have a lot to do with ZK as we always see bad 
messsages in its log around the time Solr is going crazy.




We have not yet started on optimizing bulk indexing - currently an
update is added locally *before* sending updates in parallel to each
replica. Then we wait for each response before responding to the
client. We plan to offer more optimizations and options around this.



This is indeed a bit of a problem but at least it's indexing. If 
there's any additional information you need or want us to pull in new 
commits and try again we're happy to give it a shot.




Feed back will be useful in making some of these improvements.


- Mark Miller
lucidimagination.com


--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


XSLT Response Writer and content transformation

2012-03-05 Thread darul
Hello,

Using native XSLT Response Writer, we may need to alter content before
processing xml solr output as a RSS Feed.

Example (trivial one...):

<doc>
  <str name="content">bla bla bla</str>
</doc>

After processing content:

<doc>
  <str name="content">bla bla bla bla bla bla bla bla bla bla bla bla</str>
</doc>

Have you any ideas on how to implement a custom function in xslt or before
in XsltResponseWriter.

I would like get this code in a java class and call it for content
processing

Thanks,

Jul

--
View this message in context: 
http://lucene.472066.n3.nabble.com/XSLT-Response-Writer-and-content-transformation-tp3800251p3800251.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Couple issues with edismax in 3.5

2012-03-05 Thread William Bell
I also get an issue with "." with edismax.

For example: Dr. Smith gives me different results than "dr Smith"

On Thu, Mar 1, 2012 at 10:18 PM, Way Cool  wrote:
> Thanks Ahmet! That's good to know someone else also tried to make  phrase
> queries to fix multi-word synonym issue. :-)
>
>
> On Thu, Mar 1, 2012 at 1:42 AM, Ahmet Arslan  wrote:
>
>> > I don't think mm will help here because it defaults to 100%
>> > already by the
>> > following code.
>>
>> Default behavior of mm has changed recently. So it is a good idea to
>> explicitly set it to 100%. Then all of the search terms must match.
>>
>> > Regarding multi-word synonym, what is the best way to handle
>> > it now? Make
>> > it as a phrase with " or adding -  in between?
>> > I don't like index time expansion because it adds lots of
>> > noises.
>>
>> Solr wiki advices to use them at index time for various reasons.
>>
>> "... The recommended approach for dealing with synonyms like this, is to
>> expand the synonym when indexing..."
>>
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
>>
>> However index time synonyms has its own problems as well. If you add a new
>> synonym, you need to re-index those documents that contain this  newly
>> added synonym.
>>
>> Also highlighting highlights whole phrases. For example you have :
>>    us, united states
>> Searching for states will highlight both united and states.
>> Not sure but this seems fixed with LUCENE-3668
>>
>> I was thinking to have query expansion module to handle multi-word
>> synonyms at query time only. Either using o.a.l.search.Query manipulation
>> or String manipulation. Similar to Lukas' posting here
>> http://www.searchworkings.org/forum/-/message_boards/view_message/146097
>>
>>
>>
>>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Boost dependency

2012-03-05 Thread William Bell
What I would like to do is ONLY boost if there is a match on terms in
SOLR 3.5. For example:

1. q=smith&defType=dismax&qf=user_query&sort=score desc
2. I want to add a boost by distance (closest = highest score), ONLY
if there is a hit on #1.

This one just multiplies the "smith" score by recip(geodist(),5,25,25)...
 q=smith&defType=dismax&qf=user_query&sort=score
desc&bq=_val_:"recip(geodist($pt),5,25,25)"&sfield=lat_lon&pt=39.740112,-104.984856

I want something like:
  q=smith&defType=dismax&qf=user_query&sort=score
desc&bq=_val_:"product(tf(user_query,$q),recip(geodist(),5,25,25))"&sfield=lat_lon

TF() is not available in SOLR 3.5.

I also tried:

 
q=smith&defType=dismax&qf=user_query&sort=product(query($qq),recip(geodist($pt),1,25,25))
desc&sfield=lat_lon&qq={!lucene}user_query:smith&pt=39.740112,-104.984856

Thoughts?
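
For what it's worth, once tf() is available (it is in the 4.0 trunk function
queries), a sketch of the conditional boost using edismax's multiplicative
boost parameter (field names as above; this is not 3.5 syntax):

q=smith&defType=edismax&qf=user_query
  &boost=if(tf(user_query,'smith'),recip(geodist(),5,25,25),1)
  &sfield=lat_lon&pt=39.740112,-104.984856

if() returns the boost expression only when tf() is non-zero, so the distance
factor applies only to documents that actually match "smith" in user_query.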


-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076