old searchers not closing after optimize or replication

2011-04-20 Thread Bernd Fehling

Hello list,

we have a problem where old searchers often do not close after an
optimize (on the master) or a replication (on the slaves), and we
therefore end up with huge index volumes on disk.
The only solution so far is to stop and restart Solr, which cleans
everything up successfully, but that can only be a workaround.

Is the parameter waitSearcher=false an option to solve this?

Any hints what to check or to debug?

We use Apache Solr 3.1.0 on Linux.

Regards
Bernd


How could each core share configuration files

2011-04-20 Thread kun xiong
Hi all,

Currently in my project, most of the core configuration files are the
same (solrconfig.xml, dataimport.properties, ...), yet each copy is
duplicated in its core's own folder.

I am wondering how I could put the common ones in one folder, which each
core could share, and still keep the differing ones in their own folders.

Thanks

Kun


Re: How could each core share configuration files

2011-04-20 Thread lboutros
Perhaps this could help :

http://lucene.472066.n3.nabble.com/Shared-conf-td2787771.html#a2789447
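
One way this is commonly done (a sketch - core names and paths are
illustrative) is to point each core at the same files via the config and
schema attributes in solr.xml:

  <solr persistent="true" sharedLib="lib">
    <cores adminPath="/admin/cores">
      <!-- both cores load the same solrconfig.xml/schema.xml; whether a
           relative path resolves here depends on your layout, so an
           absolute path may be safer -->
      <core name="core0" instanceDir="core0"
            config="/opt/solr/shared/solrconfig.xml"
            schema="/opt/solr/shared/schema.xml"/>
      <core name="core1" instanceDir="core1"
            config="/opt/solr/shared/solrconfig.xml"
            schema="/opt/solr/shared/schema.xml"/>
    </cores>
  </solr>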

Ludovic.

2011/4/20 kun xiong [via Lucene] 
ml-node+2841801-1701787156-383...@n3.nabble.com

 [...]



-
Jouve
France.

RE: Custom Sorting

2011-04-20 Thread Michael Owen

Ok, thank you for the discussion. As I suspected, it's not possible
within performance limits.
I think the way to go is to store some more stats at index time, and use
them in boost queries. :)
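
A sketch of what that could look like (field names hypothetical): index a
numeric popularity field per document, then add it as a boost function in
a dismax request, e.g.

  ?defType=dismax&q=some+search&qf=title+body&bf=log(sum(popularity,1))

bf adds the function's value to the score, so more popular documents
float up without needing a custom sort.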
Thanks
Mike

 Date: Tue, 19 Apr 2011 15:12:00 -0400
 Subject: Re: Custom Sorting
 From: erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 
 As I understand it, sorting by field is what caches are all
 about. You have a big list in memory of all of the terms for
 a field, indexed by Lucene doc ID so fetching the term to
 compare by doc ID is fast, and also why the caches need
 to be warmed, and why sort fields should be single-valued.
 
 If you try to do this yourself and fetch data from each document,
 you can incur a huge performance hit, since you'll be seeking
 all over your disk...
 
 Score is special though since it's transient. Internally, all Lucene
 has to do is keep track of the top N scores encountered where
 N is something like start + queryResultWindowSize, this
 latter from solrconfig.xml, with no seeks to disk at all...
 
 Best
 Erick
 
 On Tue, Apr 19, 2011 at 2:50 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
  On 4/19/2011 1:43 PM, Jan Høydahl wrote:
 
  Hi,
 
  Not possible :)
  Lucene compares each matching document against the query and produces a
  score for each.
  Documents are not compared to each other like a normal sort, that would be
  way too costly.
 
  That might be true for sort by 'score' (although even if you have all the
  scores, it still seems like some kind of sort must be necessary to see which
  comes first), but when you sort by a field value, which is also possible,
  Lucene must be doing some kind of 'normal sort' algorithm, no?  Ah, I guess
  it could just be using each term's position in the index, which is available
  in constant time, always kept track of in an index? Maybe, I don't know?
 
 
 
  

Re: TikaEntityProcessor

2011-04-20 Thread firdous_kind86
hi, I asked that :)

I didn't get that.. what dependencies?

I am using Solr 1.4 and Tika 0.9.

I replaced tika-core 0.9 and tika-parsers 0.9 at /contrib/extraction/lib
and also replaced the old version of dataimporthandler-extras with
apache-solr-dataimporthandler-extras-3.1.0.jar,

but still the same problem..

Someone pointed me to bug SOLR-2116, but I guess that fix is only for Solr 3.1.



Selecting (and sorting!) by the min/max value from multiple fields

2011-04-20 Thread jmaslac
Hello,

the short question is this - is there a way for a search to return a field that
is not defined in the schema but is the minimum/maximum value of several
(int/float) fields in the document? (and what would that search look like?)

Longer explanation: I have products, and each of them can have several
prices (price for cash, price for credit cards, coupon price and so on) -
not every product has all the price options. (Don't ask why - that's the use
case:) )

   <field name="priceCash" type="tfloat" indexed="true" stored="true" />
   <field name="priceCreditCard" type="tfloat" indexed="true" stored="true" />
   <field name="priceCoupon" type="tfloat" indexed="true" stored="true" />
+2 more

Is there a way to ask "give me the products containing, for example, 'sony'",
and in the results return the minimal price of all possible prices (for
each product), and SORT the results by that (minimal) price?

I know I can calculate the minimal price at import/index time and store it
in one separate field but the idea is that users will have checkboxes in
which they could say - i'm only interested in products that have the
priceCreditCard and priceCoupon, show me the smaller of those two and sort
by that value.

My idea is something like this:
?q=sony&minPrice:min(priceCash,priceCreditCard,priceCoupon...)
(the field minPrice is not defined in the schema but should be returned in
the results)

For searching, this actually isn't a problem, as I can easily compare the
prices programmatically and present them to the user. The problem is
sorting - I could also do that programmatically, but that would mean I'd
have to pull out all the results the query returned (which can of course be
quite big) and then sort them, so that is an option I would naturally like
to avoid.

Don't know if I'm asking too much of Solr:) but I can see the usefulness of
something like this in examples other than mine.
Hope the question is clear, and if I'm going about things completely the
wrong way please point me in the right direction.
(If there is a similar question asked somewhere else please redirect me - I
didn't find it)

Help much appreciated!

Josip



Re: Selecting (and sorting!) by the min/max value from multiple fields

2011-04-20 Thread Tanguy Moal

Hello,

Have you tried reading : 
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function


From that page I would try something like :
http://host:port/solr/select?q=sony&sort=min(min(priceCash,priceCreditCard),priceCoupon)+asc&rows=10&indent=on&debugQuery=on

Is that of any help ?

--
Tanguy

On 04/20/2011 09:41 AM, jmaslac wrote:

[...]

Saravanan Chinnadurai/Actionimages is out of the office.

2011-04-20 Thread Saravanan . Chinnadurai
I will be out of the office starting  20/04/2011 and will not return until
21/04/2011.

Please email to itsta...@actionimages.com  for any urgent issues.


Action Images is a division of Reuters Limited and your data will therefore be 
protected
in accordance with the Reuters Group Privacy / Data Protection notice which is 
available
in the privacy footer at www.reuters.com
Registered in England No. 145516   VAT REG: 397000555


RE: How could each core share configuration files

2011-04-20 Thread Ephraim Ofir
I just use soft-links...
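
For example (layout and paths purely illustrative):

  mkdir -p /opt/solr/shared/conf
  ln -s /opt/solr/shared/conf/solrconfig.xml /opt/solr/cores/core0/conf/solrconfig.xml
  ln -s /opt/solr/shared/conf/dataimport.properties /opt/solr/cores/core0/conf/dataimport.properties

Each core keeps its own conf/ directory for the files that differ and
links to the shared copies for the rest.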

Ephraim Ofir

-Original Message-
From: lboutros [mailto:boutr...@gmail.com] 
Sent: Wednesday, April 20, 2011 10:09 AM
To: solr-user@lucene.apache.org
Subject: Re: How could each core share configuration files

Perhaps this could help :

http://lucene.472066.n3.nabble.com/Shared-conf-td2787771.html#a2789447

Ludovic.

[...]


Re: Selecting (and sorting!) by the min/max value from multiple fields

2011-04-20 Thread jmaslac
Tanguy, thanks for the answer.

Yes, I have already tried that, but the problem is that the min() function
is not yet available (it is slated for Solr 3.2).
:(


Btw. in my original post I asked if the query could return a new field with
this computed minimal value in the results - that is redundant, I'm only
interested in the sorting part of the question.
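
One possible stopgap until min() lands, assuming the arithmetic functions
sum, sub, abs and div that 3.1 already ships: min(a,b) equals
(a+b-|a-b|)/2, which as a sort-by-function would look like

  sort=div(sub(sum(priceCash,priceCreditCard),abs(sub(priceCash,priceCreditCard))),2)+asc

(nesting the same pattern again folds in priceCoupon). One caveat: a
document with no value in one of the fields is treated as 0 by the
function, so products missing one of the prices would sort first.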



Tanguy Moal wrote:
 
 [...]


Re: KStemmer for Solr 3.x +

2011-04-20 Thread Ofer Fort
Seems like it isn't. In my installation (1.4.1) I used
LucidKStemFilterFactory, and when switching the solr.war file to the 3.1
one I get:
14:42:31.664 ERROR [pool-1-thread-1]: java.lang.AbstractMethodError:
org.apache.lucene.analysis.TokenStream.incrementToken()Z
at
org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:78)
at
org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:50)
at
org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:606)
at
org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:151)
at
org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1421)
at
org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1309)
at
org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
at
org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1226)
at
org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206)
at
org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
at org.apache.solr.search.QParser.getQuery(QParser.java:142)
at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:84)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
at
org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:52)
at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1169)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
at java.lang.Thread.run(Unknown Source)

when the config is:
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
         ignoreCase="true" expand="true"/> -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="old_stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" preserveOriginal="0"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnNumerics="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.SnowballPorterFilterFactory"
         language="English" protected="protwords.txt"/> -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
            protected="protwords.txt"/>
  </analyzer>

anybody familiar with this issue?

On Sat, Apr 9, 2011 at 7:00 AM, David Smiley (@MITRE.org) dsmi...@mitre.org
 wrote:

 I see no reason why it would not be compatible.

 -
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book



Re: old searchers not closing after optimize or replication

2011-04-20 Thread Erick Erickson
Does this persist? In other words, if you just watch it for
some time, does the disk usage go back to normal?

Because it's typical that your index size will temporarily
spike after the operations you describe as new searchers
are warmed up. During that interval, both the old and new
searchers are open.

Look particularly at your warmup time in the Solr admin page,
that should give you an indication of how long it takes your
warmup to happen and give you a clue about when you should
expect the index sizes to drop again.

How often do you optimize on the master and replicate on the
slave? Because you may be getting into the runaway warmup
problem where a new searcher is opened before the last one
is autowarmed and spiraling out of control.

Hope that helps
Erick

On Wed, Apr 20, 2011 at 2:36 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 [...]



Re: old searchers not closing after optimize or replication

2011-04-20 Thread Bernd Fehling

Hi Erik,

On 20.04.2011 13:56, Erick Erickson wrote:

> Does this persist? In other words, if you just watch it for
> some time, does the disk usage go back to normal?


Only after restarting the whole solr the disk usage goes back to normal.



> Because it's typical that your index size will temporarily
> spike after the operations you describe as new searchers
> are warmed up. During that interval, both the old and new
> searchers are open.


Temporarily yes, but still after a couple of hours after optimize
or replication?



> Look particularly at your warmup time in the Solr admin page,
> that should give you an indication of how long it takes your
> warmup to happen and give you a clue about when you should
> expect the index sizes to drop again.


We have newSearcher and firstSearcher (both with 2 simple queries) and
<useColdSearcher>false</useColdSearcher>
<maxWarmingSearchers>2</maxWarmingSearchers>
The QTime is less than 500 (0.5 seconds).

warmupTime=0 for all autowarming searchers.



> How often do you optimize on the master and replicate on the
> slave? Because you may be getting into the runaway warmup
> problem where a new searcher is opened before the last one
> is autowarmed and spiraling out of control.


We commit new content about every hour and do an optimize once a day.
So replication is also once a day, after the optimize has finished and
the system has settled down.
No commits happen during optimize and replication.


Any further hints?




> Hope that helps
> Erick
>
> [...]



Re: old searchers not closing after optimize or replication

2011-04-20 Thread Erick Erickson
Hmmm, this isn't right. You've pretty much eliminated the obvious
things. What does lsof show? I'm assuming it shows the files are
being held open by your Solr instance, but it's worth checking.
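
For example, something like (the pid lookup is illustrative, adjust to
how you start Solr):

  lsof -p $(pgrep -f start.jar) | grep data/index

should list exactly which index files the JVM still has open.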

I'm not getting the same behavior, admittedly on a Windows box.
The only other thing I can think of is that you have a query that's
somehow never ending, but that's grasping at straws.

Do your log files show anything interesting?

Best
Erick@NotMuchHelpIKnow

On Wed, Apr 20, 2011 at 8:37 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 [...]


Solr - Multi Term highlighting issue

2011-04-20 Thread Ramanathapuram, Rajesh
Hello,

I am dealing with a highlighting issue in SOLR, I will try to explain
the issue.

When I search for a single term in solr, it wraps em tag around the
words I want to highlight, all works well.
But if I search multiple term, for most part highlighting works good and
then for some of the terms, 
the highlight return multiple terms in a sing em tag ...
emsrchtrm1) brbp srchtrm2/em
I expect solr to return highlight terms like... emsrchtrm1/em)
brbp... emsrchtrm2/em

When I search for 'US mec chile', here is how my result appears:
  ... Corboba. (<em>MEC)</b></p><p></p><p><b>CHILE</em>/FOREST FIRES: We
had ... with <em>US</em> and <em>Chile</em> ...,
  (<em>MEC)</b></p><p></p><p><b>US</em> ...

This is what I was expecting it to be:
  ... Corboba. (<em>MEC</em>)</b></p><p></p><p><b><em>CHILE</em>/FOREST
FIRES: We had ... with <em>US</em> and <em>Chile</em> ...,
(<em>MEC</em>)</b></p><p></p><p><b><em>US</em> ...

Here are my query params:
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">26</int>
  <lst name="params">
    <str name="hl.fragsize">10</str>
    <str name="explainOther"/>
    <str name="indent">on</str>
    <str name="hl.fl">story, slug</str>
    <str name="wt">standard</str>
    <str name="hl">on</str>
    <str name="rows">10</str>
    <str name="version">2.2</str>
    <str name="hl.highlightMultiTerm">true</str>
    <str name="fl">*</str>
    <str name="start">0</str>
    <str name="q">mec us chile</str>
    <str name="qt">standard</str>
    <str name="hl.usePhraseHighlighter">true</str>
    <str name="fq">storyid=  X</str>
  </lst>
</lst>

Here are some other links I found in the forum, but no real conclusion:

http://www.lucidimagination.com/search/document/ac64e4f0abb6e4fc/solr_highlighting_question#78163c42a67cb533

I am going to try this patch, which also had no conclusive results:
   https://issues.apache.org/jira/browse/SOLR-1394

Has anyone come across this issue?
Any suggestions on how to fix this issue is much appreciated.


thanks & regards,
Rajesh Ramana 


Re: old searchers not closing after optimize or replication

2011-04-20 Thread Bernd Fehling

Hi Erik,

On 20.04.2011 15:42, Erick Erickson wrote:

> Hmmm, this isn't right. You've pretty much eliminated the obvious
> things. What does lsof show? I'm assuming it shows the files are
> being held open by your Solr instance, but it's worth checking.


I just committed new content 3 times and finally optimized.
Again there are old index files left over.

I then checked on my master: only the newest version of the index files
is listed by lsof. There are no file handles to the old index files, but
the old index files remain in data/index/.
That's strange.

This time replication worked fine and cleaned up old index on slaves.



> I'm not getting the same behavior, admittedly on a Windows box.
> The only other thing I can think of is that you have a query that's
> somehow never ending, but that's grasping at straws.
>
> Do your log files show anything interesting?


Let's see:
- it has the old generation (generation=12) and its files
- and it recognizes that there have been several commits (generation=18)

20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1302159868447


- after 44 minutes of optimizing (over 140 GB and 27.8 million docs) it gets
  the SolrDeletionPolicy onCommit and has the new generation 19 listed.


20.04.2011 14:49:25 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=3
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_j,version=1302159868449,generation=19,filenames=[_3xt.fnm, _3xt.nrm, _3xt.frq, _3xt.fdt, _3xt.tis, _3xt.fdx, segments_j, _3xt.prx, _3xt.tii]
20.04.2011 14:49:25 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1302159868449


- it starts a new searcher and warms it up
- it sends SolrIndexSearcher close


20.04.2011 14:49:29 org.apache.solr.search.SolrIndexSearcher init
INFO: Opening Searcher@2c37425f main
20.04.2011 14:49:29 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
20.04.2011 14:49:29 org.apache.solr.search.SolrIndexSearcher warm
...
20.04.2011 14:49:29 org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher@2c37425f main
20.04.2011 14:49:29 org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={facet=true&start=0&event=newSearcher&q=solr&facet.limit=100&facet.field=f_dcyear&rows=10} hits=96 status=0 QTime=816

20.04.2011 14:49:30 org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={facet=true&start=0&event=newSearcher&q=*:*&facet.limit=100&facet.field=f_dcyear&rows=10} hits=27826100 status=0 QTime=633

20.04.2011 14:49:30 org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
20.04.2011 14:49:30 org.apache.solr.core.SolrCore registerSearcher
INFO: [] Registered new searcher Searcher@2c37425f main
20.04.2011 

Re: TikaEntityProcessor

2011-04-20 Thread Andreas Kemkes
I went down this path unsuccessfully - too many incompatibilities among
versions; some code changes and recompiling were required. See also the thread
"Solr 1.4.1 and Tika 0.9 - some tests not passing" for the remaining issues.
You'll have better luck with the newer Solr 3.1 release, which already uses
Tika 0.8 - still recompiled from source (no changes, as far as I remember). I
never tried the library replacement - I don't think it's possible.

Andreas  




From: firdous_kind86 naturelov...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wed, April 20, 2011 12:38:02 AM
Subject: Re: TikaEntityProcessor

[...]


Re: TikaEntityProcessor

2011-04-20 Thread firdous_kind86
After reading this post I hoped I could achieve it, but I haven't had any
success in almost a week:

http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html#a867572



Multiple Tags and Facets

2011-04-20 Thread Em
Hello,

I watched an online video with Chris Hostetter from Lucid Imagination. He
showed the possibility of having some facets that exclude *all* filters,
while also having some facets that take some of the set filters into
account while ignoring other filters.

Unfortunately the webinar did not explain how they did this, and I wasn't
able to give a filter/facet more than one tag.

Here is an example:

Facets and Filters: DocType, Author

Facet:
- Author
-- George (10)
-- Brian (12)
-- Christian (78)
-- Julia (2)

-Doctype
-- PDF (70)
-- ODT (10)
-- Word (20)
-- JPEG (1)
-- PNG (1)

When clicking on Julia I would like to achieve the following:
Facet:
- Author
-- George (10)
-- Brian (12)
-- Christian (78)
-- Julia (2)
 Julia's Doctypes:
-- JPEG (1)
-- PNG (1)

-Doctype
-- PDF (70)
-- ODT (10)
-- Word (20)
-- JPEG (1)
-- PNG (1)

Another example which adds special options to your GUI could be as
following:
Imagine a fashion store.
If you search for shirt you get a color-facet:

colors:
- red (19)
- green (12)
- blue (4)
- black (2)

As well as a brand-facet:

brands:
- puma (18)
- nike (19)

When I click on the red color-facet, I would like to get the following back:
colors:
- red (19)
- green (12)*
- blue (4)*
- black (2)*

brands:
- puma (18)*
- nike (19)

All those filters marked by an * could be displayed half-transparent or so
- they just show the user that those filter-options exist for his/her search
but aren't included in the result-set, since he/she excluded them by
clicking the red filter.

This case is more interesting if not all red shirts were from nike.
This way you can show the user that e.g. 8 of 19 red shirts are from the
brand you selected ("you see 8 of 19 red shirts").

I hope I explained what I want to achieve.

Thank you!
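
What is described here sounds like Solr's multi-select faceting: tag the
filter with a local param and exclude that tag when computing the facet
(field names hypothetical):

  q=shirt&facet=true
  &fq={!tag=colorTag}color:red
  &facet.field={!ex=colorTag}color
  &facet.field=brand

The color facet is then computed as if the color:red filter were not
applied (so green/blue/black keep their counts), while the brand facet
still reflects it. Tags and exclusions can also be comma-separated lists,
e.g. {!ex=colorTag,brandTag}, which gives a facet more than one tag to
ignore.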



Re: old searchers not closing after optimize or replication

2011-04-20 Thread Erick Erickson
It looks OK, but that still doesn't explain keeping the old files around.
What does the deletionPolicy in your solrconfig.xml look like? It's
possible that you're seeing Solr attempt to keep around several
optimized copies of the index, but that still doesn't explain why
restarting Solr removes them, unless the deletionPolicy gets invoked
at some point and your index files are aging out (I don't know the
internals of deletion well enough to say).
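
For reference, the relevant block in solrconfig.xml looks roughly like
this (the values shown are the usual defaults); if maxOptimizedCommitsToKeep
is raised above 0, old optimized commit points are kept on disk deliberately:

  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="maxCommitsToKeep">1</str>
    <str name="maxOptimizedCommitsToKeep">0</str>
    <!-- commit points can also be retained by age: -->
    <!-- <str name="maxCommitAge">1DAY</str> -->
  </deletionPolicy>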

About optimization: it has become less important with recent code. Once
upon a time it made a substantial difference in search speed. More
recently it has very little impact on search speed, and is used
much more sparingly. Its greatest benefit is reclaiming unused resources
left over from deleted documents. So you might want to avoid the pain
of optimizing (44 minutes!) and only optimize rarely, or if you have
deleted a lot of documents.

It might be worthwhile to try (with a smaller index!) a bunch of optimize
cycles and see if the deletionPolicy idea has any merit. I'd expect
your index to reach a maximum size and stay there once the number of
saved copies of the index was reached...

But otherwise I'm puzzled...

Erick

On Wed, Apr 20, 2011 at 10:30 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 [...]

Re: Solr - Multi Term highlighting issue

2011-04-20 Thread Erick Erickson
Does your configuration have hl.mergeContiguous set to true by any
chance? And what
happens if you explicitly set this to false on your query?

Best
Erick

On Wed, Apr 20, 2011 at 9:43 AM, Ramanathapuram, Rajesh
rajesh.ramanathapu...@turner.com wrote:
 [...]


HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException

2011-04-20 Thread Robert Gründler

Hi all,

I'm getting the following exception when using highlighting on a field
whose analyzer uses HTMLStripCharFilterFactory:

org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token
... exceeds length of provided text sized 21


It seems this is a known issue:

https://issues.apache.org/jira/browse/LUCENE-2208

Does anyone know if there's a fix implemented yet in solr?


thanks!


-robert





Re: Creating a TrieDateField (and other Trie fields) from Lucene Java

2011-04-20 Thread Yonik Seeley
On Tue, Apr 19, 2011 at 11:17 PM, Craig Stires craig.sti...@gmail.com wrote:
 The barrier I have is that I need to build this offline (without using a
 solr server, solrconfig.xml, or schema.xml)

This is pretty unusual... can you share your use case?
Solr can also be run in embedded mode if you can't run a stand-alone
server for some reason.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException

2011-04-20 Thread Robert Muir
Hi, there is a proposed patch uploaded to the issue. Maybe you can
help by reviewing/testing it?

2011/4/20 Robert Gründler rob...@dubture.com:
 [...]






stemming filter analyzers, any favorites?

2011-04-20 Thread Robert Petersen
Stemming filter analyzers... anyone have any favorites for particular
search domains?  Just wondering what people are using.  I'm using the Lucid
KStemmer and having issues.  It seems like it misses a lot of common
stems.  We went to it because of excessively loose matches from
solr.PorterStemFilterFactory.

I understand KStemmer is a dictionary-based stemmer.  It seems to me like
it is missing a lot of common stem reductions, i.e. "Bags" does not match
"Bag" in our searches.

Here is my analyzer stack:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            preserveOriginal="1"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            preserveOriginal="1"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


Re: stemming filter analyzers, any favorites?

2011-04-20 Thread Erick Erickson
You can get a better sense of exactly what transformations occur when
if you look at the analysis page (be sure to check the "verbose"
checkbox).

I'm surprised that "bags" doesn't match "bag"; what does the analysis
page say?

Best
Erick

On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen rober...@buy.com wrote:
 [...]



Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Demian Katz
I've just started experimenting with the solr.KeywordMarkerFilterFactory in
Solr 3.1, and I'm seeing some strange behavior.  It seems that every word
subsequent to a protected word is also treated as being protected.

For testing purposes, I have put the word "spelling" in my protwords.txt.  If
I do a test for "spelling bees" in the analyze tool, the stemmer produces
"spelling bees" - nothing is stemmed.  But if I do a test for "bees spelling",
I get "bee spelling", the expected result with "bees" stemmed but "spelling"
left unstemmed.  I have tried extended examples - in every case I tried, all
of the words prior to "spelling" get stemmed, but none of the words after
"spelling" get stemmed.  When turning on the verbose mode of the analyze
tool, I can see that the settings of the keyword attribute introduced by
solr.KeywordMarkerFilterFactory correspond with the stemming behavior... so
I think the solr.KeywordMarkerFilterFactory component is to blame, and not
anything later in the analysis chain.

Any idea what might be going wrong?  Is this a known issue?

Here is my field type definition, in case it makes a difference:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

thanks,
Demian


Re: Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Yonik Seeley
On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz demian.k...@villanova.edu wrote:
 I've just started experimenting with the solr.KeywordMarkerFilterFactory in 
 Solr 3.1, and I'm seeing some strange behavior.  It seems that every word 
 subsequent to a protected word is also treated as being protected.

You're right!  This was broken by LUCENE-2901 back in Jan.
I've opened this issue:  https://issues.apache.org/jira/browse/LUCENE-3039

The easiest short-term workaround for you would probably be to create
a custom filter that looks like KeywordMarkerFilter before the
LUCENE-2901 change.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


RE: Solr - Multi Term highlighting issue

2011-04-20 Thread Ramanathapuram, Rajesh
Thanks Erick. 

I tried your suggestion, the issue still exists.

http://localhost:8983/searchsolr/mainCore/select?indent=on&version=2.2&q=mec+us+chile&fq=storyid%3DXXX%22&start=0&rows=10&fl=*&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=story%2C+slug&hl.fragsize=10&hl.highlightMultiTerm=true&hl.usePhraseHighlighter=true&hl.mergeContiguous=false

<lst name="params">
  <str name="hl.fragsize">10</str>
  <str name="explainOther"/>
  <str name="indent">on</str>
  <str name="hl.mergeContiguous">false</str>


... Corboba. (<em>MEC)</b></p><p></p><p><b>CHILE</em>/FOREST FIRES ...


thanks & regards,
Rajesh Ramana 


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, April 20, 2011 11:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr - Multi Term highlighting issue

Does your configuration have hl.mergeContiguous set to true by any chance? 
And what happens if you explicitly set this to false on your query?

Best
Erick

On Wed, Apr 20, 2011 at 9:43 AM, Ramanathapuram, Rajesh 
rajesh.ramanathapu...@turner.com wrote:
 [...]



Re: Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Robert Muir
No, this is only a bug in analysis.jsp.

you can see this by comparing analysis.jsp's "dontstems bees" with
the query debug interface:
<lst name="debug">
  <str name="rawquerystring">dontstems bees</str>
  <str name="querystring">dontstems bees</str>
  <str name="parsedquery">PhraseQuery(text:"dontstems bee")</str>
  <str name="parsedquery_toString">text:"dontstems bee"</str>

On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz demian.k...@villanova.edu 
 wrote:
 I've just started experimenting with the solr.KeywordMarkerFilterFactory in 
 Solr 3.1, and I'm seeing some strange behavior.  It seems that every word 
 subsequent to a protected word is also treated as being protected.

 You're right!  This was broken by LUCENE-2901 back in Jan.
 I've opened this issue:  https://issues.apache.org/jira/browse/LUCENE-3039

 The easiest short-term workaround for you would probably be to create
 a custom filter that looks like KeywordMarkerFilter before the
 LUCENE-2901 change.
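
A minimal sketch of what such a filter could look like (untested; written
against the Lucene 3.1 analysis API, and the class name is made up):

import java.io.IOException;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class PatchedKeywordMarkerFilter extends TokenFilter {
  private final KeywordAttribute keywordAttr = addAttribute(KeywordAttribute.class);
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final CharArraySet keywordSet;

  public PatchedKeywordMarkerFilter(TokenStream in, CharArraySet keywordSet) {
    super(in);
    this.keywordSet = keywordSet;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // assign the flag on every token (true or false), so a protected word
    // does not leave the keyword flag stuck on for the tokens that follow it
    keywordAttr.setKeyword(keywordSet.contains(termAtt.buffer(), 0, termAtt.length()));
    return true;
  }
}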

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco



RE: Bug in solr.KeywordMarkerFilterFactory?

2011-04-20 Thread Demian Katz
That's good news -- thanks for the help (not to mention the reassurance that 
Solr itself is actually working right)!  Hopefully 3.1.1 won't be too far off, 
though; when the analysis tool lies, life can get very confusing! :-)

- Demian

 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Wednesday, April 20, 2011 2:54 PM
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Subject: Re: Bug in solr.KeywordMarkerFilterFactory?
 
 No, this is only a bug in analysis.jsp.
 
  you can see this by comparing analysis.jsp's "dontstems bees" to using
  the query debug interface:
  <lst name="debug">
    <str name="rawquerystring">dontstems bees</str>
    <str name="querystring">dontstems bees</str>
    <str name="parsedquery">PhraseQuery(text:"dontstems bee")</str>
    <str name="parsedquery_toString">text:"dontstems bee"</str>
 
 On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
  On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz
 demian.k...@villanova.edu wrote:
  I've just started experimenting with the
 solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some
 strange behavior.  It seems that every word subsequent to a protected
 word is also treated as being protected.
 
  You're right!  This was broken by LUCENE-2901 back in Jan.
  I've opened this issue:
  https://issues.apache.org/jira/browse/LUCENE-3039
 
  The easiest short-term workaround for you would probably be to create
  a custom filter that looks like KeywordMarkerFilter before the
  LUCENE-2901 change.
 
  -Yonik
  http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
  25-26, San Francisco
 


Re: ConcurrentLRUCache$Stats error

2011-04-20 Thread Chris Hostetter

: https://issues.apache.org/jira/browse/SOLR-1797

that issue doesn't seem to have anything to do with the stack trace 
reported...

:  SEVERE: java.util.concurrent.ExecutionException:
:  java.lang.NoSuchMethodError:
:  org.apache.solr.common.util.ConcurrentLRUCache$Stats.add(Lorg/apache/solr/c
:  ommon/util/ConcurrentLRUCache$Stats;)V

NoSuchMethodError means that one compiled java class expects another 
compiled java class to have a method that it does not actually have -- 
this typically happens when you have inconsistent class files (or jars) in 
your classpath.

ie: you most likely have a mix of jars from two different versions of 
solr/lucene.

-Hoss


RE: stemming filter analyzers, any favorites?

2011-04-20 Thread Robert Petersen
I have been doing that, and for the 'Bags' example the trailing 's' is not being 
removed by the Kstemmer, so if you index the word 'bags' and search on 'bag' you 
get no matches.  Why wouldn't the trailing 's' get stemmed off?  Kstemmer is 
dictionary based, so 'bags' isn't in the dictionary?  That trailing 's' should 
always be dropped, no?  That seems like it would be better; we don't want to 
make synonyms for basic use cases like this.  I fear I will have to return to 
the Porter stemmer.  Are there other, better ones?  That is my main question.

Off-topic secondary question: sometimes I am puzzled by the output of the 
analysis page.  It seems like there should be a match, but I don't get the 
results during a search that I'd expect...  

For example, when the WordDelimiterFilterFactory splits a term into a bunch of 
terms before the K-stemmer is applied, the matching term can end up in position 
two of the final analysis, while the searcher had the partial term alone, and 
therefore in position one of the analysis stack; in that case the search didn't 
match.  Am I reading this correctly?  Should that match, or am I misreading my 
analysis output?  

Thanks!

Robi

PS  I have a category named Bags and am catching flack for it not coming up in 
a search for bag.  hah
PPS the term is not in protwords.txt


com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory 
{protected=protwords.txt}
term position   1
term text   bags
term type   word
source start,end0,4
payload 


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, April 20, 2011 10:55 AM
To: solr-user@lucene.apache.org
Subject: Re: stemming filter analyzers, any favorites?

You can get a better sense of exactly what transformations occur when
you look at the analysis page (be sure to check the verbose
checkbox).

I'm surprised that "bags" doesn't match "bag" -- what does the analysis
page say?

Best
Erick

On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen rober...@buy.com wrote:
 Stemming filter analyzers... anyone have any favorites for particular
 search domains?  Just wondering what people are using.  I'm using Lucid
 K Stemmer and having issues.   Seems like it misses a lot of common
 stems.  We went to that because of excessively loose matches on the
 solr.PorterStemFilterFactory


 I understand K Stemmer is a dictionary-based stemmer.  It seems to me like
 it is missing a lot of common stem reductions, i.e. 'Bags' does not match
 'Bag' in our searches.

 Here is my analyzer stack:

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             generateNumberParts="1"
             catenateWords="1"
             catenateNumbers="1"
             catenateAll="1"
             preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             generateNumberParts="1"
             catenateWords="1"
             catenateNumbers="1"
             catenateAll="1"
             preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>



entity name issue

2011-04-20 Thread tjtong
Hi guys,

I have encountered a problem with an entity name -- see the data config code
below. The variable '${ea.a_aid}' was always empty. I suspect it is a
namespace issue. Does anyone know how to work around it?

This is on an Oracle database. I had to use the myschema. prefix, otherwise
the table name was not recognized. The same setup worked on another database
without adding a prefix to the table names.
Thanks in advance!

 <entity name="e_a" query="select myschema.table_a.aid as id,
     myschema.table_a.aid as a_aid from myschema.table_a where
     '${dataimporter.request.clean}' != 'false' and
     myschema.table_a.aid > ${dataimporter.request.aid}">
   <entity name="e_b" query="select col as c_col from myschema.table_b where
       myschema.table_b.aid='${ea.a_aid}'"/>
 </entity>

--
View this message in context: 
http://lucene.472066.n3.nabble.com/entity-name-issue-tp2843812p2843812.html
Sent from the Solr - User mailing list archive at Nabble.com.


Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
Hi,
I am looking for the best way to find the terms with the highest frequency
for a given subset of documents (terms in the text field).
My first thought was to do a count facet search, where the query defines the
subset of documents and facet.field is the text field. This gives me the
result, but it is very, very slow.
These are my params:
<str name="facet">true</str>
<str name="facet.offset">0</str>
<str name="facet.mincount">3</str>
<str name="indent">on</str>
<str name="facet.limit">500</str>
<str name="facet.method">enum</str>
<str name="wt">xml</str>
<str name="rows">0</str>
<str name="version">2.2</str>
<str name="facet.sort">count</str>
<str name="q">in_subset:1</str>
<str name="facet.field">text</str>
</lst>

The index contains 7M documents, the subset is about 200K. A simple query
for the subset takes around 100ms, but the facet search takes 40s.

Am I doing something wrong?

If facet search is not the correct approach, I thought about using something
like org.apache.lucene.misc.HighFreqTerms, but I'm not sure how to do this
in Solr. Should I implement a request handler that executes this kind of
code?

thanks for any help


RE: Highest frequency terms for a subset of documents

2011-04-20 Thread Jonathan Rochkind
I think faceting is probably the best way to do that, indeed. It might be slow, 
but it's set up for exactly that case; I can't imagine any other technique being 
faster -- there's work that has to be done to look up the info you want. 

BUT, I see your problem: don't use facet.method=enum. Use facet.method=fc.  
It works a LOT better for very high-arity fields (lots and lots of unique values) 
like yours. I bet you'll see a significant speed-up if you use facet.method=fc 
instead, hopefully fast enough to be workable. 

With facet.method=enum I would indeed have predicted it would be horribly 
slow; before Solr 1.4, when facet.method=fc became available, it was nearly 
impossible to facet on very high-arity fields -- facet.method=fc is the magic. I 
think facet.method=fc is even the default in Solr 1.4+, unless you've 
explicitly set it to enum instead! 
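
i.e., the same request with just that one param changed (a sketch; all other
params as in your original query):

  ...&q=in_subset:1&rows=0&facet=true&facet.field=text&facet.method=fc&facet.limit=500&facet.mincount=3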

Jonathan

From: Ofer Fort [ofer...@gmail.com]
Sent: Wednesday, April 20, 2011 6:49 PM
To: solr-user@lucene.apache.org
Subject: Highest frequency terms for a subset of documents
Hi,
I am looking for the best way to find the terms with the highest frequency
for a given subset of documents. (terms in the text field)
My first thought was to do a count facet search , where the query defines
the subset of documents and the facet.field is the text field, this gives me
the result but it is very very slow.
These are my params:
 <str name="facet">true</str>
 <str name="facet.offset">0</str>
 <str name="facet.mincount">3</str>
 <str name="indent">on</str>
 <str name="facet.limit">500</str>
 <str name="facet.method">enum</str>
 <str name="wt">xml</str>
 <str name="rows">0</str>
 <str name="version">2.2</str>
 <str name="facet.sort">count</str>
 <str name="q">in_subset:1</str>
 <str name="facet.field">text</str>
 </lst>

The index contains 7M documents, the subset is about 200K. A simple query
for the subset takes around 100ms, but the facet search takes 40s.

Am i doing something wrong?

If facet search is not the correct approach, i thought about using something
like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
 in Solr. Should I implement a request handler that executes this kind of
code?

thanks for any help


Re: How to index MS SQL Server column with image type

2011-04-20 Thread Chris Hostetter

: Subject: How to index MS SQL Server column with image type
: 
: Hi all,
: 
: When I index a column (image type) of a table via
: http://localhost:8080/solr/dataimport?command=full-import
: there is an error like this: String length must be a multiple of four.

For future reference: full error messages (with stack traces) are the best 
way to get people to help you diagnose problems.

I think the crux of the issue is that DataImportHandler doesn't 
currently have any way of indexing raw binary data like images.

Under the covers, Solr can deal with pure binary fields, but there aren't 
a lot of good use cases I can think of for it -- particularly if you want 
to *index* those bytes...

: <field name="bs_attachment" type="binary" indexed="true" stored="true"/>

: ...can you please explain what your goal is?  What are you ultimately 
: hoping to do with that field?





-Hoss


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
Thanks, but that's what I started with; it took an even longer time and
threw this:
Approaching too many values for UnInvertedField faceting on field 'text' :
bucket size=15560140
Approaching too many values for UnInvertedField faceting on field 'text' :
bucket size=15619075
Exception during facet counts: org.apache.solr.common.SolrException: Too many
values for UnInvertedField faceting on field text


On Thu, Apr 21, 2011 at 2:11 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

 I think faceting is probably the best way to do that, indeed. It might be
 slow, but it's kind of set up for exactly that case, I can't imagine any
 other technique being faster -- there's stuff that has to be done to look up
 the info you want.

 BUT, I see your problem:  don't use facet.method=enum. Use facet.method=fc.
  Works a LOT better for very high arity fields (lots and lots of unique
 values) like you have. I bet you'll see significant speed-up if you use
 facet.method=fc instead, hopefully fast enough to be workable.

 With facet.method=enum, I would have indeed predicted it would be horribly
 slow, before solr 1.4 when facet.method=fc became available, it was nearly
 impossible to facet on very high arity fields, facet.method=fc is the magic.
 I think facet.method=fc is even the default in Solr 1.4+, if you hadn't
 explicitly set it to enum instead!

 Jonathan
 
 From: Ofer Fort [ofer...@gmail.com]
 Sent: Wednesday, April 20, 2011 6:49 PM
 To: solr-user@lucene.apache.org
 Subject: Highest frequency terms for a subset of documents
 Hi,
 I am looking for the best way to find the terms with the highest frequency
 for a given subset of documents. (terms in the text field)
 My first thought was to do a count facet search , where the query defines
 the subset of documents and the facet.field is the text field, this gives
 me
 the result but it is very very slow.
 These are my params:
 <str name="facet">true</str>
 <str name="facet.offset">0</str>
 <str name="facet.mincount">3</str>
 <str name="indent">on</str>
 <str name="facet.limit">500</str>
 <str name="facet.method">enum</str>
 <str name="wt">xml</str>
 <str name="rows">0</str>
 <str name="version">2.2</str>
 <str name="facet.sort">count</str>
 <str name="q">in_subset:1</str>
 <str name="facet.field">text</str>
 </lst>

 The index contains 7M documents, the subset is about 200K. A simple query
 for the subset takes around 100ms, but the facet search takes 40s.

 Am i doing something wrong?

 If facet search is not the correct approach, i thought about using
 something
 like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
 in Solr. Should I implement a request handler that executes this kind of
 code?

 thanks for any help



Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
It seems like facet search is not all that suited to a full-text field (
http://search.lucidimagination.com/search/document/178f1a82ff19070c/solr_severe_error_when_doing_a_faceted_search#16562790cda76197
)

Maybe I should go in another direction. I am thinking about the HighFreqTerms
approach, just not sure how to start.

On Thu, Apr 21, 2011 at 2:23 AM, Ofer Fort o...@tra.cx wrote:

 thanks, but that's what i started with, but it took an even longer time and
 threw this:
 Approaching too many values for UnInvertedField faceting on field 'text' :
 bucket size=15560140
 Approaching too many values for UnInvertedField faceting on field 'text :
 bucket size=15619075
 Exception during facet counts:org.apache.solr.common.SolrException: Too
 many values for UnInvertedField faceting on field text



 On Thu, Apr 21, 2011 at 2:11 AM, Jonathan Rochkind rochk...@jhu.eduwrote:

 I think faceting is probably the best way to do that, indeed. It might be
 slow, but it's kind of set up for exactly that case, I can't imagine any
 other technique being faster -- there's stuff that has to be done to look up
 the info you want.

 BUT, I see your problem:  don't use facet.method=enum. Use
 facet.method=fc.  Works a LOT better for very high arity fields (lots and
 lots of unique values) like you have. I bet you'll see significant speed-up
 if you use facet.method=fc instead, hopefully fast enough to be workable.

 With facet.method=enum, I would have indeed predicted it would be horribly
 slow, before solr 1.4 when facet.method=fc became available, it was nearly
 impossible to facet on very high arity fields, facet.method=fc is the magic.
 I think facet.method=fc is even the default in Solr 1.4+, if you hadn't
 explicitly set it to enum instead!

 Jonathan
 
 From: Ofer Fort [ofer...@gmail.com]
 Sent: Wednesday, April 20, 2011 6:49 PM
 To: solr-user@lucene.apache.org
 Subject: Highest frequency terms for a subset of documents
 Hi,
 I am looking for the best way to find the terms with the highest frequency
 for a given subset of documents. (terms in the text field)
 My first thought was to do a count facet search , where the query defines
 the subset of documents and the facet.field is the text field, this gives
 me
 the result but it is very very slow.
 These are my params:
  <str name="facet">true</str>
  <str name="facet.offset">0</str>
  <str name="facet.mincount">3</str>
  <str name="indent">on</str>
  <str name="facet.limit">500</str>
  <str name="facet.method">enum</str>
  <str name="wt">xml</str>
  <str name="rows">0</str>
  <str name="version">2.2</str>
  <str name="facet.sort">count</str>
  <str name="q">in_subset:1</str>
  <str name="facet.field">text</str>
  </lst>

 The index contains 7M documents, the subset is about 200K. A simple query
 for the subset takes around 100ms, but the facet search takes 40s.

 Am i doing something wrong?

 If facet search is not the correct approach, i thought about using
 something
 like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
 in Solr. Should I implement a request handler that executes this kind of
 code?

 thanks for any help





Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Chris Hostetter

: thanks, but that's what i started with, but it took an even longer time and
: threw this:
: Approaching too many values for UnInvertedField faceting on field 'text' :
: bucket size=15560140
: Approaching too many values for UnInvertedField faceting on field 'text :
: bucket size=15619075
: Exception during facet counts:org.apache.solr.common.SolrException: Too many
: values for UnInvertedField faceting on field text

right ... facet.method=fc is a good default, but cases like full-text 
faceting can cause it to seriously blow up the memory ... I didn't even 
realize it was possible to get it to fail this way; I would have just 
expected an OutOfMemoryError.

facet.method=enum is probably your best bet in this situation, precisely 
because it does a linear scan over the terms ... it's slower because it's 
safer.

the one speed-up you might be able to get is to ensure you don't use the 
filterCache -- that way you don't waste time constantly caching/overwriting 
DocSets

and FWIW...

:  If facet search is not the correct approach, i thought about using
:  something
:  like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
:  in solr. Should i implememt a request handler that executes this kind of

HighFreqTerms just looks at the raw docfreq for the terms, nearly 
identical to the TermsComponent -- there is no way to deal with your 
subset-of-documents requirement using an approach like that.

If the number of subsets you have to deal with is fixed, finite, and 
non-overlapping, using distinct cores for each subset (which you can 
aggregate using distributed search when you don't want this type of query) 
can also be a wise choice in many situations

(ie: if you have a books core and a movies core you can search both 
using distributed search, or hit the terms component on just one of them 
to get the top terms for that core)
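
For instance (a sketch, assuming the /terms handler from the example
solrconfig.xml and hypothetical cores named books and movies):

  http://localhost:8983/solr/books/terms?terms.fl=text&terms.limit=500
  http://localhost:8983/solr/books/select?q=*:*&shards=localhost:8983/solr/books,localhost:8983/solr/movies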

-Hoss


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Yonik Seeley
On Wed, Apr 20, 2011 at 7:34 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : thanks, but that's what i started with, but it took an even longer time and
 : threw this:
 : Approaching too many values for UnInvertedField faceting on field 'text' :
 : bucket size=15560140
 : Approaching too many values for UnInvertedField faceting on field 'text :
 : bucket size=15619075
 : Exception during facet counts:org.apache.solr.common.SolrException: Too many
 : values for UnInvertedField faceting on field text

 right ... facet.method=fc is a good default, but cases like full text
 faceting can cause it to seriously blow up the memory ... i didn't eve
 realize it was possible to get it to fail this way, i would have just
 expected an OutOfmemoryException.

 facet.method=enum is probably your best bet in this situation precisely
 because it does a linera scan over the terms ... it's slower because it's
 safer.

 the one speed up you might be able to get is to ensure you don't use the
 filterCache -- that way you don't wast time constantly caching/overwriting
 DocSets

Right - or only using filterCache for high df terms via
http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf
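
For example (a sketch; the 1000 cutoff is just illustrative):

  ...&facet=true&facet.field=text&facet.method=enum&facet.enum.cache.minDf=1000

Terms whose docfreq is below the cutoff then skip the filterCache instead of
constantly overwriting entries in it.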

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
Thanks,
but I've disabled the cache already, since my concern is speed and I'm
willing to pay the price (memory), and my subsets are not fixed.
Does the facet search do any extra work that I don't need, that I might be
able to disable (either by a flag or by a code change)?
Somehow I feel, or rather hope, that counting the terms of 200K documents
and finding the top 500 should take less than 30 seconds.


On Thu, Apr 21, 2011 at 2:41 AM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Wed, Apr 20, 2011 at 7:34 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:
 
  : thanks, but that's what i started with, but it took an even longer time
 and
  : threw this:
  : Approaching too many values for UnInvertedField faceting on field
 'text' :
  : bucket size=15560140
  : Approaching too many values for UnInvertedField faceting on field 'text
 :
  : bucket size=15619075
  : Exception during facet counts:org.apache.solr.common.SolrException: Too
 many
  : values for UnInvertedField faceting on field text
 
  right ... facet.method=fc is a good default, but cases like full text
  faceting can cause it to seriously blow up the memory ... i didn't eve
  realize it was possible to get it to fail this way, i would have just
  expected an OutOfmemoryException.
 
  facet.method=enum is probably your best bet in this situation precisely
  because it does a linera scan over the terms ... it's slower because it's
  safer.
 
  the one speed up you might be able to get is to ensure you don't use the
  filterCache -- that way you don't wast time constantly
 caching/overwriting
  DocSets

 Right - or only using filterCache for high df terms via
 http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco



Re: How to return score without using _val_

2011-04-20 Thread Yonik Seeley
On Tue, Apr 19, 2011 at 11:41 PM, Bill Bell billnb...@gmail.com wrote:
 I would like to influence the score but I would rather not mess with the q=
 field since I want the query to dismax for Q.

 Something like:

 fq={!type=dismax qf=$qqf v=$qspec}
 fq={!type=dismax qt=dismaxname v=$qname}
 q=_val_:{!type=dismax qf=$qqf  v=$qspec} _val_:{!type=dismax
 qt=dismaxname v=$qname}

 Is there a way to do a filter and add the FQ to the score by doing it
 another way?

 Also does this do multiple queries? Is this the right way to do it?

I really don't understand what you're trying to do...
Backing up: you say you want to influence the score, but I can't
figure out how you would like to influence it.

Do you want to:
 - add the score of another query to the main dismax query (use bq)
 - multiply the main dismax score by another query (use edismax along
with boost, or the boost query type)
 - something else?
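
For example (a sketch with made-up fields; bq adds an additive clause to the
dismax score, while the boost query type multiplies the wrapped query's score):

  q=ipod&defType=dismax&qf=text&bq=inStock:true^5
  q={!boost b=log(popularity)}ipod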

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
BTW,
i'm using solr 1.4.1, does 3.1 or 4.0 contain any performance improvements
that will make a difference as far as facet search?
thanks again
Ofer

On Thu, Apr 21, 2011 at 2:45 AM, Ofer Fort o...@tra.cx wrote:

 Thanks
 but i've disabled the cache already, since my concern is speed and i'm
 willing to pay the price (memory), and my subset are not fixed.
 Does the facet search do any extra work that i don't need, that i might be
 able to disable (either by a flag or by a code change),
 Somehow i feel, or rather hope, that counting the terms of 200K documents
 and finding the top 500 should take less than 30 seconds.



 On Thu, Apr 21, 2011 at 2:41 AM, Yonik Seeley 
 yo...@lucidimagination.comwrote:

 On Wed, Apr 20, 2011 at 7:34 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:
 
  : thanks, but that's what i started with, but it took an even longer
 time and
  : threw this:
  : Approaching too many values for UnInvertedField faceting on field
 'text' :
  : bucket size=15560140
  : Approaching too many values for UnInvertedField faceting on field
 'text :
  : bucket size=15619075
  : Exception during facet counts:org.apache.solr.common.SolrException:
 Too many
  : values for UnInvertedField faceting on field text
 
  right ... facet.method=fc is a good default, but cases like full text
  faceting can cause it to seriously blow up the memory ... i didn't eve
  realize it was possible to get it to fail this way, i would have just
  expected an OutOfmemoryException.
 
  facet.method=enum is probably your best bet in this situation precisely
  because it does a linera scan over the terms ... it's slower because
 it's
  safer.
 
  the one speed up you might be able to get is to ensure you don't use the
  filterCache -- that way you don't wast time constantly
 caching/overwriting
  DocSets

 Right - or only using filterCache for high df terms via
 http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco





Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Yonik Seeley
On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort o...@tra.cx wrote:
 Thanks
 but i've disabled the cache already, since my concern is speed and i'm
 willing to pay the price (memory)

Then you should not disable the cache.

, and my subset are not fixed.
 Does the facet search do any extra work that i don't need, that i might be
 able to disable (either by a flag or by a code change),
 Somehow i feel, or rather hope, that counting the terms of 200K documents
 and finding the top 500 should take less than 30 seconds.

Using facet.enum.cache.minDf should be a little faster than just
disabling the cache - it's a different code path.
Using the cache selectively will speed things up, so try setting that
minDf to 1000 or so for example.

How many unique terms do you have in the index?
Is this Solr 3.1? There were some optimizations for when there are many
terms to iterate over.
You could also try trunk, which has even more optimizations, or the
bulkpostings branch if you really want to experiment.

-Yonik


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
My documents are user entries, so I'm guessing they vary a lot.
Tomorrow I'll try 3.1 and also 4.0, and see if they bring an improvement.
Thanks, guys!

On Thu, Apr 21, 2011 at 3:02 AM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort o...@tra.cx wrote:
  Thanks
  but i've disabled the cache already, since my concern is speed and i'm
  willing to pay the price (memory)

 Then you should not disable the cache.

 , and my subset are not fixed.
  Does the facet search do any extra work that i don't need, that i might
 be
  able to disable (either by a flag or by a code change),
  Somehow i feel, or rather hope, that counting the terms of 200K documents
  and finding the top 500 should take less than 30 seconds.

 Using facet.enum.cache.minDf should be a little faster than just
 disabling the cache - it's a different code path.
 Using the cache selectively will speed things up, so try setting that
 minDf to 1000 or so for example.

 How many unique terms do you have in the index?
 Is this Solr 3.1 - there were some optimizations when there were many
 terms to iterate over?
 You could also try trunk, which has even more optimizations, or the
 bulkpostings branch if you really want to experiment.

 -Yonik



Solr - upgrade from 1.4.1 to 3.1 - finding AbstractSolrTestCase binaries - help please?

2011-04-20 Thread Bob Sandiford
HI, all.

I'm working on upgrading from 1.4.1 to 3.1, and I'm having some troubles with 
some of the unit test code for our custom Filters.  We wrote the tests to 
extend AbstractSolrTestCase, and I've been reading the thread about the 
test-harness elements not being present in the 3.1 distributables. [1]
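
(For context, the tests are of this general shape -- a sketch only; in 1.4, 
AbstractSolrTestCase requires pointing at a schema and config:

public class CustomFilterTest extends AbstractSolrTestCase {
  @Override
  public String getSchemaFile() { return "schema.xml"; }

  @Override
  public String getSolrConfigFile() { return "solrconfig.xml"; }

  public void testFilter() {
    assertU(adoc("id", "1", "text", "bags"));   // index a doc through the harness
    assertU(commit());
    assertQ(req("text:bag"), "//result[@numFound='1']");
  }
})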

So, I have checked out the 3.1 branch code and built it (ant 
generate-maven-artifacts), and I've found the 
lucene-test-framework-3.1-xxx.jar(s).  However, these contain only the 
Lucene-level framework elements, and none of the Solr ones.

Did the solr test framework actually get built and embedded in one of the solr 
jars somewhere?  Or, if not, is there some way to build a jar that contains the 
solr portion of the test harnesses?

[1] SOLR-2061: Generate jar containing test classes.
https://issues.apache.org/jira/browse/SOLR-2061
Thanks!

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com


RE: Creating a TrieDateField (and other Trie fields) from Lucene Java

2011-04-20 Thread Craig Stires

Hi Yonik,

The limitations I need to work within have to do with the index already
being built as part of an existing process.

Currently, the Solr server is in read-only mode and receives new indexes
daily from a Java application.  The Java app runs Lucene/Tika and is
indexing resources within the local network.  It builds off of a different
schema framework, then moves the finished indexes over to the Solr
deployment path.  The Solr server swaps over at that point.  The Solr server
isn't the only consumer of the indexes.  There are other Java apps which
read/write to the Lucene index during the staging process.

This was working without issues when the types used were part of Lucene core
(String, Boolean, Integer, etc), because they just resolved to Strings.
But the TrieDateField works off of byte data, so I needed to find a way to
create those fields using the existing classes.
Thanks,
-Craig
 


-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Wednesday, 20 April 2011 11:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Creating a TrieDateField (and other Trie fields) from Lucene
Java

On Tue, Apr 19, 2011 at 11:17 PM, Craig Stires craig.sti...@gmail.com
wrote:
 The barrier I have is that I need to build this offline (without using a
 solr server, solrconfig.xml, or schema.xml)

This is pretty unusual... can you share your use case?
Solr can also be run in embedded mode if you can't run a stand-alone
server for some reason.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco



The issue of import data from database using Solr DIH

2011-04-20 Thread Kevin Xiang
Hi all,
I am new to Solr, and I am importing data from a database using DIH (Solr
1.4). One document is made up of two entities; each entity is a table in the
database.
For example:
Table1: has 3 fields;
Table2: has 4 fields.
If it worked correctly, the document would have 7 fields.
But there are only 4 fields; it seems that Solr doesn't merge the fields, and
table2 overwrites table1.
The key is OS06Y.
The configuration of db-data-config.xml is the following:
<document name="allperf">
  <entity name="PerformanceData1"
          dataSource="getTrailingTotalReturnForMonthEnd1"
          query="SELECT PerformanceId,Trailing1MonthReturn,Trailing2MonthReturn,Trailing3MonthReturn, FROM Table1">
    <field column="PerformanceId" name="OS06Y" />
    <field column="Trailing1MonthReturn" name="PM004" />
    <field column="Trailing2MonthReturn" name="PM133" />
    <field column="Trailing3MonthReturn" name="PM006" />
  </entity>
  <entity name="PerformanceData2"
          dataSource="getTrailingTotalReturnForMonthEnd2"
          query="SELECT PerformanceId,Trailing10YearReturn,Trailing15YearReturn,TrailingYearToDateReturn,SinceInceptionReturn FROM Table2">
    <field column="PerformanceId" name="OS06Y" />
    <field column="Trailing10YearReturn" name="PM00I" />
    <field column="Trailing15YearReturn" name="PM00K" />
    <field column="TrailingYearToDateReturn" name="PM00A" />
    <field column="SinceInceptionReturn" name="PM00M" />
  </entity>
</document>
Has anyone come across this issue?
Any suggestions on how to fix this issue are much appreciated. 
Thanks.


Apache Spam Filter Blocking Messages

2011-04-20 Thread Trey Grainger
Hey (solr-user) mailing list admins,

I've tried replying to a thread multiple times tonight, and keep getting a
bounce-back with this response:
Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the recipient
domain. We recommend contacting the other email provider for further
information about the cause of this error. The error that the other server
returned was: 552 552 spam score (5.1) exceeded threshold
(FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
(state 18).

Apparently I sound like spam when I write perfectly good English and include
some xml and a link to a jira ticket in my e-mail (I tried a couple
different variations).  Anyone know a way around this filter, or should I
just respond to those involved in the e-mail chain directly and avoid the
mailing list?

Thanks,

-Trey


Re: Apache Spam Filter Blocking Messages

2011-04-20 Thread Marvin Humphrey
On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote:
 (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL

Note the HTML_MESSAGE in the list of things SpamAssassin didn't like.

 Apparently I sound like spam when I write perfectly good English and include
 some xml and a link to a jira ticket in my e-mail (I tried a couple
 different variations).  Anyone know a way around this filter, or should I
 just respond to those involved in the e-mail chain directly and avoid the
 mailing list?

Send plain text email instead of HTML.  That solves the problem 99% of the
time.

Marvin Humphrey



Need to create dynamic indexes based on different document workspaces

2011-04-20 Thread Gaurav Shingala
Hi,

Is there a way to create different Solr indexes for different categories? 
We have different document workspaces and ideally want each workspace to have 
its own Solr index.
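
(For what it's worth, a minimal multi-core sketch of that idea, assuming the
stock Solr multicore layout where each workspace gets its own core -- and
therefore its own index -- via solr.xml:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="workspace1" instanceDir="workspace1" />
    <core name="workspace2" instanceDir="workspace2" />
  </cores>
</solr>

Each instanceDir then holds its own conf/ and data/ directories.)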

Thanks,
Gaurav