Re: hot deploy of newer version of solr schema in production

2012-01-24 Thread Jan Høydahl
Hi,

To be able to do a true hot deploy of a newer schema without reindexing, you must 
carefully ensure that none of your changes are breaking changes. So you should 
test the process on your development machine and make sure it works. Adding and 
deleting fields would work, but not changing the field-type or analysis of an 
existing field. Depending on the from/to version, you may want to keep the old 
schema-version number.

The process is:
1. Deploy the new schema, including all dependencies such as dictionaries
2. Do a RELOAD CORE http://wiki.apache.org/solr/CoreAdmin#RELOAD
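   (A sketch of step 2, assuming the default port and a core named core0:
    curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0")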

My preference is to do a more thorough upgrade of schema including new 
functionality and breaking changes, and then do a full reindex. The exception 
is if my index is huge and the reason for Solr upgrade or schema change is to 
fix a bug, not to use new functionality.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 24. jan. 2012, at 01:51, roz dev wrote:

 Hi All,
 
 I need community's feedback about deploying newer versions of solr schema
 into production while existing (older) schema is in use by applications.
 
 How do people perform these things? What have people learned about this?
 
 Any thoughts are welcome.
 
 Thanks
 Saroj



Re: Highlighting stopwords

2012-01-24 Thread Koji Sekiguchi

(12/01/24 9:31), O. Klein wrote:

Let's say I search for spellcheck solr on a website that only contains
info about Solr, so solr was added to the stopwords.txt. The query that
will be parsed then (dismax) will not contain the term solr.

So fragments won't contain highlights of the term solr. So when a fragment
with the highlighted term spellcheck is generated, it would be less
confusing for people who don't know how search engines work to also
highlight the term solr.

So my first test was to have a field with StopFilterFactory and search on
that field, while using another field without StopFilterFactory to highlight
on. This didn't do the trick.


Are you saying that you use the hl.q parameter on the highlight field while
using q on the search field that has StopFilter, and hl.q doesn't work for you?
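(A sketch of what I mean, field names made up: a stopped field "text" for
matching and an unstopped copy "text_hl" for highlighting:

http://localhost:8983/solr/select?q=text:(spellcheck+solr)&hl=true&hl.fl=text_hl&hl.q=spellcheck+solr

hl.q carries the terms to highlight independently of q, so the stopped term
can still be highlighted.)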

koji
--
http://www.rondhuit.com/en/


Re: Size of index to use shard

2012-01-24 Thread Vadim Kisselmann
Hi,
it depends on your hardware.
Read this:
http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
Think about your cache-config (few updates, big caches) and a good
HW-infrastructure.
In my case I can handle a 250GB index with 100 mil. docs on an i7
machine with RAID10 and 24GB RAM => q-times under 1 sec.
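If you do later need to shard, distributed search is just a request
parameter, e.g. (hypothetical hosts):
http://host1:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr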
Regards
Vadim



2012/1/24 Anderson vasconcelos anderson.v...@gmail.com:
 Hi
 Is there some index size (or number of docs) at which it becomes necessary to
 break the index into shards?
 I have an index 100GB in size. This index grows by 10GB per year.
 (I don't have information on how many docs it has) and the docs will never
 be deleted.  Thinking 30 years ahead, the index will be 400GB
 in size.

 I think it is not required to break it into shards, because I don't consider
 this a large index. Am I correct? What is a real large
 index?


 Thanks


RE: Filtering search results by an external set of values

2012-01-24 Thread John, Phil (CSS)
Thanks for the responses.

Groups probably wouldn't work as, while there will be some overlap between 
customers, each will have a very different overall set of accessible resources.

I'll try the suggestion about simply reindexing, or using the no-cache option, 
and see how I get on.

Failing that, are there hooks for writing custom filter modules that use other 
parts of the records to decide whether to include them in a result set or 
not? In our use case, the documents represent articles, which have an issue 
field. Each customer has defined issues (or ranges of issues) that they have 
subscriptions to, so the upper bound for what to filter would probably be 
fairly small (10k - 20k issues/ranges). This could probably be used with the 
no-cache option you've pointed me to.

Best wishes,

Phil.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 23 January 2012 17:34
To: solr-user@lucene.apache.org
Subject: Re: Filtering search results by an external set of values

A second, but arguably quite expert option, is to use the no-cache option.
See: https://issues.apache.org/jira/browse/SOLR-2429

The idea here is that you can specify that a filter is expensive, and it will 
only be run after all the other filters etc. have been applied.
Furthermore,
it will not be cached, and only documents that pass through all the other 
filters will be matched against this filter. It has been specifically used for 
ACL calculations...
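
A sketch of the syntax from that issue (field name and value made up):

fq={!cache=false cost=100}acl_tokens:abc123

cache=false plus a cost of 100 or more is what marks the filter as a
post-filter, for query types that support it.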

That said, test exactly how painful storing auth tokens is. I can index, on a 
relatively underpowered laptop, 11M Wiki documents in 5 minutes or so. If your 
worst-case rights update takes 1/2 hour to re-index and it only happens once a 
month, why be complex?

And groups, as Jan says, often make even this unnecessary.

Best
Erick

On Mon, Jan 23, 2012 at 5:16 AM, Jan Høydahl jan@cominvent.com wrote:
 Hi,

 Do you have any kind of group membership for your users? If you have, 
 a resource's list of security access tokens could be smaller, and you avoid 
 re-indexing most resources when adding normal users, which mostly 
 belong to groups. The common way is to add filters on the query. You 
 may do it yourself or have some framework/plugin do it for you, see 
 http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security

 --
 Jan Høydahl, search solution architect Cominvent AS - 
 www.cominvent.com Solr Training - www.solrtraining.com

 On 23. jan. 2012, at 11:49, John, Phil (CSS) wrote:

 Hi,



 We're building quite a large shared index of resources, using Solr. 
 The application that makes use of these resources is a multitenant 
 one (i.e., many customers using the same index). For resources that 
 are private to a customer, it's fairly easy to tag a document with 
 their customer ID and using a FilterQuery to limit results to just 
 their stuff.
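
 (Roughly, as a sketch with made-up names: a multiValued customer_id field in
 the schema, and fq=customer_id:cust42 on every request.)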



 We are soon going to be adding a large number (many tens of millions) 
 of records that will be shared amongst customers. Not all customers 
 will have access to the same shared resources, e.g.:



 *  Shared resource 1:
     o  Customer 1
     o  Customer 3

 *  Shared resource 2:
     o  Customer 2
     o  Customer 1



 The issue is, what is the best way to model this in Solr? Should we 
 have multiple customer_id fields on each record, and then use the 
 filter query as with private resources, or is there a better way of doing 
 it?
 What happens if we need to do a bulk change - i.e. adding new 
 customer, or a previous customer has a large change in what shared 
 resources they have access to? Am I right in thinking that we'd need 
 to go through every shared resource, read it, make the required 
 change, and reindex it?



 I'm wondering if there's a way, instead of updating these resources 
 directly, that I could construct a set of documents that would act as a 
 query-time filter for which shared resources to return?



 Kind regards,



 Phil John

 Technical Lead, Capita Software Services

 Knights Court, Solihull Parkway

 Birmingham Business Park B37 7YB

 Office: 0870 400 5000

 Fax: 0870 400 5001
 email: philj...@capita.co.uk

 Part of Capita plc - www.capita.co.uk








Re: ExtractionHandler/Cell ignore just 2 fields defined in schema 3.5.0

2012-01-24 Thread Wayne W
Ah perfect - thank you Jan so much. :-)


On Tue, Jan 24, 2012 at 11:14 AM, Jan Høydahl jan@cominvent.com wrote:
 Hi,

 It's because lowernames=true by default in solrconfig.xml, and it will 
 convert any "-" into "_" in field names. So try adding the request parameter 
 lowernames=false, or change the default in solrconfig.xml. Alternatively, 
 leave it as is but name your fields project_id and company_id :)

 http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters
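
 A sketch against your log below (only lowernames=false is new, the literals
 are yours):

 /update/extract?lowernames=false&literal.project-id=36&literal.company-id=8&literal.title=hotel+surfers.pdf&...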

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com

 On 23. jan. 2012, at 22:26, Wayne W wrote:

 Hi,

 I've been trying to figure this out now for a few days and I'm just not
 getting anywhere, so any pointers would be MOST welcome. I'm in the
 process of upgrading from 1.3 to the latest and greatest version of
 Solr and I'm getting there slowly. However I have this (final) problem
 that when sending a document for extraction, 2 of my fields defined in
 my schema are ignored. When I don't use extraction the fields
 are populated fine (I can see them via Luke).

 My schema has:
 <field name="uid" type="string" stored="true"/>
 <field name="type" type="string" stored="true"/>
 <field name="id" indexed="false" type="long" stored="true"/>
 <field name="project-id" type="long" stored="true"/>
 <field name="company-id" type="long" stored="true"/>
 <field name="importTimestamp" type="long" stored="true"/>
 <field name="label" type="text_ws" indexed="true"
 stored="true" multiValued="true" omitNorms="true"/>
 <field name="text" type="text" indexed="true" stored="true"
 multiValued="true"/>
 <field name="title" type="text" indexed="true" stored="true"
 multiValued="true"/>
 <field name="date" type="date" indexed="true" stored="true"
 multiValued="true"/>


 My request:
 INFO: [] webapp=/solr path=/update/extract
 params={literal.company-id=8&literal.uid=hub.app.model.Document#203657&literal.date=2012-01-23T21:10:42Z&literal.id=203657&literal.type=hub.app.model.Document&idx.attr=true&literal.label=&literal.title=hotel+surfers.pdf&def.fl=text&literal.project-id=36}
 status=0 QTime=3579
 Jan 24, 2012 8:10:58 AM org.apache.solr.update.DirectUpdateHandler2 commit


 For unknown reasons the fields 'company-id', and 'project-id' are ignored.

 any ideas?
 many thanks
 Wayne



Re: Filtering search results by an external set of values

2012-01-24 Thread Mikhail Khludnev
Phil,

Some time ago I posted my thoughts about a similar problem. Scroll to
part II.

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201201.mbox/%3CCANGii8egoB1_rXFfwJMheyxx72v48B_DA-6KteKOymiBrR=m...@mail.gmail.com%3E

Regards


Re: Highlighting stopwords

2012-01-24 Thread O. Klein
Ah, I never used the hl.q

That did the trick. Thanx!



solr stopwords issue - documents are not matching

2012-01-24 Thread Ankita Patil
Hi,

I am using Solr 3.4. The relevant part of my schema looks like:

<fieldType name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">

<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_en.txt" enablePositionIncrements="true"/>

<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>

</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_en.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>

<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>

</analyzer>
</fieldType>

stopwords_en.txt contains :
a
an
and
are
as

etc..

Now when I search for *buy house* Solr does not return the documents
with the text *buy a house*.
Also when I search for *buy a house* Solr does not return the
documents with the text *buy house*.

A part of the debugQuery output is:
<str name="rawquerystring">cContent:"buy a house"</str>
<str name="querystring">cContent:"buy a house"</str>
<str name="parsedquery">PhraseQuery(cContent:"bui ? hous")</str>
<str name="parsedquery_toString">cContent:"bui ? hous"</str>

Any idea how I can solve this problem, or what is wrong?

Thanks
Ankita
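
(The "?" in the parsed PhraseQuery above is the position gap that
StopFilterFactory leaves behind when enablePositionIncrements="true", which is
why the two phrasings stop matching each other. One hedged workaround, not
verified here, is to allow phrase slop so the gap can be bridged, e.g.
cContent:"buy a house"~2.)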


highlighter not supporting surround parser

2012-01-24 Thread manyutomar
I want to perform span queries using the surround parser and show the
results with the highlighter, but the problem is the highlighter is not working
properly with the surround query parser. Are there any plugins or updates
available to do this?





Re: index-time over boosted

2012-01-24 Thread remi tassing
Any idea?

This is a snippet of my schema.xml now:

<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
...
   <!-- fields for index-basic plugin -->
<field name="host" type="url" stored="false" indexed="true"/>
<field name="site" type="string" stored="false" indexed="true"/>
<field name="url" type="url" stored="true" indexed="true"
required="true"/>
<field name="content" type="text" stored="true" indexed="true"
omitNorms="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="tstamp" type="long" stored="true" indexed="false"/>
   <!-- fields for index-anchor plugin -->
<field name="anchor" type="string" stored="true" indexed="true"
multiValued="true"/>

...
   <!-- uncomment the following to ignore any fields that don't already
match an existing
field name or dynamic field, rather than reporting them as an
error.
alternately, change the type="ignored" to some other type e.g.
"text" if you want
unknown fields indexed and/or stored by default -->
   <!--dynamicField name="*" type="ignored" multiValued="true" /-->

 </fields>

 <!-- Field to use to determine and enforce document uniqueness.
  Unless this field is marked with required="false", it will be a
required field
   -->
 <uniqueKey>id</uniqueKey>

 <!-- field for the QueryParser to use when an explicit fieldname is absent
...

</schema>


Remi

On Sun, Jan 22, 2012 at 6:31 PM, remi tassing tassingr...@gmail.com wrote:

 Hi,

 I got it wrong in the beginning by putting omitNorms in the query URL.

 Now following your advice, I merged the schema.xml from Nutch and Solr and
 made sure omitNorms was set to true for the content, just as you said.

 Unfortunately the problem remains :-(


 On Thursday, January 19, 2012, Jan Høydahl jan@cominvent.com wrote:
  Hi,
 
  The schema you pasted in your mail is NOT Solr3.5's default example
 schema. Did you get it from the Nutch project?
 
  And the omitNorms parameter is supposed to go in the field tag in
 schema.xml, and the content field in the example schema does not have
 omitNorms=true. Try to change
 
    <field name="content" type="text" stored="false" indexed="true"/>
  to
    <field name="content" type="text" stored="false" indexed="true"
  omitNorms="true"/>
 
  and try again. Please note that you SHOULD customize your schema, there
 is really no default schema in Solr (or Nutch), it's only an example or
 starting point. For your search application to work well you will have to
 invest some time in designing a schema, working with your queries, perhaps
 exploring DisMax query parser etc etc.
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
  Solr Training - www.solrtraining.com
 
  On 19. jan. 2012, at 13:01, remi tassing wrote:
 
  Hello Jan,
 
  My schema wasn't changed from the release 3.5.0. The content can be seen
  below:
 
  <schema name="nutch" version="1.1">
  <types>
  <fieldType name="string" class="solr.StrField"
  sortMissingLast="true" omitNorms="true"/>
  <fieldType name="long" class="solr.LongField"
  omitNorms="true"/>
  <fieldType name="float" class="solr.FloatField"
  omitNorms="true"/>
  <fieldType name="text" class="solr.TextField"
  positionIncrementGap="100">
  <analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory"
  ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory"
  generateWordParts="1" generateNumberParts="1"
  catenateWords="1" catenateNumbers="1" catenateAll="0"
  splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory"
  protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  </fieldType>
  <fieldType name="url" class="solr.TextField"
  positionIncrementGap="100">
  <analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
  generateWordParts="1" generateNumberParts="1"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  </fieldType>
  </types>
  <fields>
  <field name="id" type="string" stored="true" indexed="true"/>

  <!-- core fields -->
  <field name="segment" type="string" stored="true"
  indexed="false"/>
  <field name="digest" type="string" stored="true"
  indexed="false"/>
  <field name="boost" type="float" stored="true" indexed="false"/>

  <!-- fields for index-basic plugin -->
  <field name="host" type="url" stored="false" indexed="true"/>
  <field name="site" type="string" stored="false" indexed="true"/>
  f



Re: Size of index to use shard

2012-01-24 Thread Dmitry Kan
Hi,

The article you linked mentions 13GB of index size. That is quite a small index
from our perspective. We have noticed that at least Solr 3.4 has some sort
of choking point with respect to growing index size. It just becomes
substantially slower than what we need (a query on avg taking more than 3-4
seconds) once the index size crosses a magic level (about 80GB following our
practical observations). We try to keep our indices at around 60-70GB for
fast searches and above 100GB for slow ones. We also route the majority of user
queries to the fast indices. Yes, caching may help, but we cannot necessarily
afford adding more RAM for bigger indices. BTW, our documents are very
small, thus in a 100GB index we can have around 200 mil. documents. It would
be interesting to see how you manage to ensure q-times under 1 sec with an
index of 250GB. How many documents / facets do you ask for max. at a time? FYI,
we ask for a thousand facets in one go.

Regards,
Dmitry




Re: Advanced stopword handling edismax

2012-01-24 Thread O. Klein

O. Klein wrote
 
 As I understand it with edismax in trunk, whenever you have a query that
 only contains stopwords then all the terms are required.
 
 But when I try this I only get an empty parsedQuery like: (+() () () () ()
 () () () () () ()
 FunctionQuery((1.0/(3.16E-11*float(ms(const(132710400),date(date_dt)))+1.0))^50.0))/no_coord
 
 Am I misunderstanding this feature? Or is something going wrong?
 

Can someone at least confirm that, when using edismax and a query like "to be
or not to be" (with the English stopword list), the parsed query is empty?



Re: Size of index to use shard

2012-01-24 Thread Anderson vasconcelos
Apparently it is not so easy to determine when to break the content into
pieces. I'll investigate further the number of documents, the
size of each document, and what kind of search is being used. It seems
I will have to do a load test to identify the cutoff point at which to begin
using the shard strategy.

Thanks





Re: index-time over boosted

2012-01-24 Thread Jan Høydahl
That looks right. Can you restart your Solr, do a new search with 
debugQuery=true and copy/paste the full EXPLAIN output for your query?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com


RE: Highlighting more than 1 term

2012-01-24 Thread Tim Hibbs
Nitin and any others who may have followed this item,

I resolved the issue, but I'm not exactly sure of the originating cause.
I had changed the field types of my text fields to text_en and then
re-indexed. Changing to text_en kept highlighting from happening on
more than one term in the fields for which I desired highlighting. Note
that I used the stock fieldtype definitions supplied with Solr.

Once I changed the field type back to text and re-indexed again,
highlighting multiple terms in the same field worked again.

Thanks,
Tim Hibbs

-Original Message-
From: csscouter [mailto:tim.hi...@verizon.net] 
Sent: Thursday, January 19, 2012 9:54 AM
To: solr-user@lucene.apache.org
Subject: RE: Highlighting more than 1 term

Nitin (and any other interested parties here):

Unfortunately, re-indexing the content did not resolve the problem and
the symptom remains the same. Any additional advice is appreciated.

Tim



Re: index-time over boosted

2012-01-24 Thread remi tassing
Hello,

thanks for helping out Jan, I really appreciate that!

These are full explains of two results:

Result#1.--

3.0412199E-5 = (MATCH) max of:
  3.0412199E-5 = (MATCH) weight(content:mobil broadband^0.5 in
19081), product of:
0.13921623 = queryWeight(content:mobil broadband^0.5), product of:
  0.5 = boost
  6.3531075 = idf(content: mobil=5270 broadband=2392)
  0.043826185 = queryNorm
2.1845297E-4 = fieldWeight(content:mobil broadband in 19081), product of:
  3.6055512 = tf(phraseFreq=13.0)
  6.3531075 = idf(content: mobil=5270 broadband=2392)
  9.536743E-6 = fieldNorm(field=content, doc=19081)

Result#2.-

2.6991445E-5 = (MATCH) max of:
  2.6991445E-5 = (MATCH) weight(content:mobil broadband^0.5 in
15306), product of:
0.13921623 = queryWeight(content:mobil broadband^0.5), product of:
  0.5 = boost
  6.3531075 = idf(content: mobil=5270 broadband=2392)
  0.043826185 = queryNorm
1.9388145E-4 = fieldWeight(content:mobil broadband in 15306), product of:
  1.0 = tf(phraseFreq=1.0)
  6.3531075 = idf(content: mobil=5270 broadband=2392)
  3.0517578E-5 = fieldNorm(field=content, doc=15306)

Remi



full import is not working and still not showing any errors

2012-01-24 Thread scabra4
hi all, can anyone help me with this please?
I am trying to do a full import, and I've done everything correctly. Now when I
try the full import, an XML page displays showing the following, and it stays
like this no matter how often I refresh the page:
This XML file does not appear to have any style information associated with
it. The document tree is shown below.
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">C:\solr\conf\data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
<str name="Time Elapsed">0:5:8.925</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">0</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-01-24 16:29:31</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to
change in the future.</str>
</response>



Not getting the expected search results

2012-01-24 Thread m0rt0n
Hello,

I am a newbie in this Solr world and I am surprised because I try to
do searches, both with the browser interface and by using a Java client, and
the expected results do not appear.

The issue is:

1) I have set up an entity called via in my data-config.xml with 5 fields.
I do the full-import and it indexes 1.5M records:

<entity name="via" query="select TVIA, NVIAC, CMUM, CVIA, CPRO from
INE_VIAS">
<field column="TVIA" name="TVIA" />
<field column="NVIAC" name="NVIAC" />
<field column="CMUM" name="CMUM" />
<field column="CVIA" name="CVIA" />
<field column="CPRO" name="CPRO" />
</entity>

2) These 5 fields are mapped in the schema.xml, this way:
   <field name="TVIA" type="text_general" indexed="true" stored="true" />
   <field name="NVIAC" type="text_general" indexed="true" stored="true" />
   <field name="CMUM" type="text_general" indexed="true" stored="true" />
   <field name="CVIA" type="string" indexed="true" stored="true" />
   <field name="CPRO" type="int" indexed="true" stored="true" />

3) I try to do a search for Alcala street in Madrid:
NVIAC:ALCALA AND CPRO:28 AND CMUM:079

But it gets just two results (neither of them the desired one):
<doc><str name="CMUM">079</str><int name="CPRO">28</int><str
name="CVIA">45363</str><str name="NVIAC">ALCALA
GAZULES</str><str name="TVIA">CALLE</str></doc>
<doc><str name="CMUM">079</str><int name="CPRO">28</int><str
name="CVIA">08116</str><str name="NVIAC">ALCALA
GUADAIRA</str><str name="TVIA">CALLE</str></doc>

4) When I do the indexing with a narrower entity query:

<entity name="via" query="select TVIA, NVIAC, CMUM, CVIA, CPRO from INE_VIAS
WHERE NVIAC LIKE '%ALCALA%'">

The full import indexes 913 documents and I do the same search, but this time I
get the desired result:

<doc><str name="CMUM">079</str><int name="CPRO">28</int><str
name="CVIA">00132</str><str name="NVIAC">ALCALA</str><str
name="TVIA">CALLE</str></doc>

Can anyone help me with that? I don't know why it does not work as expected
when I do the full-import of the whole lot of streets.

Thanks a lot in advance.




Re: Limiting term frequency in a document to a specific term

2012-01-24 Thread solr user
With the Solr search relevancy functions, I get a ParseException: unknown
function ttf in FunctionQuery.

http://localhost:8983/solr/select/?fl=score,documentPageId&defType=func&q=ttf(contents,amplifiers)

where contents is a field name, and amplifiers is text in the field.

Just curious why I get a parse exception for the above syntax.




On Monday, January 23, 2012, Ahmet Arslan iori...@yahoo.com wrote:
 Below is an example query to search for the term frequency
 in a document,
 but it is returning the frequency for all the terms.

 http://localhost:8983/solr/select/?fl=documentPageId&q=documentPageId:49667.3&qt=tvrh&tv.tf=true&tv.fl=contents

 I would like to be able to limit the query to just one term
 that I know
 occurs in the document.

 I don't fully follow but http://wiki.apache.org/solr/FunctionQuery#tf may
be what you want?
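
One likely cause of the ParseException: the relevance functions such as
ttf()/tf()/idf() only exist on trunk (4.0-dev) at this point, so a 3.x
release will not know the function name. On trunk, something like this
should parse (a sketch using the field and term from your mail; see the
FunctionQuery wiki for the exact argument quoting):

http://localhost:8983/solr/select/?fl=score,documentPageId&defType=func&q=ttf(contents,'amplifiers')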



analyzing stored fields (removing HTML tags)

2012-01-24 Thread Robert Stewart
Is it possible to configure schema to remove HTML tags from stored
field content?  As far as I can tell analyzers can only be applied to
indexed content, but they don't affect stored content.  I want to
remove HTML tags from text fields so that returned text content from
stored field has no HTML tags.

Thanks
Bob


Re: index-time over boosted

2012-01-24 Thread Jan Høydahl
Hi,

Well, I think you are doing it right, but are getting tricked by either editing 
the wrong file, a typo, or browser caching.
Why not start with a fresh Solr 3.5.0: start the example app, index all 
exampledocs, and search for Podcasts; you get one hit, in fields text and 
features.
Then change solr/example/solr/conf/schema.xml and add omitNorms=true to these 
two fields. Then stop Solr, delete your index, start Solr, re-index the docs 
and try again. fieldNorm is now 1.0. Once you get that working you can start 
debugging where you got it wrong in your own setup.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com


Re: Solr Java client API

2012-01-24 Thread Erick Erickson
It would really help to see the relevant parts of the code
you're using to see what you've tried. You might want to
review:
http://wiki.apache.org/solr/UsingMailingLists
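
In the meantime, if the intent is filtering on either gender while still
faceting on the field, the usual approach is an OR filter; a minimal untested
sketch with the field names from your mail:

   SolrQuery query = new SolrQuery("*:*");
   query.setFacet(true);
   query.addFacetField("gender", "state");
   query.addFilterQuery("gender:(male OR female)");
   query.addFilterQuery("state:DC");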

Best
Erick

On Mon, Jan 23, 2012 at 2:45 PM, jingjung Ng jingjun...@gmail.com wrote:
 Hi,

 I implemented the facet using

 query.addFacetQuery
 query.addFilterQuery

 to facet on:

 gender:male
 state:DC

 This works fine. How can I facet on multi-values using Solrj API, like
 following:

 gender:male
 gender:female
 state:DC


 I've tried, but it returns 0. Can anyone help?

 Thanks,

 -jingjung ng


Re: analyzing stored fields (removing HTML tags)

2012-01-24 Thread darul
You probably may use a Sanitizer as we do here.

http://stackoverflow.com/questions/1947021/libs-for-html-sanitizing
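
For example, with jsoup (one of the libraries discussed there) you can strip
the tags before sending the field to Solr; a minimal sketch:

   import org.jsoup.Jsoup;
   // parse the HTML and keep only the text content
   String plainText = Jsoup.parse(htmlValue).text();

Analyzers (including HTMLStripCharFilter) only affect the indexed terms, so
stripping has to happen before the stored value reaches Solr.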





Re: Hierarchical faceting in UI

2012-01-24 Thread Yuhao
Darren,

One challenge for me is that a term can appear in multiple places of the 
hierarchy.  So it's not safe to simply use the term as it appears to get its 
children; I probably need to include the entire tree path up to this term.  For 
example, if the hierarchy is Cardiovascular Diseases > Arteriosclerosis > 
Coronary Artery Disease, and I'm getting the children of the middle term 
Arteriosclerosis, I need to filter on something like parent:"Cardiovascular 
Diseases/Arteriosclerosis".
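
(As a sketch, with a made-up field name, that could be a string field

   <field name="parent_path" type="string" indexed="true" stored="true"/>

holding the full path, and a filter such as

   fq=parent_path:"Cardiovascular Diseases/Arteriosclerosis"

so the same term under different branches stays unambiguous.)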

I'm having trouble figuring out how I can get the complete path per above to 
add to the URL of each facet term.  I know velocity/facet_field.vm is where I 
build the URL.  I know how to simply add a parent:term filter to the URL.  
But I don't know how to access a document field, like the complete parent path, 
in facet_field.vm.  Any help would be great.

Yuhao





 From: dar...@ontrenet.com dar...@ontrenet.com
To: Yuhao nfsvi...@yahoo.com 
Cc: solr-user@lucene.apache.org 
Sent: Monday, January 23, 2012 7:16 PM
Subject: Re: Hierarchical faceting in UI
 

On Mon, 23 Jan 2012 14:33:00 -0800 (PST), Yuhao nfsvi...@yahoo.com
wrote:
 Programmatically, something like this might work: for each facet field,
 add another hidden field that identifies its parent.  Then, program
 additional logic in the UI to show only the facet terms at the currently
 selected level.  For example, if one filters on cat:electronics, the
new
 UI logic would apply the additional filter cat_parent:electronics. 
Can
 this be done?  

Yes. This is how I do it.

 Would it be a lot of work?  
No. It's not a lot of work: simply represent your hierarchy as parent/child
relations in the document fields, and in your UI drill down by issuing new
faceted searches. Use the current facet (tree level) as the parent:level
in the next query. It's much easier than other suggestions for this.

 Is there a better way?
Not in my opinion, there isn't. This is the simplest to implement and
understand.

 
 By the way, Flamenco (another faceted browser) has built-in support for
 hierarchies, and it has worked well for my data in this aspect (but less
 well than Solr in others).  I'm looking for the same kind of
hierarchical
 UI feature in Solr.

Re: java.net.SocketException: Too many open files

2012-01-24 Thread Michael Kuhlmann

Hi Jonty,

no, not really. When we first had such problems, we really thought that 
the number of open files is the problem, so we implemented an algorithm 
that performed an optimize from time to time to force a segment merge. 
Due to some misconfiguration, this ran too often, with the result that 
an optimize was issued before the previous optimization was finished, 
which is a really bad idea.


We removed the optimization calls, and since then we didn't have any 
more problems.


However, I never found out the initial reason for the exception. Maybe 
there was some bug in Solr's 3.1 version - we're using 3.5 right now - 
but I couldn't find a hint in the changelog.


At least we didn't have this exception for more than two months now, so 
I'm hoping that the cause for this has disappeared somehow.


Sorry that I can't help you more.

Greetings,
Kuli

On 24.01.2012 07:48, Jonty Rhods wrote:

Hi Kuli,

Did you get the solution of this problem? I am still facing this problem.
Please help me to overcome this problem.

regards


On Wed, Oct 26, 2011 at 1:16 PM, Michael Kuhlmannk...@solarier.de  wrote:


Hi;

we have a similar problem here. We already raised the file ulimit on the
server to 4096, but this only deferred the problem. We get a
TooManyOpenFilesException every few months.

The problem has nothing to do with real files. When we had the last
TooManyOpenFilesException, we investigated with netstat -a and saw that
there were about 3900 open sockets in Jetty.

Curiously, we only have one SolrServer instance per Solr client, and we
only have three clients (our running web servers).

We have set defaultMaxConnectionsPerHost to 20 and maxTotalConnections
to 100. There should be room enough.

Sorry that I can't help you, we still have not solved the problem on
our own.

Greetings,
Kuli

Am 25.10.2011 22:03, schrieb Jonty Rhods:

Hi,

I am using solrj, and for the connection to the server I am using an instance
of the solr server:

SolrServer server = new CommonsHttpSolrServer(
"http://localhost:8080/solr/core0");

I noticed that after a few minutes it starts throwing the exception
java.net.SocketException: Too many open files.
It seems to be related to the instances of HttpClient. How do I limit the
instances to a certain number, like a connection pool in dbcp etc.?

I am not experienced in Java, so please help me resolve this problem.

  solr version: 3.4

regards
Jonty
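
(For the connection-pool question above: the usual SolrJ 3.x pattern is one
shared CommonsHttpSolrServer for the whole application, backed by a
multi-threaded connection manager. A minimal untested sketch, limits
illustrative:

   import org.apache.commons.httpclient.HttpClient;
   import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
   import org.apache.solr.client.solrj.SolrServer;
   import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

   MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
   mgr.getParams().setDefaultMaxConnectionsPerHost(20);  // per-host cap
   mgr.getParams().setMaxTotalConnections(100);          // global cap
   // create once and reuse everywhere; the constructor declares MalformedURLException
   SolrServer server = new CommonsHttpSolrServer(
       "http://localhost:8080/solr/core0", new HttpClient(mgr));

Creating a new server instance per request is what usually leaks sockets and
ends in Too many open files.)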










Re: java.net.SocketException: Too many open files

2012-01-24 Thread Sethi, Parampreet
Hi Jonty,

You can try changing the maximum number of files opened by a process using
command:

ulimit -n XXX

If the number of open files is not increasing with time but is just a
constant number larger than the system default limit, this should fix
it.
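
For example (illustrative numbers; run in the shell that starts the JVM):

   ulimit -n 8192
   lsof -p <solr-pid> | wc -l    # watch how many descriptors/sockets are held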

-param








using per-core properties in dih config

2012-01-24 Thread Robert Stewart
I have a multi-core setup, and for each core I have a shared
data-config.xml which specifies a SQL query for data import.  What I
want to do is have the same data-config.xml file shared between my
cores (linked to the same physical file). I'd like to specify core
properties in solr.xml such that each core loads a different set of
data from SQL.  So my query might look like this:

query=select * from index_values where mod(index_id,${NUM_CORES})=${CORE_ID}

So I want to have NUM_CORES and CORE_ID specified as properties in
solr.xml, something like:

<solr ...>
  <cores ...>
     <property name="NUM_CORES" value="3"/>
     <core name="index0" ...>
        <property name="CORE_ID" value="0"/>
     </core>
     <core name="index1" ...>
        <property name="CORE_ID" value="1"/>
     </core>
     <core name="index2" ...>
        <property name="CORE_ID" value="2"/>
     </core>
  </cores>

</solr>

So my question is, is this possible, and if so what is exact syntax to
make it work?

Thanks,
Bob


Re: Size of index to use shard

2012-01-24 Thread Erick Erickson
Talking about index size can be very misleading. Take
a look at http://lucene.apache.org/java/3_5_0/fileformats.html#file-names.
Note that the *.fdt and *.fdx files are used for stored fields, i.e.
the verbatim copy of data put in the index when you specify
stored=true. These files have virtually no impact on search
speed.

So, if your *.fdx and *.fdt files are 90G out of a 100G index
it is a much different thing than if these files are 10G out of
a 100G index.

And this doesn't even mention the peculiarities of your query mix.
Nor does it say a thing about whether your cheapest alternative
is to add more memory.

Anderson's method is about the only reliable one, you just have
to test with your index and real queries. At some point, you'll
find your tipping point, typically when you come under memory
pressure. And it's a balancing act between how much memory
you allocate to the JVM and how much you leave for the op
system.

Bottom line: No hard and fast numbers. And you should periodically
re-test the empirical numbers you *do* arrive at...
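For what it's worth, a bare-bones way to collect those empirical numbers with
SolrJ (a sketch only - the URL and the query strings are placeholders to be
replaced with real production traffic):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class QueryTimer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Replace with a representative sample of real queries.
        String[] queries = { "title:foo", "body:bar" };

        List<Integer> qtimes = new ArrayList<Integer>();
        for (int run = 0; run < 100; run++) {
            for (String q : queries) {
                // getQTime() is the server-side search time in milliseconds.
                qtimes.add(server.query(new SolrQuery(q)).getQTime());
            }
        }
        Collections.sort(qtimes);
        System.out.println("median=" + qtimes.get(qtimes.size() / 2) + "ms"
                + " p95=" + qtimes.get((int) (qtimes.size() * 0.95)) + "ms");
    }
}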

Best
Erick

On Tue, Jan 24, 2012 at 5:31 AM, Anderson vasconcelos
anderson.v...@gmail.com wrote:
 Apparently, it is not so easy to determine when to break the content into
 pieces. I'll investigate further the amount of documents, the
 size of each document, and what kind of search is being used. It seems
 I will have to do a load test to identify the cutoff point at which to begin
 using the strategy of shards.

 Thanks

 2012/1/24, Dmitry Kan dmitry@gmail.com:
 Hi,

 The article you gave mentions 13GB of index size. That is quite a small index
 from our perspective. We have noticed that at least solr 3.4 has some sort
 of choking point with respect to growing index size. It just becomes
 substantially slower than what we need (a query on avg taking more than 3-4
 seconds) once the index size crosses a magic level (about 80GB following our
 practical observations). We try to keep our indices at around 60-70GB for
 fast searches and above 100GB for slow ones. We also route the majority of
 user queries to fast indices. Yes, caching may help, but we cannot
 necessarily afford adding more RAM for bigger indices. BTW, our documents are
 very small, thus in a 100GB index we can have around 200 mil. documents. It
 would be interesting to see how you manage to ensure q-times under 1 sec with
 an index of 250GB? How many documents / facets do you ask for at most at a
 time? FYI, we ask for a thousand facets in one go.

 Regards,
 Dmitry

 On Tue, Jan 24, 2012 at 10:30 AM, Vadim Kisselmann 
 v.kisselm...@googlemail.com wrote:

 [...]




Re: Limiting term frequency in a document to a specific term

2012-01-24 Thread Erick Erickson
At a guess, you're using 3.x and the relevance functions are only
on trunk (4.0).
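
For reference, on trunk the function-query request below could be issued from
SolrJ roughly like this (a sketch; the field and term are the ones from the
question):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TtfQuery {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // defType=func parses q as a function query; ttf() exists on trunk (4.0) only.
        SolrQuery q = new SolrQuery("ttf(contents,amplifiers)");
        q.set("defType", "func");
        q.set("fl", "score,documentPageId");
        System.out.println(server.query(q).getResults().getNumFound());
    }
}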

Best
Erick

On Tue, Jan 24, 2012 at 7:49 AM, solr user mvidaat...@gmail.com wrote:
 With the Solr search relevancy functions, I get a ParseException: unknown
 function ttf in FunctionQuery.

 http://localhost:8983/solr/select/?fl=score,documentPageId&defType=func&q=ttf(contents,amplifiers)

 where contents is a field name, and amplifiers is text that occurs in that field.

 Just curious why I get a parse exception for the above syntax.




 On Monday, January 23, 2012, Ahmet Arslan iori...@yahoo.com wrote:
 Below is an example query to search for the term frequency in a document,
 but it is returning the frequency for all the terms.

 http://localhost:8983/solr/select/?fl=documentPageId&q=documentPageId:49667.3&qt=tvrh&tv.tf=true&tv.fl=contents

 I would like to be able to limit the query to just one term
 that I know
 occurs in the document.

 I don't fully follow but http://wiki.apache.org/solr/FunctionQuery#tf may
 be what you want?



phrase auto-complete with suggester component

2012-01-24 Thread Tommy Chheng
I'm testing out the various auto-complete functionalities on the
wikipedia dataset.

I first tried the facet.prefix and found it slow at times. I'm now
looking at the Suggester component. Given a query like new york, I
would like to get results like New York or New York City.

When I tried using the suggest component, it suggests entries for each
word rather than the phrase (even if I add quotes). How can I change my
config to get title matches and not have the query broken into each
word?

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="new">
      <int name="numFound">5</int>
      <int name="startOffset">0</int>
      <int name="endOffset">3</int>
      <arr name="suggestion">
        <str>newt</str>
        <str>newwy patitta</str>
        <str>newyddion</str>
        <str>newyorker</str>
        <str>newyork–presbyterian hospital</str>
      </arr>
    </lst>
    <lst name="york">
      <int name="numFound">5</int>
      <int name="startOffset">4</int>
      <int name="endOffset">8</int>
      <arr name="suggestion">
        <str>york</str>
        <str>york–dauphin (septa station)</str>
        <str>york—humber</str>
        <str>york—scarborough</str>
        <str>york—simcoe</str>
      </arr>
    </lst>
    <str name="collation">newt york</str>
  </lst>
</lst>

/solr/suggest?q=new%20york&omitHeader=true&spellcheck.count=5&spellcheck.collate=true
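
For what it's worth, the same request from SolrJ (a sketch, assuming the
/suggest handler below and that it is reachable through the qt parameter):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SuggestDemo {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("new york");
        q.set("qt", "/suggest");
        q.set("spellcheck.count", "5");
        q.set("spellcheck.collate", "true");

        QueryResponse rsp = server.query(q);
        SpellCheckResponse spell = rsp.getSpellCheckResponse();
        // One Suggestion per analyzed token -- "new" and "york" separately,
        // which is exactly the per-word behaviour described above.
        for (SpellCheckResponse.Suggestion s : spell.getSuggestions()) {
            System.out.println(s.getToken() + " -> " + s.getAlternatives());
        }
        System.out.println("collation: " + spell.getCollatedResult());
    }
}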

solrconfig.xml:
  <searchComponent name="suggest" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
      <str name="field">title_autocomplete</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

  <requestHandler name="/suggest"
      class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.count">10</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

schema.xml:
  <fieldType name="text_auto" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="title_autocomplete" type="text_auto" indexed="true"
      stored="false" multiValued="false" />


-- 
Tommy Chheng


Re: Hierarchical faceting in UI

2012-01-24 Thread Darren Govoni

Yuhao,
Ok, let me think about this. A term can have multiple parents. Each of 
those parents would be 'different', yes?
In this case, use a multivalued field for the parent and add all the parent 
names or id's to it. The relations should be unique.

Your UI will associate the correct parent id to build the facet query from and 
return the correct children because the user
is descending down a specific path in the UI and the parent node unique id's 
are returned along the way.
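
For instance, indexing a term that way might look like this in SolrJ (a
sketch; the field names and values are made up for illustration):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexTermWithParents {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // One document per term; "parent" is multivalued and carries every
        // parent path the term appears under.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "term-42");
        doc.addField("term", "Coronary Artery Disease");
        doc.addField("parent", "Cardiovascular Diseases");
        doc.addField("parent", "Cardiovascular Diseases/Arteriosclerosis");

        server.add(doc);
        server.commit();
    }
}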

Now, if you have parent names/id's that can themselves appear in multiple 
locations (vs. just the leaf terms),
then perhaps your hierarchy needs refactoring for redundancy?

Happy to help with more details.

Darren


On 01/24/2012 11:22 AM, Yuhao wrote:

Darren,

One challenge for me is that a term can appear in multiple places of the hierarchy.  So it's not safe to 
simply use the term as it appears to get its children; I probably need to include the entire tree path up 
to this term.  For example, if the hierarchy is Cardiovascular Diseases > Arteriosclerosis > 
Coronary Artery Disease, and I'm getting the children of the middle term Arteriosclerosis, I need to 
filter on something like parent:"Cardiovascular Diseases/Arteriosclerosis".

I'm having trouble figuring out how I can get the complete path per above to add to the URL of each facet term.  I 
know velocity/facet_field.vm is where I build the URL.  I know how to simply add a 
parent:term filter to the URL.  But I don't know how to access a document field, like the 
complete parent path, in facet_field.vm.  Any help would be great.

Yuhao





  From: dar...@ontrenet.comdar...@ontrenet.com
To: Yuhaonfsvi...@yahoo.com
Cc: solr-user@lucene.apache.org
Sent: Monday, January 23, 2012 7:16 PM
Subject: Re: Hierarchical faceting in UI


On Mon, 23 Jan 2012 14:33:00 -0800 (PST), Yuhaonfsvi...@yahoo.com
wrote:

Programmatically, something like this might work: for each facet field,
add another hidden field that identifies its parent.  Then, program
additional logic in the UI to show only the facet terms at the currently
selected level.  For example, if one filters on cat:electronics, the new
UI logic would apply the additional filter cat_parent:electronics.

Can this be done?

Yes. This is how I do it.


Would it be a lot of work?

No. It's not a lot of work; simply represent your hierarchy as parent/child
relations in the document fields and in your UI drill down by issuing new
faceted searches. Use the current facet (tree level) as the parent:level
in the next query. Its much easier than other suggestions for this.


Is there a better way?

Not in my opinion, there isn't. This is the simplest to implement and
understand.


By the way, Flamenco (another faceted browser) has built-in support for
hierarchies, and it has worked well for my data in this aspect (but less
well than Solr in others).  I'm looking for the same kind of

hierarchical

UI feature in Solr.




SolrCell maximum file size

2012-01-24 Thread Augusto Camarotti
Hi everybody
 
Does anyone know if there is a maximum file size that can be uploaded to the 
extractingrequesthandler via an http request?
 
Thanks in advance,
 
Augusto Camarotti


HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Mike Hugo
We recently updated to the latest build of Solr4 and everything is working
really well so far!  There is one case that is not working the same way it
was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
registered, for example) in a field as defined below - it was working in
Solr3.4 with the configuration shown here, but is not working the same way
in Solr4.

The label field is defined as type="text_general"
<field name="label" type="text_general" indexed="true" stored="false"
    required="false" multiValued="true"/>

Here's the type definition for the text_general field:
<fieldType name="text_general" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


In Solr 3.4, that configuration was completely stripping html constructs
out of the indexed field which is exactly what we wanted.  If for example,
we then do a facet on the label field, like in the test below, we're
getting some terms in the response that we would not like to be there.


// test case (groovy)
void specialHtmlConstructsGetStripped() {
    SolrInputDocument inputDocument = new SolrInputDocument()
    inputDocument.addField('label', 'Bose&#174; &#8482;')

    solrServer.add(inputDocument)
    solrServer.commit()

    QueryResponse response = solrServer.query(new SolrQuery('bose'))
    assert 1 == response.results.numFound

    SolrQuery facetQuery = new SolrQuery('bose')
    facetQuery.facet = true
    facetQuery.set(FacetParams.FACET_FIELD, 'label')
    facetQuery.set(FacetParams.FACET_MINCOUNT, '1')

    response = solrServer.query(facetQuery)
    FacetField ff = response.facetFields.find { it.name == 'label' }

    List suggestResponse = []

    for (FacetField.Count facetField in ff?.values) {
        suggestResponse << facetField.name
    }

    assert suggestResponse == ['bose']
}

With the upgrade to Solr4, the assertion fails, the suggested response
contains 174 and 8482 as terms.  Test output is:

Assertion failed:

assert suggestResponse == ['bose']
   |   |
   |   false
   [174, 8482, bose]


I just tried again using the latest build from today, namely:
https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
getting the failing assertion. Is there a different way to configure the
HTMLStripCharFilterFactory in Solr4?

Thanks in advance for any tips!

Mike


Re: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Yonik Seeley
You can use LegacyHTMLStripCharFilterFactory to get the previous behavior.
See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.

-Yonik
http://www.lucidimagination.com



On Tue, Jan 24, 2012 at 1:34 PM, Mike Hugo m...@piragua.com wrote:
 [...]


Re: Hierarchical faceting in UI

2012-01-24 Thread Yuhao
Hi Darren.  You said: 


"Your UI will associate the correct parent id to build the facet query"

This is the part I'm having trouble figuring out how to accomplish, and some 
guidance would help. How would I get the value of the parent to build the facet 
query in the UI, if the value is in another document field?  I was imagining 
that I would add an additional filter of parent:"parent path" to the fq 
URL parameter.  But I don't have a way to do it yet.

Perhaps seeing some data would help.  Here is a record in old (flattened) and 
new (parent-enabled) versions, both in JSON format:

OLD:
    {
        ID : 3816,
        Gene Symbol : KLK1,
        Alternate Names : hCG_22931;Klk6;hK1;KLKR,
        Description : Kallikrein 1, a peptidase that cleaves kininogen, 
functions in glucose homeostasis, heart contraction, semen liquefaction, and 
vasoconstriction, aberrantly expressed in pancreatitis and endometrial cancer; 
gene polymorphism correlates with kidney failure (BKL),
        GAD_Positive_Disease_Associations : [Mental Disorders(MESH:D001523) > Dementia, Vascular(MESH:D015140), Cardiovascular Diseases(MESH:D002318) > Coronary Artery Disease(MESH:D003324)],
        HuGENet_GeneProspector_Associations : [atherosclerosis, HDL],
    }



NEW:
    {
        ID : 3816,
        Gene Symbol : KLK1,
        Alternate Names : hCG_22931;Klk6;hK1;KLKR,
        Description : Kallikrein 1, a peptidase that cleaves kininogen, 
functions in glucose homeostasis, heart contraction, semen liquefaction, and 
vasoconstriction, aberrantly expressed in pancreatitis and endometrial cancer; 
gene polymorphism correlates with kidney failure (BKL),
        GAD_Positive_Disease_Associations : [Dementia, 
Vascular(MESH:D015140), Coronary Artery Disease(MESH:D003324)],
        GAD_Positive_Disease_Associations_parent : [Mental 
Disorders(MESH:D001523), Cardiovascular Diseases(MESH:D002318)],
        HuGENet_GeneProspector_Associations : [atherosclerosis, HDL],
    }

In the old version, the field GAD_Positive_Disease_Associations had 2 levels 
of hierarchy that were flattened.  It had the full path of the hierarchy 
leading to the current term.  In the new version, the field only has the 
current term.  A separate field called 
GAD_Positive_Disease_Associations_parent has the full path preceding the 
current term.

So, let's say in the UI, I click on the term Dementia, Vascular(MESH:D015140) 
to get its child terms and data.  My filters in the URL querystring would be 
exactly:

fq=GAD_Positive_Disease_Associations:"Dementia, Vascular(MESH:D015140)"&fq=GAD_Positive_Disease_Associations_parent:"Mental Disorders(MESH:D001523)"
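
(For illustration, a SolrJ sketch of the drill-down query I am trying to
build, using the field names above - the values are quoted as phrases so the
spaces and punctuation survive; the open question is still where the parent
value comes from:)

import org.apache.solr.client.solrj.SolrQuery;

public class DrillDown {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("GAD_Positive_Disease_Associations");
        q.addFilterQuery("GAD_Positive_Disease_Associations:"
                + "\"Dementia, Vascular(MESH:D015140)\"");
        q.addFilterQuery("GAD_Positive_Disease_Associations_parent:"
                + "\"Mental Disorders(MESH:D001523)\"");
        return q;
    }
}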

My question is, how to get the parent value of Mental Disorders(MESH:D001523) 
to build that querystring?

Thanks!

Yuhao





 From: Darren Govoni dar...@ontrenet.com
To: solr-user@lucene.apache.org 
Sent: Tuesday, January 24, 2012 1:23 PM
Subject: Re: Hierarchical faceting in UI
 
[...]

Re: Size of index to use shard

2012-01-24 Thread Anderson vasconcelos
Thanks for the explanation Erick :)

2012/1/24, Erick Erickson erickerick...@gmail.com:
 [...]




Re: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Mike Hugo
Thanks for the response Yonik,
Interestingly enough, changing to the LegacyHTMLStripCharFilterFactory
does NOT solve the problem - in fact I get the same result

I can see that the LegacyHTMLStripCharFilterFactory is being applied at
startup:

Jan 24, 2012 1:25:29 PM org.apache.solr.util.plugin.AbstractPluginLoader
load
INFO: created : org.apache.solr.analysis.LegacyHTMLStripCharFilterFactory

however, I'm still getting the same assertion error.  Any thoughts?

Mike


On Tue, Jan 24, 2012 at 12:40 PM, Yonik Seeley
yo...@lucidimagination.com wrote:

 You can use LegacyHTMLStripCharFilterFactory to get the previous behavior.
 See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.

 -Yonik
 http://www.lucidimagination.com



 On Tue, Jan 24, 2012 at 1:34 PM, Mike Hugo m...@piragua.com wrote:
  [...]



Re: phrase auto-complete with suggester component

2012-01-24 Thread O. Klein
You might wanna read
http://lucene.472066.n3.nabble.com/suggester-issues-td3262718.html#a3264740
which contains the solution to your problem.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/phrase-auto-complete-with-suggester-component-tp3685572p3685730.html
Sent from the Solr - User mailing list archive at Nabble.com.


Indexing failover and replication

2012-01-24 Thread Anderson vasconcelos
Hi
I'm doing now a test with replication using solr 1.4.1. I configured
two servers (server1 and server 2) as master/slave to sincronized
both. I put apache on the front side, and we index sometime in server1
and sometime  in server2.

I realized that the both index servers are now confused. In solr data
folder, was created many index folders with the timestamp of
syncronization (Exemple: index.20120124041340) with some segments
inside.

I thought that was possible to index in two master server and than
synchronized both using replication. It's really possible do this with
replication mechanism? If is possible, what I have done wrong?

I need to have more than one node for indexing to guarantee failover
feature for indexing. MultiMaster is the best way to guarantee
failover feature for indexing?

Thanks


Re: phrase auto-complete with suggester component

2012-01-24 Thread Tommy Chheng
Thanks, I'll try out the custom class file. Any possibility this
class could be merged into Solr? It seems like expected behavior.


On Tue, Jan 24, 2012 at 11:29 AM, O. Klein kl...@octoweb.nl wrote:
 [...]



-- 
Tommy Chheng


RE: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Steven A Rowe
Hi Mike,

When I add the following test to TestHTMLStripCharFilterFactory.java on Solr 
trunk, it passes:
  
public void testNumericCharacterEntities() throws Exception {
  final String text = "Bose&#174; &#8482;";  // |Bose® ™|
  HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
  htmlStripFactory.init(Collections.<String,String>emptyMap());
  CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
  StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
  stdTokFactory.init(DEFAULT_VERSION_PARAM);
  Tokenizer stream = stdTokFactory.create(charStream);
  assertTokenStreamContents(stream, new String[] { "Bose" });
}

What's happening:

First, htmlStripFactory converts &#174; to ® and &#8482; to ™.  Then
stdTokFactory declines to tokenize ® and ™, because they belong to the
Unicode general category Symbol, Other, and so are not included in any of the
output tokens.

StandardTokenizer uses the Word Break rules from UAX#29
(http://unicode.org/reports/tr29/) to find token boundaries, and then outputs
only alphanumeric tokens.  See the JFlex grammar for details:
http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup.
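
As a quick standalone sanity check (not part of the test above),
java.lang.Character reports both characters as Symbol, Other:

public class CategoryCheck {
    public static void main(String[] args) {
        // U+00AE REGISTERED SIGN and U+2122 TRADE MARK SIGN are both
        // general category So (Symbol, Other), which StandardTokenizer drops.
        System.out.println(Character.getType('\u00AE') == Character.OTHER_SYMBOL); // true
        System.out.println(Character.getType('\u2122') == Character.OTHER_SYMBOL); // true
    }
}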

The behavior you're seeing is not consistent with the above test.

Steve

 -Original Message-
 From: Mike Hugo [mailto:m...@piragua.com]
 Sent: Tuesday, January 24, 2012 1:34 PM
 To: solr-user@lucene.apache.org
 Subject: HTMLStripCharFilterFactory not working in Solr4?
 
  [...]


RE: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Michael Ryan
Try putting the HTMLStripCharFilterFactory before the StandardTokenizerFactory 
instead of after it. I vaguely recall being burned by something like this 
before.

-Michael


Re: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Yonik Seeley
Oops, I didn't read carefully enough to see that you wanted those constructs
entirely stripped out.

Given that you're seeing numbers indexed, this strongly indicates an
escaping bug in the SolrJ client that must have been introduced at
some point.
I'll see if I can reproduce it in a unit test.


-Yonik
http://www.lucidimagination.com


Re: dismax: limiting term match to one field

2012-01-24 Thread astubbs
This seems like a real shame. As soon as you search across more than one
field, the mm setting becomes nearly useless.



Re: Size of index to use shard

2012-01-24 Thread Vadim Kisselmann
@Erick
thanks:)
I'm with you on that opinion.
My load tests show the same.

@Dmitry
my docs are small too, I think about 3-15KB per doc.
I update my index all the time and I have an average of 20-50 requests
per minute (20% facet queries, 80% large boolean queries with
wildcard/fuzzy). How many docs at a time? It depends on the chosen
filters, from 10 to all 100 million.
I work with very small caches (strangely, if my index is under
100GB I need larger caches; over 100GB, smaller caches..)
My JVM has 6GB, 18GB is left for I/O.
With few updates a day I would configure very big caches, like Tim
Burton (see HathiTrust's blog).

Regards Vadim



2012/1/24 Anderson vasconcelos anderson.v...@gmail.com:
 Thanks for the explanation Erick :)

 2012/1/24, Erick Erickson erickerick...@gmail.com:
 [...]





Re: HTMLStripCharFilterFactory not working in Solr4?

2012-01-24 Thread Mike Hugo
Thanks for the responses everyone.

Steve, the test method you provided also works for me.  However, when I try
a more end to end test with the HTMLStripCharFilterFactory configured for a
field I am still having the same problem.  I attached a failing unit test
and configuration to the following issue in JIRA:

https://issues.apache.org/jira/browse/LUCENE-3721

I appreciate all the prompt responses!  Looking forward to finding the root
cause of this guy :)  If there's something I'm doing incorrectly in the
configuration, please let me know!

Mike

On Tue, Jan 24, 2012 at 1:57 PM, Steven A Rowe sar...@syr.edu wrote:

 [...]

Fw: Problem with SplitBy in Solr 3.4

2012-01-24 Thread Sumit Sen



- Forwarded Message -
From: Sumit Sen sumitse...@yahoo.com
To: Solr List solr-user@lucene.apache.org 
Sent: Tuesday, January 24, 2012 3:53 PM
Subject: Problem with SplitBy in Solr 3.4


Hi All:

I have a very silly problem. I am using Solr 3.4. I have a data import handler 
for indexing which is not splitting a field's data by '|' in spite of the 
following setup.

    <document>
  <entity dataSource="ds-1" name="associate" pk="id"
transformer="RegexTransformer"
query="Select case  when EMPLID != ' ' then EMPLID END as ID   ,
   case  when FIRST_NAME   != ' ' then FIRST_NAME  END as firstName,
   case  when MIDDLE_NAME  != ' ' then MIDDLE_NAME END as middleName,
   case  when LAST_NAME   != ' ' then LAST_NAME END as familyName,
   case  when FORMER_NAME != ' ' then FORMER_NAME END as middleName,
   case  when EMAIL_ADDRESS  != ' ' then EMAIL_ADDRESS END as businessEmail,
   case  when CITY != ' ' then CITY END as homeCity,
   case  when STATE != ' ' then STATE END  as homeCState,
   case  when ZIP != ' ' then ZIP END as homeZip,
   case  when COUNTRY_ISO  != ' ' then COUNTRY_ISO END as homeCountry,
   case  when WORK_PHONE  != ' ' then WORK_PHONE END as businessTel,
   (select xlatlongname
   from xlattable
  where fieldname = 'PER_STATUS' and fieldvalue = t1.per_status
   and language_cd = 'ENG') as PER_STATUS,
   case when ORIG_HIRE_DT IS NOT NULL then ORIG_HIRE_DT END as hireDate,
   (select xlatlongname
   from xlattable
  where fieldname = 'SEX' and fieldvalue = t1.sex
   and language_cd = 'ENG') as sex,
   (select xlatlongname
   from xlattable
  where fieldname = 'ETHNIC_GROUP' and fieldvalue = t1.ethnic_group
   and language_cd = 'ENG')   as ethnicityCode,
   case when CITZNS_CNTRY_ISO != ' ' then  CITZNS_CNTRY_ISO END  as 
citizenship,
   (select xlatlongname
   from xlattable
  where fieldname = 'MAR_STATUS' and fieldvalue = t1.mar_status
   and language_cd = 'ENG')   as marritalStatus,
   case when PREFERRED_LANGUAGE  != ' ' then PREFERRED_LANGUAGE END as 
primaryLanguageCode,
   case when BUSINESS_TITLE != ' ' then BUSINESS_TITLE END as businessTitle,
   case when TITLE != ' ' then TITLE END as title,
   case when JOBCODE != ' ' then JOBCODE END    ,
   (select xlatlongname
   from xlattable
  where fieldname = 'EMPL_STATUS' and fieldvalue = t1.empl_status
   and language_cd = 'ENG')   as workLevelStatus,
   case when LOCATION  != ' ' then LOCATION END ,
   case when CITY_EMPL  != ' ' then CITY_EMPL END   ,
   case when STATE_EMPL  != ' ' then STATE_EMPL END    ,
   case when  COUNTRY_2CHAR  != ' ' then COUNTRY_2CHAR END  ,
   case when ZIP_INTL != ' ' then ZIP_INTL END ,
  (select xlatlongname
   from xlattable
  where fieldname = 'EMPL_TYPE' and fieldvalue = t1.empl_type
   and language_cd = 'ENG')   as employmenttype,
   case when HOME_DEPARTMENT != ' ' then HOME_DEPARTMENT END   as 
DEPARTMENT,
   (Select case when name != ' ' then name end from ps_personal_data where 
employee_oid = t1.REPORTS_TO_AOID) as reportsTo,
   case when t1.ROLE_CODE1 != ' ' then   t1.ROLE_CODE1 end ||'|'||
   case when t1.ROLE_CODE2 != ' ' then  t1.ROLE_CODE2 end  ||'|'|| 
   case when t1.ROLE_CODE3 != ' ' then t1.ROLE_CODE3  end   ||'|'|| 
   case when t1.EE_ROLE_CODE1 != ' ' then t1.EE_ROLE_CODE1 end ||'|'||
   case when t1.EE_ROLE_CODE2 != ' ' then t1.EE_ROLE_CODE2 end ||'|'||
   case when t1.EE_ROLE_CODE3 != ' ' then t1.EE_ROLE_CODE3 end ||'|'|| 
   case when t1.EE_ROLE_CODE4 != ' ' then t1.EE_ROLE_CODE4 end ||'|'||
   case when t1.EE_ROLE_CODE5 != ' ' then t1.EE_ROLE_CODE5 end ||'|'||
   case when t1.EE_ROLE_CODE6 != ' ' then t1.EE_ROLE_CODE6 end as roleCode
   From PS_BOD_EE_VW t1 where t1.per_status = 'A'">
   <field column="id" />
   <field column="title" />
   <field column="firstName" />
   <field column="middleName" />
   <field column="familyName" />
   <field column="maidenName" />
   <field column="primaryLanguageCode" />
...
    ...
   <field column="education" />
   <field column="roleCode" splitBy="\|" name="roleCode" />
  <field column="applicationDate" />
...
...
   <field column="securityLevel" />
   </entity>
    </document>
</dataConfig>

In schema.xml I have

   <field name="id" type="string" indexed="true" stored="true" required="true" />
   <field name="title" type="string" indexed="true" stored="true" required="false" />
   <field name="firstName" type="string" indexed="true" stored="true" required="false" />
   <field name="middleName" type="string" indexed="true" stored="true" required="false" />
   <field name="familyName" type="string" indexed="true" stored="true" required="false" />
   <field name="maidenName" type="string" indexed="true" stored="true" required="false" />
   <field name="sex" type="string" indexed="true" stored="true" required="false" />
   <field 

Re: Do Hignlighting + proximity using surround query parser

2012-01-24 Thread Scott Stults
I got this working the way you describe it (in the getHighlightQuery()
method). The span queries were tripping it up, so I extracted the query
terms and created a DisMax query from them. There'll be a loss of accuracy
in the highlighting, but in my case that's better than no highlighting.

Should I just go ahead and submit a patch to SOLR-2703?


On Tue, Jan 10, 2012 at 9:35 AM, Ahmet Arslan iori...@yahoo.com wrote:

  I am not able to do highlighting with surround query parser
  on the returned
  results.
  I have tried the highlighting component but it does not
  return highlighted
  results.

 Highlighter does not recognize Surround Query. It must be re-written to
 enable highlighting in o.a.s.search.QParser#getHighlightQuery() method.

 Not sure this functionality should be added in SOLR-2703 or a separate
 jira issue.




-- 
Scott Stults | Founder  Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Solr 3.5.0 can't find Carrot classes

2012-01-24 Thread Christopher J. Bottaro
On Tuesday, January 24, 2012 at 3:07 PM, Christopher J. Bottaro wrote:
 SEVERE: java.lang.NoClassDefFoundError: org/carrot2/core/ControllerFactory
 at 
 org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.init(CarrotClusteringEngine.java:102)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown 
 Source)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
 Source)
 at java.lang.reflect.Constructor.newInstance(Unknown Source)
 at java.lang.Class.newInstance0(Unknown Source)
 at java.lang.Class.newInstance(Unknown Source)
  
 …
  
 I'm starting Solr with -Dsolr.clustering.enabled=true and I can see that the 
 Carrot jars in contrib are getting loaded.
  
 Full log file is here:  http://onespot-development.s3.amazonaws.com/solr.log  
  
 Any ideas?  Thanks for the help.
  
Ok, got a little further.  Seems that Solr doesn't like it if you include jars 
more than once (I had a lib dir and also lib directives in the solrconfig 
which ended up loading the same jars twice).

But now I'm getting these errors:  java.lang.NoClassDefFoundError: 
org/apache/solr/handler/clustering/SearchClusteringEngine

Any help?  Thanks. 

Re: Do Hignlighting + proximity using surround query parser

2012-01-24 Thread Ahmet Arslan
 I got this working the way you describe it (in the getHighlightQuery()
 method). The span queries were tripping it up, so I extracted the query
 terms and created a DisMax query from them. There'll be a loss of accuracy
 in the highlighting, but in my case that's better than no highlighting.
 
 Should I just go ahead and submit a patch to SOLR-2703?

I think a separate jira ticket would be more appropriate. 

By the way, o.a.l.search.Query#rewrite(IndexReader reader) should do the trick. 

/**
 * Highlighter does not recognize SurroundQuery.
 * It must be rewritten in its most primitive form to enable highlighting.
 */
@Override
public Query getHighlightQuery() throws ParseException {

  Query rewrittenQuery;

  try {
    rewrittenQuery = getQuery().rewrite(getReq().getSearcher().getIndexReader());
  } catch (IOException ioe) {
    rewrittenQuery = null;
    LOG.error("query.rewrite() failed", ioe);
  }

  if (rewrittenQuery == null)
    return getQuery();
  else
    return rewrittenQuery;
}


solr not working with magento enterprise 1.11

2012-01-24 Thread vishal_asc
I am integrating solr 3.5 with jetty in magento EE 1.11.

I have followed all the necessary steps, and configured and tested the solr
connection in the magento catalog system config.

I have copied the magento/lib/Solr/conf/ content to the solr installation. I
have run index management and restarted jetty, but when I search any word or
misspelling, it does not show me the "Did you mean?" string, i.e. it is not
correcting the misspelling. It seems solr is not returning results.

Please let me know how I can tell that solr is working with magento, and
where solr saves the XML documents when magento pushes attributes and product
information into solr. In which directory does it store them?



Re: solr not working with magento enterprise 1.11

2012-01-24 Thread David Radunz

Hey,

Shouldn't you be asking this question to the Magento people? You 
have an Enterprise edition, so you have paid for their support.


Cheers,

David

On 25/01/2012 2:57 PM, vishal_asc wrote:

[...]




Re: Solr Cores

2012-01-24 Thread Sujatha Arun
Thanks Erick.

Regards
Sujatha

On Mon, Jan 23, 2012 at 11:16 PM, Erick Erickson erickerick...@gmail.comwrote:

 You can have a large number of cores; some people have multiple
 hundreds. Having multiple cores is preferred over having
 multiple JVMs since it's more efficient at sharing system
 resources. If you're running a 32-bit JVM, you are limited in
 the amount of memory you can let the JVM use, so that's a
 consideration, but otherwise use multiple cores in one JVM,
 give that JVM, say, half of the physical memory on the
 machine, and tune from there.

 On Sun, Jan 22, 2012 at 8:16 PM, Sujatha Arun suja.a...@gmail.com wrote:
  Hello,
 
  We have in production a number of individual Solr instances on a single
  JVM. As a result, we see that the permgen space keeps increasing with each
  additional instance added.
 
  I would like to know if we can have Solr cores instead of individual
  instances.
 
 
- Is there any limit to the number of cores for a single instance?
- Will this decrease the permgen space, as the lib is shared?
- Would there be any decrease in performance as cores are added?
- Anything else that I should know before moving to cores?
 
 
  Any help would be appreciated.
 
  Regards
  Sujatha
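
To make the multi-core setup concrete, here is a minimal solr.xml sketch,
assuming two cores sharing one JVM and one sharedLib directory. The core names
and paths are illustrative only.

  <solr persistent="true" sharedLib="lib">
    <cores adminPath="/admin/cores">
      <!-- each core has its own instanceDir containing conf/schema.xml -->
      <core name="core1" instanceDir="core1" />
      <core name="core2" instanceDir="core2" />
    </cores>
  </solr>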



Re: solr not working with magento enterprise 1.11

2012-01-24 Thread vishal_asc
Thanks David. As of now we are configuring it on a local WAMP server, and we
only have the development version provided by the sales team.

Do you know where Solr saves the information, or where it pushes the XML docs,
when we run index management in Magento?

I followed this site:
http://www.summasolutions.net/blogposts/magento-apache-solr-set

Please let me know if you have any other info as well.



Re: solr not working with magento enterprise 1.11

2012-01-24 Thread David Radunz

Hey,

I am using Magento Community Edition; I wrote my own Magento
extension to integrate Solr and it works fine, so I really don't know
what the Enterprise edition does. On a personal and unrelated note, I
would never use Windows for a server; it's unreliable and most of the
system resources go to the OS.


Cheers,

David

On 25/01/2012 3:30 PM, vishal_asc wrote:

Thanks David. As of now we are configuring it on a local WAMP server, and we
only have the development version provided by the sales team.

Do you know where Solr saves the information, or where it pushes the XML docs,
when we run index management in Magento?

I followed this site:
http://www.summasolutions.net/blogposts/magento-apache-solr-set

Please let me know if you have any other info as well.

Best Regards,
Vishal Porwal





Re: SpellCheck Help

2012-01-24 Thread vishal_asc
I have installed the same Solr 3.5 with Jetty and am integrating it with
Magento 1.11, but it does not seem to be working: my search results do not
show a Did you mean? string when I misspell a word.

I followed all the steps necessary for the Magento/Solr integration.

Please help ASAP.

Thanks
Vishal

Sent from the Solr - User mailing list archive at Nabble.com.