date:20131121

Re: Suggester - how to return exact match?

2013-11-21 Thread Mirko

Hi,
I'd like to clarify our use case a bit more.

We want to return the exact search query as a suggestion only if it is
present in the index. So in my example we would expect to get the
suggestion "foo" for the query "foo", but no suggestion "abc" for the query
"abc" (because "abc" is not in the dictionary).

To me this use case seems quite common. Say we have three products in our
store: "foo", "foo 1", and "foo 2". If the user types "foo" in the product
search, we want to suggest all three products in the dropdown.

Is this something we can do with the Solr suggester?
Mirko
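
The behaviour asked for here can be sketched outside Solr as a plain prefix lookup over the dictionary. This is only an illustration of the expected results, not the Solr suggester itself; the function name suggest and the in-memory product list are assumptions made for the example.

```python
def suggest(query, dictionary):
    """Return dictionary entries that start with the query (case-insensitive).

    The query itself is returned only when it is a dictionary entry,
    because only dictionary entries are ever candidates.
    """
    prefix = query.lower()
    return sorted(term for term in dictionary if term.lower().startswith(prefix))


products = ["foo", "foo 1", "foo 2"]
print(suggest("foo", products))  # exact match plus the longer product names
print(suggest("abc", products))  # nothing, since "abc" is not in the dictionary
```

The point of the sketch is that the exact match needs no special case: it appears because it is a dictionary entry, and a query that is not in the dictionary produces no suggestion.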


2013/11/20 Developer bbar...@gmail.com

 Maybe there is a way to do this, but it doesn't make sense to return the
 same search query as a suggestion (the search query is not a suggestion, as
 it might or might not be present in the index).

 AFAIK you can use various lookup algorithms to get the suggestion list, and
 they look up the terms based on the query value (some algorithms implement
 fuzzy logic too). So searching Foo will return FooBar, Foo2 but not foo.

 You should fetch the suggestions only if numFound is greater than 0;
 otherwise you don't have any suggestions.




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Suggester-how-to-return-exact-match-tp4102203p4102259.html
 Sent from the Solr - User mailing list archive at Nabble.com.

SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4

2013-11-21 Thread RadhaJayalakshmi

Hi,

I am using Solr 4.4 with ZooKeeper 3.3.5. While checking the error
conditions of my application, I came across a strange issue.

Here is what I tried. I have three fields defined in my schema:

a) UNIQUE_KEY - of type solr.TrieLong
b) empId - of type solr.TrieLong
c) companyId - of type solr.TrieLong

How am I indexing:

I am indexing using the SolrJ API. The data to index is in a text file,
delimited by the | symbol. My indexer Java program reads the text file line
by line, splits each line on the | symbol, creates a SolrInputDocument
object for every line of the file, and adds the fields with the values it
read from the file.

Now, intentionally, I put String values (instead of long values) in the
data file for UNIQUE_KEY, something like:

123AB|111|222

When I index this data, I get the exception below:

org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request:[URL of my application]
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333)
    at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Server at [URL of my application] returned non ok status:500,
message:Internal Server Error
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:385)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)

But when I correct the UNIQUE_KEY data and instead give string data for the
other two long fields, I get a different exception:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
ERROR: [Error stating the field name for which it is mismatching]
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)
    at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)

What is my question here:

During indexing, if Solr finds that the data given for any field does not
match the field type declared in the schema, it should raise the same type
of exception. But in the above case, if it finds a mismatch for UNIQUE_KEY,
it raises SolrServerException. For all other fields, it raises
RemoteSolrException (which is an unchecked exception). Is this a bug in
Solr, or is there a reason for throwing different exceptions in these two
cases?

Expecting a positive reply.

Thanks,
Radha




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrServerException-while-adding-an-invalid-UNIQUE-KEY-in-solr-4-4-tp4102346.html
Sent from the Solr - User mailing list archive at Nabble.com.
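
Independently of which exception Solr raises, the input file could be checked before any document is sent. The sketch below is not something proposed in the thread; it assumes the three long fields and the |-delimited layout described in the message, and parse_line is a hypothetical helper name.

```python
FIELDS = ("UNIQUE_KEY", "empId", "companyId")  # order assumed from the sample line


def parse_line(line):
    """Split a |-delimited line and convert every value to an int.

    Raises ValueError naming the offending field, so a bad UNIQUE_KEY is
    reported the same way as any other bad long value.
    """
    values = line.rstrip("\n").split("|")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    doc = {}
    for name, raw in zip(FIELDS, values):
        try:
            doc[name] = int(raw)
        except ValueError:
            raise ValueError("field %s: %r is not a long" % (name, raw))
    return doc


print(parse_line("123|111|222"))
try:
    parse_line("123AB|111|222")
except ValueError as err:
    print(err)
```

Checking on the client side gives one uniform error for every field, whatever the server would have returned.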

Best implementation for multi-price store?

2013-11-21 Thread Alejandro Marqués Rodríguez

Hi,

I was recently asked to implement an application to search products from
several stores, each store having different prices and stock for the same
product.

So I have products that have the usual fields (name, description, brand,
etc.) and also the number of units and the price for each store. I must be
able to filter for a given store and order by stock or price for that store.
The application should also allow increasing the number of stores, the
number of store-dependent fields, and the number of products without much
work.

The numbers for the application are more or less 100 stores and 7M products.

I've been thinking of some ways of defining the index structure, but I don't
know which one is better, as I think each one has its pros and cons.


   1. *Each product-store as a document:* Denormalizing the information so
   for every product and store I have a different document. Pros are that I
   can filter and order without problems and that adding a new store-dependent
   field is very easy. Cons are that the index goes from 7M documents to 700M
   and that most of the info is redundant, as most of the fields are repeated
   among stores.
   2. *Each field-store as a field:* For example, for price I would have
   store1_price, store2_price, and so on. Pros are that the index stays at 7M
   documents, and I can still filter and sort by those fields. Cons are that I
   have to add some logic so that if I filter by one store I order by the
   associated price field, and that the number of fields grows as the number
   of store-dependent fields x the number of stores. I don't know if having
   more fields affects performance, but adding new store-dependent fields will
   increase the number of fields even more.
   3. *Join:* The first time I read about Solr joins I thought it was the way
   to go in this case, but after reading a bit more and doing some tests I'm
   not so sure about it... Maybe I've done it wrong, but I think it also
   denormalizes the info (so I will also have 700M documents), and besides I
   can't order or filter by store fields.


I must say my preferred option is number 2, so I don't duplicate
information, I keep a relatively small number of documents and I can filter
and sort by the store fields. However, my main concern here is I don't know
if having too many fields in a document will be harmful to performance.

Which one do you think is the best approach for this application? Is there
a better approach that I have missed?

Thanks in advance



-- 
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42

Parse eDisMax queries for keywords

2013-11-21 Thread Mirko

Hi,
We would like to implement special handling for queries that contain
certain keywords. Our particular use case:

In the example query "Footitle season 1" we want to detect the keyword
"season", get the number that follows it, and boost (or filter for) documents
that match "1" on the field named season.

We have two fields in our schema:

<!-- titles contains titles -->
<field name="title" type="text" indexed="true" stored="true"
  multiValued="false"/>

<fieldType name="text" class="solr.TextField" omitNorms="true">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
      mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- ... -->
  </analyzer>
</fieldType>

<field name="season" type="season_number" indexed="true" stored="false"
  multiValued="false"/>

<!-- season contains season numbers -->
<fieldType name="season_number" class="solr.TextField" omitNorms="true">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
      pattern=".*(?:season) *0*([0-9]+).*" replacement="$1"/>
  </analyzer>
</fieldType>


Our idea was to use a Keyword tokenizer and a Regex on the season field
to extract the season number from the complete query.

However, we use an ExtendedDisMax query parser in our search handler:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">
      title season
    </str>
  </lst>
</requestHandler>


The problem is that the eDisMax tokenizes the query, so that our field
season receives the tokens [Foo, season, 1] without any order,
instead of the complete query.

How can we pass the complete query (untokenized) to the season field? We
don't understand which tokenizer is used here and why our season field
received tokens instead of the complete query.

Or is there another approach to solve this use case with Solr?

Thanks,
Mirko
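
One alternative to doing this inside the analysis chain is to pull the keyword out of the query string in the client before the request reaches eDisMax. The sketch below assumes that approach, which is not confirmed anywhere in the thread; split_season and build_params are hypothetical helpers, and it assumes the season field is indexed as a plain number.

```python
import re

# Hypothetical pattern, modelled on the PatternReplaceFilterFactory regex above.
SEASON = re.compile(r"\bseason\s*0*([0-9]+)\b", re.IGNORECASE)


def split_season(query):
    """Return (remaining_query, season_number or None)."""
    match = SEASON.search(query)
    if match is None:
        return query.strip(), None
    remaining = (query[:match.start()] + " " + query[match.end():]).split()
    return " ".join(remaining), int(match.group(1))


def build_params(query):
    """Build request parameters; the season becomes a filter query (fq)."""
    text, season = split_season(query)
    params = {"q": text, "defType": "edismax", "qf": "title"}
    if season is not None:
        params["fq"] = "season:%d" % season
    return params


print(build_params("Footitle season 1"))
```

Using fq filters on the season; to boost instead of filter, the same value could be placed in a boost query parameter rather than fq.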

Re: facet method=enum and uninvertedfield limitations

2013-11-21 Thread Dmitry Kan

What is the actual target speed you are pursuing? Is this for user
suggestions or something of that sort? Content-based suggestions with
faceting, especially on Solr 1.4, won't be lightning fast.

Have you looked at TermsComponent?
http://wiki.apache.org/solr/TermsComponent

By shingles, which in the rest of the world are more commonly called
n-grams, I meant a way of reducing the number of entities to iterate
through. For example, you could store only bigrams or trigrams and facet on
those (there are fewer of them).

Dmitry
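
For readers who have not met shingles before, the sketch below shows what word bigrams and trigrams of a short text look like. It is a plain illustration, not Solr's shingle filter; the function name shingles and the sample sentence are assumptions.

```python
def shingles(text, size):
    """Return the word n-grams (shingles) of the given size, in order."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


print(shingles("the quick brown fox", 2))
print(shingles("the quick brown fox", 3))
```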

On Wed, Nov 20, 2013 at 6:10 PM, Lemke, Michael SZ/HZA-ZSW
lemke...@schaeffler.com wrote:

On Wednesday, November 20, 2013 7:37 AM, Dmitry Kan wrote:

Thanks for your reply.

Since you are faceting on a text field (is this correct?) you deal with a
lot of unique values in it.

Yes, this is a text field and we experimented with reducing the index. As
I said in my original question the stripped down index had 178,000 terms
and it (fc) still didn't work. Is number of terms the relevant quantity?

So your best bet is enum method.

Hm, yes, that works but I have to wait 4 minutes for the answer (with the
original data). Not good.

Also if you
are on solr 4x try building doc values in the index: this suits faceting
well.

We are on Solr 1.4, so, no.

Otherwise start from your spec once again. Can you use shingles instead?

Possibly but I don't know shingles. Although I'd prefer to use our
original
index we are trying to build a specialized index just for this sort of
query but still don't know what to look for.

A query like

q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0

would give me the top ten results containing 'word' and something starting
with 'a'. That's what I want. An empty facet.prefix should also work.
Eventually, the query will be more complex containing other fields and
filter queries but the basic function should be exactly like this. How
can we achieve this?

Thanks,
Michael

On 19 Nov 2013 17:44, Lemke, Michael SZ/HZA-ZSW
lemke...@schaeffler.com
wrote:

On Friday, November 15, 2013 11:22 AM, Lemke, Michael SZ/HZA-ZSW wrote:

Judging from numerous replies this seems to be a tough question.
Nevertheless, I'd really appreciate any help as we are stuck.
We'd really like to know what in our index causes the facet.method=fc
query to fail.

Thanks,
Michael

On Thu, November 14, 2013 7:26 PM, Yonik Seeley wrote:
On Thu, Nov 14, 2013 at 12:03 PM, Lemke, Michael SZ/HZA-ZSW
lemke...@schaeffler.com wrote:
I am running into performance problems with faceted queries.
If I do a

q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0

I am getting an exception:
org.apache.solr.common.SolrException: Too many values for
UnInvertedField faceting on field CONTENT
    at org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
    at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
    at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
    ...

I understand it's got something to do with a 24bit limit somewhere
in the code but I don't understand enough of it to be able to
construct
a specialized index that can be queried with facet.method=enum.

You shouldn't need to do anything differently to try facet.method=enum
(just replace facet.method=fc with facet.method=enum)

This is true and facet.method=enum does work indeed. The problem is
runtime. In particular queries with an empty facet.prefix= run many
seconds if not minutes. I initially asked about this here:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201310.mbox/%3c33ec3398272fbe47b64ee3b3e98f69a761427...@de011521.schaeffler.com%3E

It was suggested that fc is much faster than enum and I'd like to
test that. We are still fairly free to design the index such that
it performs well. But to do that we need to understand what is
killing it.

You may also want to add the parameter
facet.enum.cache.minDf=100000
to lower memory usage by only using the filter cache for terms that
match more than 100K docs.

That helped a little, cut down my particular test from 10 sec to 5 sec.
But still too slow. Mind you this is for an autosuggest feature.

Thanks for your reply.

Michael

--
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: twitter.com/dmitrykan
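
The facet request discussed in this thread is easier to vary (fc versus enum, with or without a prefix) when it is built from named parameters. The sketch below only assembles the query string; the function name facet_query is an assumption, and the field name CONTENT is taken from the messages above.

```python
from urllib.parse import urlencode


def facet_query(word, prefix, method="enum", limit=10):
    """Build the query string for a prefix-restricted facet request."""
    params = [
        ("q", word),
        ("facet", "true"),
        ("facet.field", "CONTENT"),
        ("facet.limit", limit),
        ("facet.mincount", 1),
        ("facet.method", method),
        ("facet.prefix", prefix),
        ("rows", 0),
    ]
    return urlencode(params)


print(facet_query("word", "a"))
print(facet_query("word", "", method="fc"))
```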

RE: Best implementation for multi-price store?

2013-11-21 Thread Petersen, Robert

Hi,

I'd go with (2) also but using dynamic fields so you don't have to define all 
the storeX_price fields in your schema but rather just one *_price field.  Then 
when you filter on store:store1 you'd know to sort with store1_price and so 
forth for units.  That should be pretty straightforward.

Hope that helps,
Robi
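
The "filter on the store, sort on that store's field" logic can be sketched as a small helper. This is a minimal illustration under assumptions: the field names follow the store1_price pattern from the thread, the store field is taken from the store:store1 example above, and store_params is a hypothetical name.

```python
def store_params(store, sort_by="price", order="asc"):
    """Build filter and sort parameters for one store.

    Assumes dynamic fields named <store>_price and <store>_units and a
    'store' field listing the stores that carry the product.
    """
    if sort_by not in ("price", "units"):
        raise ValueError("sort_by must be 'price' or 'units'")
    return {
        "fq": "store:%s" % store,
        "sort": "%s_%s %s" % (store, sort_by, order),
    }


print(store_params("store1"))
print(store_params("store2", sort_by="units", order="desc"))
```

Because the field name is derived from the store identifier, adding a store needs no change to this logic, only new values in the documents.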

-Original Message-
From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com] 
Sent: Thursday, November 21, 2013 1:36 AM
To: solr-user@lucene.apache.org
Subject: Best implementation for multi-price store?


Indexing data to a specific collection in Solr 4.5.0

2013-11-21 Thread Reyes, Mark

Hi all:

I’m currently on a Solr 4.5.0 instance and running this tutorial, 
http://lucene.apache.org/solr/4_5_0/tutorial.html

My question is specific to indexing data as proposed from this tutorial,

$ java -jar post.jar solr.xml monitor.xml

The tutorial advises validating from your localhost,
http://localhost:8983/solr/collection1/select?q=solr&wt=xml

However, what if my Solr instance has both a collection1 and a collection2,
and I want the XML files to be posted to collection2 only?

If possible, please advise.

Thanks,
Mark

IMPORTANT NOTICE: This e-mail message is intended to be received only by 
persons entitled to receive the confidential information it may contain. E-mail 
messages sent from Bridgepoint Education may contain information that is 
confidential and may be legally privileged. Please do not read, copy, forward 
or store this message unless you are an intended recipient of it. If you 
received this transmission in error, please notify the sender by reply e-mail 
and delete the message and any attachments.

Re: Facet field query on subset of documents

2013-11-21 Thread Luis Lebolo

Hi Erick,

Thanks for the reply and sorry, my fault, wasn't clear enough. I was
wondering if there was a way to remove terms that would always be zero
(because the term came from a document that didn't match the filter query).

Here's an example. I have a bunch of documents with fields 'manufacturer'
and 'location'. If I set my filter query to manufacturer = Sony and all
Sony documents had a value of 'Florida' for location, then I want 'Florida'
NOT to show up in my facet field results. Instead, it shows up with a count
of zero (and it'll always be zero because of my filter query).

Using mincount = 1 doesn't solve my problem because I don't want it to hide
zeroes that came from documents that actually pass my filter query.

Does that make more sense?


On Thu, Nov 21, 2013 at 4:36 PM, Erick Erickson erickerick...@gmail.com wrote:

 That's what faceting does. The facets are only tabulated
 for documents that satisfy the query, including all of
 the filter queries and any other criteria.

 Otherwise, facet counts would be the same no matter
 what the query was.

 Or I'm completely misunderstanding your question...

 Best,
 Erick


 On Thu, Nov 21, 2013 at 4:22 PM, Luis Lebolo luis.leb...@gmail.com
 wrote:

  Hi All,
 
  Is it possible to perform a facet field query on a subset of documents
 (the
  subset being defined via a filter query for instance)?
 
  I understand that facet pivoting might work, but it would require that
 the
  subset be defined by some field hierarchy, e.g. manufacturer - price
 (then
  only look at the results for the manufacturer I'm interested in).
 
  What if I wanted to define a more complex subset (where the name starts
  with A but ends with Z and some other field is greater than 5 and yet
  another field is not 'x', etc.)?
 
  Ideally I would then define a facet field constraining query to include
  only terms from documents that pass this query.
 
  Thanks,
  Luis

Periodic Slowness on Solr Cloud

2013-11-21 Thread Dave Seltzer

I'm doing some performance testing against an 8-node Solr cloud cluster,
and I'm noticing some periodic slowness.


http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png

I'm doing random test searches against an Alias Collection made up of four
smaller (monthly) collections. Like this:

MasterCollection
|- Collection201308
|- Collection201309
|- Collection201310
|- Collection201311

The last collection is constantly updated. New documents are being added at
the rate of about 3 documents per second.

I believe the slowness may be due to NRT, but I'm not sure. How should I
investigate this?

If the slowness is related to NRT, how can I alleviate the issue without
disabling NRT?

Thanks Much!

-Dave

RE: search with wildcard

2013-11-21 Thread Scott Schneider

I know it's documented that Lucene/Solr doesn't apply filters to queries with
wildcards, but this seems to trip up a lot of users.  I can also see why
wildcards break a number of filters, but a number of filters (e.g. mapping
charsets) could mostly or entirely work.  The N-gram filter is another one that
would be great to still run when there are wildcards.  If you indexed 4-grams
and the query is *testp*, you currently won't get any results; but the N-gram
filter could have a wildcard mode that, in this case, would return just the
first 4-gram as a token.

Is this something you've considered?  It would have to be enabled in the core
network, but disabled by default for existing filters; then it could be enabled
one by one for existing filters.  Apologies if the dev list is a better place
for this.

Scott
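
The proposed wildcard mode can be sketched in a few lines. This is only an illustration of the idea; no such mode is claimed to exist in Lucene or Solr, and the names ngrams and wildcard_gram are hypothetical.

```python
def ngrams(term, size):
    """Return all character n-grams of the given size, in order."""
    return [term[i:i + size] for i in range(len(term) - size + 1)]


def wildcard_gram(pattern, size):
    """Hypothetical 'wildcard mode': for *text*, return the first n-gram of text."""
    text = pattern.strip("*").lower()
    grams = ngrams(text, size)
    return grams[0] if grams else None


indexed = ngrams("supertestplan", 4)
print(indexed)
token = wildcard_gram("*testp*", 4)
print(token, token in indexed)
```

With 4-grams indexed for "supertestplan", the query *testp* reduces to the single token "test", which is one of the indexed grams, so the document would match.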


 -Original Message-
 From: Ahmet Arslan [mailto:iori...@yahoo.com]
 Sent: Thursday, November 21, 2013 8:40 AM
 To: solr-user@lucene.apache.org
 Subject: Re: search with wildcard
 
 Hi Andreas,
 
 If you don't want to use wildcards at query time, an alternative is to
 use NGrams at indexing time. This will produce a lot of tokens. For
 example, the 4-grams of your example: Supertestplan = supe uper pert
 erte rtes *test* estp stpl tpla plan
 
 
 Is that what you want? By the way, why do you want to search inside words?
 
 <filter class="solr.NGramFilterFactory" minGramSize="3"
 maxGramSize="4"/>
 
 
 
 
 On Thursday, November 21, 2013 5:23 PM, Andreas Owen a...@conx.ch
 wrote:
 
 I suppose I have to create another field with different tokenizers and set
 the boost very low so it doesn't really mess with my ranking, because the
 word is then in 2 fields. What kind of tokenizer can do the job?
 
 
 
 From: Andreas Owen [mailto:a...@conx.ch]
 Sent: Thursday, 21 November 2013 16:13
 To: solr-user@lucene.apache.org
 Subject: search with wildcard
 
 
 
 I am querying "test" in Solr 4.3.1 over the field below and it's not
 finding all occurrences. It seems that if it is a substring of a word like
 "Supertestplan" it isn't found unless I use wildcards: *test*. This is
 right because of my tokenizer, but does someone know a way around this? I
 don't want to add wildcards because that messes up queries with multiple
 words.
 
 <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
       words="lang/stopwords_de.txt" format="snowball"
       enablePositionIncrements="true"/> <!-- remove common words -->
     <filter class="solr.GermanNormalizationFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="German"/>
       <!-- remove noun/adjective inflections like plural endings -->
   </analyzer>
 </fieldType>

Re: Facet field query on subset of documents

2013-11-21 Thread Erick Erickson

That's what faceting does. The facets are only tabulated
for documents that satisfy the query, including all of
the filter queries and any other criteria.

Otherwise, facet counts would be the same no matter
what the query was.

Or I'm completely misunderstanding your question...

Best,
Erick
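
The counting rule described here can be shown with a toy in-memory index. The sketch is an illustration under assumptions, not Solr code: the documents, the is_sony predicate, and the facet_counts helper are made up for the example.

```python
from collections import Counter


def facet_counts(docs, field, matches, mincount=0):
    """Count values of `field` over the documents accepted by `matches`.

    Every value in the whole index is a candidate term, so values that occur
    only in non-matching documents come back with a count of zero unless
    mincount filters them out.
    """
    counts = Counter({doc[field]: 0 for doc in docs})
    counts.update(doc[field] for doc in docs if matches(doc))
    return {term: n for term, n in counts.items() if n >= mincount}


docs = [
    {"manufacturer": "Sony", "location": "Texas"},
    {"manufacturer": "Sony", "location": "Texas"},
    {"manufacturer": "Acme", "location": "Florida"},
]
is_sony = lambda doc: doc["manufacturer"] == "Sony"
print(facet_counts(docs, "location", is_sony))
print(facet_counts(docs, "location", is_sony, mincount=1))
```

Only documents passing the filter contribute to the counts; a term that exists solely in filtered-out documents shows up with zero, and a minimum count of 1 hides it.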


On Thu, Nov 21, 2013 at 4:22 PM, Luis Lebolo luis.leb...@gmail.com wrote:

 Hi All,

 Is it possible to perform a facet field query on a subset of documents (the
 subset being defined via a filter query for instance)?

 I understand that facet pivoting might work, but it would require that the
 subset be defined by some field hierarchy, e.g. manufacturer - price (then
 only look at the results for the manufacturer I'm interested in).

 What if I wanted to define a more complex subset (where the name starts
 with A but ends with Z and some other field is greater than 5 and yet
 another field is not 'x', etc.)?

 Ideally I would then define a facet field constraining query to include
 only terms from documents that pass this query.

 Thanks,
 Luis

Re: Indexing data to a specific collection in Solr 4.5.0

2013-11-21 Thread xiezhide


Add Durl=http://localhost:8983/solr/collection2/update when you run post.jar.
This email was sent from 189 Mail.

Reyes, Mark mark.re...@bpiedu.com wrote:

Hi all:

I’m currently on a Solr 4.5.0 instance and running this tutorial, 
http://lucene.apache.org/solr/4_5_0/tutorial.html

My question is specific to indexing data as proposed from this tutorial,

$ java -jar post.jar solr.xml monitor.xml

The tutorial advises to validate from your localhost,
http://localhost:8983/solr/collection1/select?q=solr&wt=xml

However, what if my Solr core has both a collection1 and collection2, yet I 
desire the XML files to only be posted to collection2 only?

If possible, please advise.

Thanks,
Mark


search with wildcard

2013-11-21 Thread Andreas Owen

I am querying "test" in Solr 4.3.1 over the field below and it's not finding
all occurrences. It seems that if it is a substring of a word like
"Supertestplan" it isn't found unless I use wildcards: *test*. This is
right because of my tokenizer, but does someone know a way around this? I
don't want to add wildcards because that messes up queries with multiple
words.

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
      words="lang/stopwords_de.txt" format="snowball"
      enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      <!-- remove noun/adjective inflections like plural endings -->
  </analyzer>
</fieldType>

Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Erick Erickson

How real time is NRT? In particular, what are your commit settings?

And can you characterize the periodic slowness? Queries that usually
take 500ms now take 10s? Or 1s? How often? How are you measuring?

Details matter, a lot...

Best,
Erick
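
One way to answer the "how are you measuring" question is to summarize recorded query latencies as percentiles rather than an average. The sketch below assumes the latencies are already collected in milliseconds; the sample numbers are made up, and percentile and summarize are hypothetical helper names.

```python
import math


def percentile(latencies_ms, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Report the median, 95th, and 99th percentile plus the worst case."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "max": max(latencies_ms),
    }


# Made-up sample: mostly fast queries with one slow outlier.
sample = [480, 510, 495, 505, 500, 490, 515, 485, 520, 9800]
print(summarize(sample))
```

A large gap between the median and the upper percentiles is the signature of periodic slowness, as opposed to queries that are uniformly slow.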




On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer dselt...@tveyes.com wrote:

 I'm doing some performance testing against an 8-node Solr cloud cluster,
 and I'm noticing some periodic slowness.


 http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png

 I'm doing random test searches against an Alias Collection made up of four
 smaller (monthly) collections. Like this:

 MasterCollection
 |- Collection201308
 |- Collection201309
 |- Collection201310
 |- Collection201311

 The last collection is constantly updated. New documents are being added at
 the rate of about 3 documents per second.

 I believe the slowness may be due to NRT, but I'm not sure. How should I
 investigate this?

 If the slowness is related to NRT, how can I alleviate the issue without
 disabling NRT?

 Thanks Much!

 -Dave

Multiple similarity scores for the same text field

2013-11-21 Thread Nikos Voskarides

I have the following simplified setting:
My schema contains one text field, named text.
When I perform a query, I need to get the scores for the same text field
but for different similarity functions (e.g. TFIDF, BM25..) and combine
them externally using different weights.
An obvious way to achieve this is to keep multiple copies of the text field
in the schema for each similarity. I am wondering though whether there is a
more space-efficient way of doing this.

Thanks,

Nikos
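
The external combination step described above amounts to a weighted sum per document. The sketch shows only that step; how the per-similarity scores are obtained from Solr is left open, and the scores, weights, and the name combine are assumptions.

```python
def combine(scores, weights):
    """Weighted sum of per-similarity scores for one document."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError("no score for: %s" % ", ".join(sorted(missing)))
    return sum(weights[name] * scores[name] for name in weights)


# Made-up scores for a single document under two similarity functions.
doc_scores = {"tfidf": 2.0, "bm25": 4.0}
print(combine(doc_scores, {"tfidf": 0.25, "bm25": 0.75}))
```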

Re: Indexing data to a specific collection in Solr 4.5.0

2013-11-21 Thread Erick Erickson

you're leaving off the - in front of the D,
-Durl.

Try java -jar post.jar -help for a list of options available
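
Since -Durl sets a Java system property, it has to appear before -jar on the command line; anything after post.jar is treated as a file to post. The sketch below builds such a command for a given collection. The helper name post_command and its host and port defaults are assumptions taken from the URLs in this thread.

```python
def post_command(collection, files, host="localhost", port=8983):
    """Build the argument list for posting files to one collection.

    -Durl is a JVM system property, so it is placed before -jar.
    """
    url = "http://%s:%d/solr/%s/update" % (host, port, collection)
    return ["java", "-Durl=" + url, "-jar", "post.jar"] + list(files)


print(" ".join(post_command("collection2", ["solr.xml", "monitor.xml"])))
```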


On Thu, Nov 21, 2013 at 12:04 PM, Reyes, Mark mark.re...@bpiedu.com wrote:

 So then,
 $ java -jar post.jar Durl=http://localhost:8983/solr/collection2/update
 solr.xml monitor.xml





 On 11/21/13, 8:14 AM, xiezhide xiezh...@gmail.com wrote:

 
 add Durl=http://localhost:8983/solr/collection2/update when run post.jar,
 This email was sent from 189 Mail.
 
 Reyes, Mark mark.re...@bpiedu.com wrote:
 
 Hi all:
 
 I’m currently on a Solr 4.5.0 instance and running this tutorial,
 http://lucene.apache.org/solr/4_5_0/tutorial.html
 
 My question is specific to indexing data as proposed from this tutorial,
 
 $ java -jar post.jar solr.xml monitor.xml
 
 The tutorial advises to validate from your localhost,
 http://localhost:8983/solr/collection1/select?q=solr&wt=xml
 
 However, what if my Solr core has both a collection1 and collection2,
 yet I desire the XML files to only be posted to collection2 only?
 
 If possible, please advise.
 
 Thanks,
 Mark
 

Re: Suggester - how to return exact match?

2013-11-21 Thread Developer

It might not be a perfect solution, but you can use an edge n-gram filter:
copy all your field data to a field of that type and use it for suggestions.

<fieldType name="text_autocomplete" class="solr.TextField"
  positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
      maxGramSize="250"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

http://localhost:8983/solr/core1/select?q=name:iphone

The above query will return 
iphone
iphone5c
iphone4g
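
To see why the exact term comes back along with the longer ones, here is a
minimal Python sketch of what the two analyzers above do. It only imitates the
analysis chain (whitespace tokenizing, lowercasing, leading-edge n-grams); the
function names are made up for illustration and are not Solr APIs, and the
"every query token must match" rule is a simplification of this sketch.

```python
def edge_ngrams(token, min_gram=1, max_gram=250):
    """Return the leading-edge n-grams of a token, shortest first."""
    token = token.lower()
    longest = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, longest + 1)]


def index_terms(text):
    """Index side: whitespace-tokenize, lowercase, edge n-gram each token."""
    terms = set()
    for token in text.split():
        terms.update(edge_ngrams(token))
    return terms


def matches(query, indexed_text):
    """Query side: whitespace-tokenize and lowercase only (no n-grams).

    In this sketch a document matches when every query token equals one
    of the grams that were produced at index time.
    """
    grams = index_terms(indexed_text)
    return all(token.lower() in grams for token in query.split())


if __name__ == "__main__":
    for name in ["iphone", "iphone5c", "iphone4g", "ipad"]:
        print(name, matches("iphone", name))
```

Because the longest gram of a token is the token itself, the query iphone
matches the document iphone as well as iphone5c and iphone4g, while a query
for a term that was never indexed matches nothing.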




Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Mark Miller

Yes, more details…

Solr version, which garbage collector, how does heap usage look, cpu, etc.

- Mark

On Nov 21, 2013, at 6:46 PM, Erick Erickson erickerick...@gmail.com wrote:

 How real-time is NRT? In particular, what are your commit settings?
 
 And can you characterize "periodic slowness"? Queries that usually
 take 500ms now take 10s? Or 1s? How often? How are you measuring?
 
 Details matter, a lot...
 
 Best,
 Erick
 
 
 
 
 On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer dselt...@tveyes.com wrote:
 
 I'm doing some performance testing against an 8-node Solr cloud cluster,
 and I'm noticing some periodic slowness.
 
 
 http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png
 
 I'm doing random test searches against an Alias Collection made up of four
 smaller (monthly) collections. Like this:
 
 MasterCollection
 |- Collection201308
 |- Collection201309
 |- Collection201310
 |- Collection201311
 
 The last collection is constantly updated. New documents are being added at
 the rate of about 3 documents per second.
 
 I believe the slowness may be due to NRT, but I'm not sure. How should I
 investigate this?
 
 If the slowness is related to NRT, how can I alleviate the issue without
 disabling NRT?
 
 Thanks Much!
 
 -Dave

Re: SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4

2013-11-21 Thread Shawn Heisey


On 11/21/2013 1:57 AM, RadhaJayalakshmi wrote:

Hi,

I am using Solr 4.4 with ZooKeeper 3.3.5. While I was checking the error
conditions of my application, I came across a strange issue. Here is what I
tried.

I have three fields defined in my schema:
a) UNIQUE_KEY - of type solr.TrieLong
b) empId - of type solr.TrieLong
c) companyId - of type solr.TrieLong

How am I indexing: I am indexing using the SolrJ API, and the data for
indexing is in a text file, delimited by the | symbol. My indexer Java
program reads the text file line by line, splits the data by the | symbol,
creates a SolrInputDocument object for every line of the file, and adds the
fields with the values it read from the file.

Now, intentionally, in the data file I put String values (instead of long
values) for the unique key, something like:

123AB|111|222

When I index this data, I get the exception below:

org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request: [URL of my application]
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333)
    at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at [URL of my application] returned non ok status:500, message:Internal Server Error
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:385)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)

But when I correct the unique key data and instead give String data for the
other two long fields, I get a different exception:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [Error stating the field name for which it is mismatching]
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)
    at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)

My question: during indexing, if Solr finds that the field type declared in
the schema for any field does not match the data being given, it should raise
the same type of exception. But in the above case, a mismatch for the unique
key raises SolrServerException, while for all other fields it raises
RemoteSolrException (which is an unchecked exception). Is this a bug in Solr,
or is there a reason for throwing different exceptions in these two cases?

Expecting a positive reply.

Thanks,
Radha



The first exception is an error thrown directly from SolrJ.  It was 
unable to find any server to deal with the request, so it threw its own 
SolrServerException wrapping the last RemoteSolrException (HTTP error 
500) it received.


The second exception happened in a different place.  In this case, the 
request made it past the server-side uniqueKey handling code and into 
the code that handles other fields, which from what I can see here 
returns a different error message and possibly a different HTTP code.  
Because it was different, SolrJ sent the RemoteSolrException up the 
chain to your application rather than catching and wrapping it in 
SolrServerException.


I am not surprised to hear that you get a different error for invalid 
data in the uniqueKey field than you do in other fields. Because of its 
nature, it must be handled in a different code path.


Thanks,
Shawn

Re: Split shard and stream sub-shards to remote nodes?

2013-11-21 Thread Otis Gospodnetic

Hi,

On Wed, Nov 20, 2013 at 12:53 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 At the Lucene level, I think it would require a directory
 implementation which writes to a remote node directly. Otherwise, on
 the solr side, we must move the leader itself to another node which
 has enough disk space and then split the index.


Hm, what about taking the source shard, splitting it, and sending the docs
that come out of each sub-shard to a remote node at the Solr level, as if
these documents were just being added (i.e. nothing at the Lucene level)?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/





 On Wed, Nov 20, 2013 at 8:37 PM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Do you think this is something that is actually implementable?  If so,
  I'll open an issue.
 
  One use-case where this may come in handy is when the disk space is
  tight.  If a shard is using > 50% of the disk space on some node X,
  you can't really split that shard because the 2 new sub-shards will
  not fit on the local disk.  Or is there some trick one could use in
  this situation?
 
  Thanks,
  Otis
 
 
  On Wed, Nov 20, 2013 at 6:48 AM, Shalin Shekhar Mangar
  shalinman...@gmail.com wrote:
  No, it is not supported yet. We can't split to a remote node directly.
  The best bet is trigger a new leader election by unloading the leader
  node once all replicas are active.
 
  On Wed, Nov 20, 2013 at 1:32 AM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
  Hi,
 
  Is it possible to perform a shard split and stream data for the
  new/sub-shards to remote nodes, avoiding persistence of new/sub-shards
  on the local/source node first?
 
  Thanks,
  Otis
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.



 --
 Regards,
 Shalin Shekhar Mangar.

Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Walter Underwood

And this is the exact problem. Some characters are stored as entities, some are 
not. When it is time to display, what else needs to be escaped? At a minimum, you 
would have to always store "&" as "&amp;" to avoid escaping the leading ampersand 
in the entities.

You could store every single character as a numeric entity. Or you could store 
every non-ASCII character as a numeric entity. Or every non-Latin1 character. 
Plus ampersand, of course.

In these e-mails, we are distinguishing between "™" and "&trade;". How would you do 
that? By storing "&trade;" as "&amp;trade;".

To avoid all this double-think, always store text as Unicode code points, 
encoded with a standard Unicode method (UTF-8, etc.).

When displaying, only make entities if the codepoints cannot be represented in 
the target character encoding. If you are sending things in US-ASCII, you will 
be sending lots of entities.

A good encoding library has callbacks for characters that cannot be 
represented. You can use these callbacks to format out-of-charset codepoints as 
entities. I've done this in product code, it really works.
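
As an illustration of such callbacks (not the product code mentioned above),
Python's codecs module has this built in: the standard "xmlcharrefreplace"
error handler writes decimal character references for anything the target
encoding cannot represent, and codecs.register_error accepts a custom
callback. The handler name "hexentity" below is invented for this sketch.

```python
import codecs


def hex_entity_handler(error):
    """Encoding callback: replace unencodable characters with hex references."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    unencodable = error.object[error.start:error.end]
    replacement = "".join("&#x{:X};".format(ord(ch)) for ch in unencodable)
    # Return the replacement text and the position where encoding resumes.
    return replacement, error.end


# "hexentity" is a name chosen for this sketch, not a standard handler.
codecs.register_error("hexentity", hex_entity_handler)


if __name__ == "__main__":
    text = "X\u2122 - Black"  # X™ - Black, kept as plain Unicode in storage
    print(text.encode("utf-8"))                       # no entities needed
    print(text.encode("ascii", "xmlcharrefreplace"))  # decimal reference
    print(text.encode("ascii", "hexentity"))          # custom callback
```

Only the output step decides whether an entity is needed. Note that neither
handler escapes "&" itself; markup escaping is a separate step.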

Finally, if you don't believe me, believe the XML Infoset, where numeric 
entities are always treated as Unicode codepoints.

The other way to go insane is storing local time in the database. Always store 
UTC and convert at the edges.

wunder

On Nov 21, 2013, at 7:50 AM, Jack Krupansky j...@basetechnology.com wrote:

 Would you store "a" as "&#65;"?
 
 No, not in any case.
 
 -- Jack Krupansky
 
 -Original Message- From: Michael Sokolov
 Sent: Thursday, November 21, 2013 8:56 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
 
 I have to agree w/Walter.  Use unicode as a storage format.  The entity
 encodings are for transfer/interchange.  Encode/decode on the way in and
 out if you have to.  Would you store "a" as "&#65;"?  It makes it
 impossible to search for, for one thing.  What if someone wants to
 search for the TM character?
 
 -Mike
 
 On 11/20/13 12:07 PM, Jack Krupansky wrote:
 AFAICT, it's not an extremely bad idea - using SGML/HTML as a format for 
 storing text to be rendered. If you disagree - try explaining yourself.
 
 But maybe TM should be encoded as "&trade;". Ditto for other named SGML 
 entities.
 
 -- Jack Krupansky
 
 -Original Message- From: Walter Underwood
 Sent: Wednesday, November 20, 2013 11:21 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
 
 Again, I'd like to know why this is wanted. It sounds like an X-Y problem. 
 Storing Unicode characters as XML/HTML encoded character references is an 
 extremely bad idea.
 
 wunder
 
 On Nov 20, 2013, at 5:01 AM, Jack Krupansky j...@basetechnology.com 
 wrote:
 
 Any analysis filtering affects the indexed value only, but the stored value 
 would be unchanged from the original input value. An update processor lets 
 you modify the original input value that will be stored.
 
 -- Jack Krupansky
 
 -Original Message- From: Uwe Reh
 Sent: Wednesday, November 20, 2013 5:43 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
 
 What about having a simple charfilter in the analyzer chain for
 indexing *and* searching? E.g.
 <charFilter class="solr.PatternReplaceFilterFactory" pattern="™"
 replacement="&#8482;" />
 or
 <charFilter class="solr.MappingCharFilterFactory"
 mapping="mapping-specials.txt" />
 
 Uwe
 
 On 19.11.2013 at 23:46, Developer wrote:
 I have data coming in to SOLR as below.
 
 <field name="displayName">X™ - Black</field>
 
 I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;)
 in SOLR rather than storing the original value.
 
 Is there a way to do this?
 
 
 -- 
 Walter Underwood
 wun...@wunderwood.org
 
 
 

--
Walter Underwood
wun...@wunderwood.org

Re: Indexing data to a specific collection in Solr 4.5.0

2013-11-21 Thread xiezhide



This email was sent from 189 Mail

Reyes, Mark mark.re...@bpiedu.com wrote:

Hi all:

I’m currently on a Solr 4.5.0 instance and running this tutorial, 
http://lucene.apache.org/solr/4_5_0/tutorial.html

My question is specific to indexing data as proposed from this tutorial,

$ java -jar post.jar solr.xml monitor.xml

The tutorial advises to validate from your localhost,
http://localhost:8983/solr/collection1/select?q=solr&wt=xml

However, what if my Solr instance has both a collection1 and a collection2, and I 
want the XML files to be posted to collection2 only?

If possible, please advise.

Thanks,
Mark


Facet field query on subset of documents

2013-11-21 Thread Luis Lebolo

Hi All,

Is it possible to perform a facet field query on a subset of documents (the
subset being defined via a filter query for instance)?

I understand that facet pivoting might work, but it would require that the
subset be defined by some field hierarchy, e.g. manufacturer -> price (then
only look at the results for the manufacturer I'm interested in).

What if I wanted to define a more complex subset (where the name starts
with A but ends with Z and some other field is greater than 5 and yet
another field is not 'x', etc.)?

Ideally I would then define a facet field constraining query to include
only terms from documents that pass this query.
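
For reference, facet counts in Solr are computed over the documents that match
q together with every fq, so a request of the following shape should give this
behaviour. The sketch below only builds the query string; the field names
(name, price, color, manufacturer) are hypothetical stand-ins for the fields
described above, and the helper name is invented.

```python
from urllib.parse import urlencode


def facet_on_subset_params(filters, facet_field, main_query="*:*"):
    """Build select parameters that facet only over the filtered subset."""
    params = [("q", main_query)]
    params += [("fq", clause) for clause in filters]  # one fq per constraint
    params += [
        ("rows", "0"),            # only the facet counts are wanted
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", "1"),  # hide terms that occur only outside the subset
        ("wt", "json"),
    ]
    return urlencode(params)


if __name__ == "__main__":
    query_string = facet_on_subset_params(
        filters=["name:A*Z", "price:{5 TO *]", "-color:x"],
        facet_field="manufacturer",
    )
    print("http://localhost:8983/solr/collection1/select?" + query_string)
```

With facet.mincount=1 the facet list contains only terms from documents that
pass all the filters.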

Thanks,
Luis

Re: Parse eDisMax queries for keywords

2013-11-21 Thread Jack Krupansky

The query parser does its own tokenization and parsing before your analyzer's 
tokenizer and filters are called, ensuring that only one 
whitespace-delimited token is analyzed at a time.


You're probably best off having an application layer preprocessor for the 
query that enriches the query in the manner that you're describing.


Or, simply settle for a heuristic approach that may give you 70% of what 
you want using only existing Solr features on the server side.
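
A minimal sketch of such an application-layer preprocessor, in Python. It is
only one possible shape: the function name, the boost factor of 5 and the
choice of the bq parameter are assumptions for illustration; the regular
expression follows the one in the season_number field type quoted below.

```python
import re

# "season", optional whitespace, optional leading zeros, then the number.
SEASON_RE = re.compile(r"\bseason\s*0*([0-9]+)\b", re.IGNORECASE)


def preprocess(query, boost=5):
    """Pull "season N" out of the user query and turn it into a boost query."""
    params = {"defType": "edismax", "qf": "title", "q": query}
    match = SEASON_RE.search(query)
    if match is None:
        return params
    season = match.group(1)
    # Remove the keyword and number, then collapse leftover whitespace.
    remaining = query[:match.start()] + " " + query[match.end():]
    remaining = " ".join(remaining.split())
    params["q"] = remaining or "*:*"
    params["bq"] = "season:{}^{}".format(season, boost)
    return params


if __name__ == "__main__":
    print(preprocess("Footitle season 1"))
```

Using fq instead of bq with the same season:N clause would filter rather than
boost.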


-- Jack Krupansky

-Original Message- 
From: Mirko

Sent: Thursday, November 21, 2013 5:30 AM
To: solr-user@lucene.apache.org
Subject: Parse eDisMax queries for keywords

Hi,
We would like to implement special handling for queries that contain
certain keywords. Our particular use case:

In the example query "Footitle season 1" we want to detect the keyword
"season", get the number that follows it, and boost (or filter for) documents
that match "1" on the "season" field.

We have two fields in our schema:

<!-- titles contains titles -->
<field name="title" type="text" indexed="true" stored="true" multiValued="false"/>

<fieldType name="text" class="solr.TextField" omitNorms="true">
   <analyzer>
   <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <!-- ... -->
   </analyzer>
</fieldType>

<field name="season" type="season_number" indexed="true" stored="false" multiValued="false"/>

<!-- season contains season numbers -->
<fieldType name="season_number" class="solr.TextField" omitNorms="true">
   <analyzer type="query">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.PatternReplaceFilterFactory" pattern=".*(?:season) *0*([0-9]+).*" replacement="$1"/>
   </analyzer>
</fieldType>


Our idea was to use a Keyword tokenizer and a regex on the "season" field
to extract the season number from the complete query.

However, we use an ExtendedDisMax query parser in our search handler:

<requestHandler name="/select" class="solr.SearchHandler">
   <lst name="defaults">
   <str name="defType">edismax</str>
   <str name="qf">
   title season
   </str>
   </lst>
</requestHandler>


The problem is that eDisMax tokenizes the query, so that our "season"
field receives the tokens [Foo, season, 1] without any order,
instead of the complete query.

How can we pass the complete (untokenized) query to the "season" field? We
don't understand which tokenizer is used here and why our "season" field
receives tokens instead of the complete query.

Or is there another approach to solve this use case with Solr?

Thanks,
Mirko

Re: How to retain the original format of input document in search results in SOLR - Tomcat

2013-11-21 Thread Erick Erickson

Solr (actually Lucene) stores the input _exactly_ as it is entered, and
returns it the same way.

What you're seeing is almost certainly your display mechanism interpreting
the results. Whitespace is notoriously variable in terms of how it's
displayed by various interpretations of the standard. For instance, HTML
often just eats
whitespace.
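
If the results end up in an HTML page, the line breaks have to be made visible
at display time. A small Python sketch of that step, assuming the stored value
still contains its newline characters; the function name is made up for
illustration.

```python
import html


def render_field_as_html(stored_value):
    """Escape a stored field value and turn its line breaks into <br> tags."""
    escaped = html.escape(stored_value)
    return "<br>\n".join(escaped.splitlines())


if __name__ == "__main__":
    stored = ("Title1: descriptions of the title\n"
              "Title2 : description of the title2\n"
              "Title3 : description of title3")
    print(render_field_as_html(stored))
```

Wrapping the escaped value in a pre element, or styling it with the CSS rule
white-space: pre-wrap, has a similar effect without inserting tags.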




On Thu, Nov 21, 2013 at 1:33 AM, ramesh py pyrames...@gmail.com wrote:

 Hi All,

 I am new to Apache Solr. Recently I was able to configure Solr with
 Tomcat successfully, and it is working fine except for the format of the
 search results: the results are not displayed in the same format as the
 input document.

 I am doing the following:

 1. Indexing the XML file into Solr.

 2. The format of the XML is as below:

 <doc>
 <field name="F1">some text</field>
 <field name="F2"> Title1: descriptions of the title
 Title2 : description of the title2
 Title3 : description of title3
 </field>
 <field name="F3">some text </field>
 </doc>

 3. After indexing, the results are displayed in the format below:

 F1: some text
 F2: Title1: descriptions of the title Title2 : description of the title2
 Title3 : description of title3
 F3: some text

 Expected result:

 F1: some text
 F2: Title1: descriptions of the title
     Title2 : description of the title2
     Title3 : description of title3
 F3: some text

 Looking at the F2 field, the format is getting changed: the input for the
 F2 field has each subtitle on its own line, but in the result it is
 displayed as a single line.

 Whenever a subtitle occurs in the XML file for any field, I would like
 that subtitle to be displayed on the next line in the results.

 Can anyone please help with this? Thanks in advance.

 Regards,
 Ramesh p.y

 --
 Ramesh P.Y
 pyrames...@gmail.com
 Mobile No:+91-9176361984

Re: search with wildcard

2013-11-21 Thread Ahmet Arslan

Hi Andreas,

If you don't want to use wildcards at query time, an alternative is to use 
NGrams at indexing time. This will produce a lot of tokens.
For example 4grams of your example : Supertestplan = supe uper pert erte rtes 
*test* estp stpl tpla plan
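
The list above can be reproduced with a few lines of Python. This is only a
sketch of the idea, not Lucene's implementation: the function name is invented
and the order in which the real NGram filter emits its tokens is not claimed
here.

```python
def ngrams(token, min_gram, max_gram):
    """Return all character n-grams of a lowercased token, grouped by size."""
    token = token.lower()
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


if __name__ == "__main__":
    four_grams = ngrams("Supertestplan", 4, 4)
    print(" ".join(four_grams))
    print("test" in four_grams)
```

If the n-gram filter is applied at index time only, a plain query term such as
test can then match inside Supertestplan without wildcards, at the cost of a
much larger index.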


Is that what you want? By the way, why do you want to search inside of words?

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="4"/>




On Thursday, November 21, 2013 5:23 PM, Andreas Owen a...@conx.ch wrote:
 
I suppose I have to create another field with different tokenizers and set
the boost very low so it doesn't really mess with my ranking, because the
word is then in 2 fields. What kind of tokenizer can do the job?




Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Michael Sokolov

OK - probably I should have said "A", or "&#97;" :)  My point was just 
that there is not really anything special about "special" characters.


On 11/21/2013 10:50 AM, Jack Krupansky wrote:

Would you store "a" as "&#65;"?

No, not in any case.

-- Jack Krupansky


RE: search with wildcard

2013-11-21 Thread Andreas Owen

I suppose I have to create another field with different tokenizers and set
the boost very low so it doesn't really mess with my ranking, because the
word is then in 2 fields. What kind of tokenizer can do the job?

From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Donnerstag, 21. November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard

I am querying "test" in Solr 4.3.1 over the field below and it's not finding
all occurrences. It seems that if it is a substring of a word like
"Supertestplan" it isn't found unless I use wildcards (*test*). This is
expected because of my tokenizer, but does someone know a way around this? I
don't want to add wildcards because that messes up queries with multiple
words.

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
  </analyzer>
</fieldType>

Re: search with wildcard

2013-11-21 Thread Jack Krupansky

You might be able to make use of the dictionary compound word filter, but 
you will have to build up a dictionary of words to use:


http://lucene.apache.org/core/4_5_1/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html

My e-book has some examples and a better description.

-- Jack Krupansky

-Original Message- 
From: Ahmet Arslan

Sent: Thursday, November 21, 2013 11:40 AM
To: solr-user@lucene.apache.org
Subject: Re: search with wildcard

Hi Andreas,

If you don't want to use wildcards at query time, an alternative is to use 
NGrams at indexing time. This will produce a lot of tokens.
For example, the 4-grams of your example: Supertestplan = supe uper pert erte 
rtes *test* estp stpl tpla plan

Is that what you want? By the way, why do you want to search inside of words?

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="4"/>





Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Dave Seltzer

Lots of questions. Okay.

In digging a little deeper and looking at the config I see that
<nrtMode>true</nrtMode> is commented out.  I believe this is the default
setting. So I don't know if NRT is enabled or not. Maybe just a red herring.

I don't know what Garbage Collector we're using. In this test I'm running
Solr 4.5.1 using Jetty from the example directory.

The CPU on the 8 nodes all stay around 70% use during the test. The nodes
have 28GB of RAM. Java is using about 6GB and the rest is being used by OS
cache.

To perform the test we're running 200 concurrent threads in JMeter. The
threads hit HAProxy which loadbalances the requests among the nodes. Each
query is for a random word out of a list of about 10,000 words. Some of the
queries have faceting turned on.

Because we're heavily loading the system the queries are returning quite
slowly. For a simple search, the average response time was 300ms. The peak
response time was 11,000ms. The spikes in latency seem to occur about every
2.5 minutes.

I haven't spent that much time messing with SolrConfig, so most of the
settings are the out-of-the-box defaults.

Where should I start to look?

Thanks so much!

-Dave





On Thu, Nov 21, 2013 at 6:53 PM, Mark Miller markrmil...@gmail.com wrote:

 Yes, more details…

 Solr version, which garbage collector, how does heap usage look, cpu, etc.

 - Mark


Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Jack Krupansky


there is not really anything special about special characters

Well, the distinction was about named entities, which are indeed special.

Besides, in general, for more sophisticated text processing, character 
types are a valid distinction.


But all of this begs the question of the original question: "I need to store 
the HTML Entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather 
than storing the original value."


Maybe the original poster could clarify the nature of their need.

-- Jack Krupansky

-Original Message- 
From: Michael Sokolov

Sent: Thursday, November 21, 2013 11:37 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

OK - probably I should have said "A", or "&#97;" :)  My point was just
that there is not really anything special about "special" characters.

On 11/21/2013 10:50 AM, Jack Krupansky wrote:

Would you store a as #65; ?

No, not in any case.

-- Jack Krupansky

-Original Message- From: Michael Sokolov
Sent: Thursday, November 21, 2013 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as ™ (HTML decimal entity)

I have to agree w/Walter.  Use unicode as a storage format.  The entity
encodings are for transfer/interchange.  Encode/decode on the way in and
out if you have to.  Would you store a as #65; ?  It makes it
impossible to search for, for one thing.  What if someone wants to
search for the TM character?

-Mike

On 11/20/13 12:07 PM, Jack Krupansky wrote:
AFAICT, it's not an extremely bad idea - using SGML/HTML as a format 
for storing text to be rendered. If you disagree - try explaining 
yourself.


But maybe TM should be encoded as trade;. Ditto for other named SGML 
entities.


-- Jack Krupansky

-Original Message- From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as ™ (HTML decimal entity)

Again, I'd like to know why this is wanted. It sounds like an X-Y, 
problem. Storing Unicode characters as XML/HTML encoded character 
references is an extremely bad idea.


wunder

On Nov 20, 2013, at 5:01 AM, Jack Krupansky j...@basetechnology.com 
wrote:


Any analysis filtering affects the indexed value only, but the stored 
value would be unchanged from the original input value. An update 
processor lets you modify the original input value that will be stored.


-- Jack Krupansky

-Original Message- From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charfilter in the analyzer queue for
indexing *and* searching, e.g.
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™"
replacement="&amp;#8482;" />
or
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-specials.txt" />

Uwe

Am 19.11.2013 23:46, schrieb Developer:

I have data coming in to SOLR as below.

<field name="displayName">X™ - Black</field>

I need to store the HTML entity (decimal) equivalent value (i.e. &#8482;)
in SOLR rather than storing the original value.

Is there a way to do this?




--
Walter Underwood
wun...@wunderwood.org

Re: confirm subscribe to solr-user@lucene.apache.org

2013-11-21 Thread Paule LECUYER


I confirm

.

How to implement a conditional copyField working for partial updates ?

2013-11-21 Thread Paule LECUYER



Hello,

I'm using Solr 4.x. In my solr schema I have the following fields defined :

  <field name="content" type="text_general" indexed="false"
         stored="true" multiValued="true" />
  <field name="all" type="text_general" indexed="true"
         stored="false" multiValued="true" termVectors="true" />
  <field name="eng" type="text_en" indexed="true" stored="false"
         multiValued="true" termVectors="true" />
  <field name="ita" type="text_it" indexed="true" stored="false"
         multiValued="true" termVectors="true" />
  <field name="fre" type="text_fr" indexed="true" stored="false"
         multiValued="true" termVectors="true" />

  ...
  <copyField source="content" dest="all"/>

To fill in the language specific fields, I use a custom update  
processor chain, with a custom ConditionalCopyProcessor that copies  
content field into appropriate language field, depending on document  
language (as explained in  
http://wiki.apache.org/solr/UpdateRequestProcessor).


Problem is this custom chain is applied on update request document,  
thus it works all right when inserting a new document, or updating the  
whole document, but I lose language specific fields when I do a  
partial update (as those fields are not stored, and as the request  
document contains only updated fields).


I would rather avoid setting the language specific fields to stored=true, as  
the content field may hold big values.


Is there a way to have solr execute my ConditionalCopyProcessor on the  
actual updated doc (the one resulting from solr retrieving all stored  
values and merging with update request values), and not on the request  
doc ?
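To make the difference concrete, here is a minimal sketch of the two behaviours, written in plain Python rather than as a Solr update processor. Documents are ordinary dicts, the `lang` field and the helper names `conditional_copy` and `merge_partial` are hypothetical, and nothing here is Solr API; it only illustrates why running the copy on the request document loses the language field.

```python
# Hypothetical sketch: documents are plain dicts. The "lang" field and the
# helper names are assumptions for illustration, not part of Solr.
LANG_FIELDS = {"en": "eng", "it": "ita", "fr": "fre"}


def conditional_copy(doc):
    """Copy 'content' into the language-specific field when both are known."""
    out = dict(doc)
    target = LANG_FIELDS.get(doc.get("lang"))
    if target and "content" in doc:
        out[target] = doc["content"]
    return out


def merge_partial(stored, update):
    """Simulate a partial update: stored fields overlaid with request fields."""
    merged = dict(stored)
    merged.update(update)
    return merged


stored = {"id": "1", "lang": "fr", "content": ["bonjour"], "title": "old"}
request = {"id": "1", "title": "new"}

# Running the copy on the request document alone: no content, so no "fre".
on_request = conditional_copy(request)

# Running the copy on the merged document: "fre" is rebuilt from stored content.
on_merged = conditional_copy(merge_partial(stored, request))
```

The second call is the behaviour being asked for: the processor sees the document after the stored values have been merged in.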


Thanks a lot for your help.

P. Lecuyer

.

Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Jack Krupansky

Ah... now I understand your perspective - you have taken a narrow view of 
what text is. A broader view is that it can contain formatting and special 
entities as well, or rich text in general. My read is that it all 
depends on the nature of the application and its requirements; there is no 
one-size-fits-all approach. The four main approaches are pure ASCII, 
Unicode/UTF-8, SGML entities for non-ASCII characters, and full HTML for 
formatting and rich text. Let the app's needs determine which is most 
appropriate for each piece of text.


The goal of SGML and HTML is not to hard-wire the final presentation, but 
simply to preserve some level of source format and structure, and then apply 
final presentation formatting on top of that.


Some apps may opt to store the same information in multiple formats, such as 
one for raw text search, one for basic display, and one for detail 
display.


I'm more of a platform guy than an app-specific guy - give the app 
developer tools that they can blend to meet their own requirements (or 
interests or tastes.)


But Solr users should make no mistake, SGML entities are a perfectly valid 
intermediate format for rich text.


-- Jack Krupansky

-Original Message- 
From: Walter Underwood

Sent: Thursday, November 21, 2013 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

And this is the exact problem. Some characters are stored as entities, some 
are not. When it is time to display, what else needs to be escaped? At a 
minimum, you would have to always store & as &amp; to avoid escaping the 
leading ampersand in the entities.


You could store every single character as a numeric entity. Or you could 
store every non-ASCII character as a numeric entity. Or every non-Latin1 
character. Plus ampersand, of course.


In these e-mails, we are distinguishing between ™ and &trade;. How would you 
do that? By storing &trade; as &amp;trade;.


To avoid all this double-think, always store text as Unicode code points, 
encoded with a standard Unicode method (UTF-8, etc.).
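The ambiguity described above can be seen in a few lines of Python using the standard `html` module. This is only an illustration of the escaping problem, not anything Solr does; the variable names are made up for the example.

```python
import html

# Two different stored forms decode to the same character, so the stored
# string no longer tells you what the user originally typed.
named = html.unescape("X&trade;")      # named entity
numeric = html.unescape("X&#8482;")    # decimal character reference

# To keep the literal text "&trade;" distinguishable from the character,
# the ampersand itself has to be escaped before storing.
escaped_literal = html.escape("&trade;")
round_trip = html.unescape(escaped_literal)

# Storing plain Unicode avoids the question: the string is the text.
plain = "X\u2122"
```

With Unicode storage, `plain` needs no decoding step before it can be searched or compared.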


When displaying, only make entities if the codepoints cannot be represented 
in the target character encoding. If you are sending things in US-ASCII, you 
will be sending lots of entities.


A good encoding library has callbacks for characters that cannot be 
represented. You can use these callbacks to format out-of-charset codepoints 
as entities. I've done this in product code, it really works.
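As one example of such a callback mechanism, Python's codecs accept an error handler name when encoding, and the built-in `xmlcharrefreplace` handler emits a numeric character reference for each code point the target encoding cannot represent. This is a sketch of the "encode at the edges" idea only, under the assumption that the text is stored as Unicode; it is not the library Walter used.

```python
# Text is stored as Unicode; entities are produced only when encoding for
# a target charset that cannot represent a code point.
text = "X\u2122 - Black"

# US-ASCII cannot represent the trademark sign, so it becomes &#8482;.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

# Latin-1 can represent e-acute directly but not the trademark sign.
latin1_bytes = "caf\u00e9 \u2122".encode("latin-1", errors="xmlcharrefreplace")

# UTF-8 can represent everything, so no entities are needed at all.
utf8_bytes = text.encode("utf-8")
```

The same stored value serves all three outputs; only the encoding step at display time differs.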


Finally, if you don't believe me, believe the XML Infoset, where numeric 
entities are always interpreted as Unicode code points.


The other way to go insane is storing local time in the database. Always 
store UTC and convert at the edges.


wunder


Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Michael Sokolov

I have to agree w/Walter.  Use Unicode as a storage format.  The entity 
encodings are for transfer/interchange.  Encode/decode on the way in and 
out if you have to.  Would you store a as &#65; ?  It makes it 
impossible to search for, for one thing.  What if someone wants to 
search for the TM character?


-Mike


Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Walter Underwood

I know all about formatted text -- I worked at MarkLogic. That is why I 
mentioned the XML Infoset.

Numeric entities are part of the final presentation, really, part of the 
encoding. They should never be stored. Always store the Unicode.

Numeric and named entities are a convenience for tools and encodings that can't 
handle Unicode. That is all they are.

wunder


Re: Indexing data to a specific collection in Solr 4.5.0

2013-11-21 Thread Reyes, Mark

So then,
$ java -Durl=http://localhost:8983/solr/collection2/update -jar post.jar
solr.xml monitor.xml
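Note that `url` is passed to post.jar as a Java system property, so the `-Durl=...` option has to come before `-jar`. For anyone posting without post.jar, the same thing can be sketched in Python with only the standard library. The helper names `update_url` and `post_xml` are made up for this example, and the base URL assumes the tutorial's default local instance.

```python
import urllib.request


def update_url(base_url, collection):
    """Build the update-handler URL for one collection."""
    return f"{base_url.rstrip('/')}/{collection}/update"


def post_xml(path, url):
    """POST one XML file to the given update URL and return the HTTP status."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Assumed base URL from the tutorial; only collection2 is targeted.
url = update_url("http://localhost:8983/solr", "collection2")

# Usage against a running instance (not executed here):
# for name in ("solr.xml", "monitor.xml"):
#     post_xml(name, url + "?commit=true")
```

Because the collection name is part of the URL path, documents sent this way never touch collection1.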





On 11/21/13, 8:14 AM, xiezhide xiezh...@gmail.com wrote:


Add -Durl=http://localhost:8983/solr/collection2/update when running post.jar.
This e-mail was sent from 189 Mail

Reyes, Mark mark.re...@bpiedu.com wrote:

Hi all:

I’m currently on a Solr 4.5.0 instance and running this tutorial,
http://lucene.apache.org/solr/4_5_0/tutorial.html

My question is specific to indexing data as proposed from this tutorial,

$ java -jar post.jar solr.xml monitor.xml

The tutorial advises to validate from your localhost,
http://localhost:8983/solr/collection1/select?q=solr&wt=xml

However, what if my Solr instance has both a collection1 and a collection2,
yet I want the XML files to be posted to collection2 only?

If possible, please advise.

Thanks,
Mark

IMPORTANT NOTICE: This e-mail message is intended to be received only by
persons entitled to receive the confidential information it may contain.
E-mail messages sent from Bridgepoint Education may contain information
that is confidential and may be legally privileged. Please do not read,
copy, forward or store this message unless you are an intended recipient
of it. If you received this transmission in error, please notify the
sender by reply e-mail and delete the message and any attachments.



Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Jack Krupansky


Would you store a as &#65; ?

No, not in any case.

-- Jack Krupansky


RE: Periodic Slowness on Solr Cloud

2013-11-21 Thread Doug Turnbull

Dave, you might want to connect JVisualVM and see if there's any pattern
between latency and garbage collection. That's a frequent culprit for
periodic hits in latency.

More info here:
http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/jmx_connections.html

There are a couple of GC implementations in Java that can be tuned as needed.

With JVisualVM you can also add the MBeans plugin to get a ton of
performance stats out of Solr that might help debug latency issues.

Doug

Sent from my Windows Phone
From: Dave Seltzer
Sent: 11/21/2013 8:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Periodic Slowness on Solr Cloud

Lots of questions. Okay.

In digging a little deeper and looking at the config I see that
<nrtMode>true</nrtMode> is commented out.  I believe this is the default
setting. So I don't know if NRT is enabled or not. Maybe just a red herring.

I don't know what Garbage Collector we're using. In this test I'm running
Solr 4.5.1 using Jetty from the example directory.

The CPU on the 8 nodes all stay around 70% use during the test. The nodes
have 28GB of RAM. Java is using about 6GB and the rest is being used by OS
cache.

To perform the test we're running 200 concurrent threads in JMeter. The
threads hit HAProxy which loadbalances the requests among the nodes. Each
query is for a random word out of a list of about 10,000 words. Some of the
queries have faceting turned on.

Because we're heavily loading the system the queries are returning quite
slowly. For a simple search, the average response time was 300ms. The peak
response time was 11,000ms. The spikes in latency seem to occur about every
2.5 minutes.

I haven't spent that much time messing with SolrConfig, so most of the
settings are the out-of-the-box defaults.

Where should I start to look?

Thanks so much!

-Dave





On Thu, Nov 21, 2013 at 6:53 PM, Mark Miller markrmil...@gmail.com wrote:

 Yes, more details…

 Solr version, which garbage collector, how does heap usage look, cpu, etc.

 - Mark

 On Nov 21, 2013, at 6:46 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  How real time is NRT? In particular, what are your commit settings?
 
  And can you characterize periodic slowness? Queries that usually
  take 500ms now take 10s? Or 1s? How often? How are you measuring?
 
  Details matter, a lot...
 
  Best,
  Erick
 
 
 
 
  On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer dselt...@tveyes.com
 wrote:
 
  I'm doing some performance testing against an 8-node Solr cloud cluster,
  and I'm noticing some periodic slowness.
 
 
  http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png
 
  I'm doing random test searches against an Alias Collection made up of four
  smaller (monthly) collections. Like this:
 
  MasterCollection
  |- Collection201308
  |- Collection201309
  |- Collection201310
  |- Collection201311
 
  The last collection is constantly updated. New documents are being added at
  the rate of about 3 documents per second.
 
  I believe the slowness may be due to NRT, but I'm not sure. How should I
  investigate this?
 
  If the slowness is related to NRT, how can I alleviate the issue without
  disabling NRT?
 
  Thanks Much!
 
  -Dave

a function query of time, frequency and score.

2013-11-21 Thread sling

Hi, guys.

I indexed 1000 documents, which have fields like title, ptime and frequency.

The title is a text field, the ptime is a date field, and the frequency is an
int field.
The frequency value fluctuates: sometimes it is 0, and sometimes it is 999.

Now, in my app, the query works well with a function query. The function
query is implemented as the score multiplied by a decreasing date-weight
array.

However, I have no idea how to add the frequency to this formula...

So could someone give me a clue?

Thanks again!
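One possible shape for such a formula, sketched in Python so the arithmetic is easy to check. This is an assumption about what might work, not a recommendation from the thread: the date weight uses the same a / (m*x + b) shape as Solr's `recip` function, and the frequency is dampened with a base-10 logarithm so that a value of 999 does not swamp the text score. The names `combined_score` and `MS_PER_YEAR` are made up for the example.

```python
import math

MS_PER_YEAR = 3.16e10  # approximate number of milliseconds in a year


def recip(x, m, a, b):
    """Same shape as Solr's recip(x, m, a, b): a / (m * x + b)."""
    return a / (m * x + b)


def combined_score(score, age_ms, frequency):
    """Multiply the text score by a date weight and a dampened frequency weight."""
    # Date weight: 1.0 for a brand-new document, about 0.5 at one year old.
    date_weight = recip(age_ms, 1.0 / MS_PER_YEAR, 1.0, 1.0)
    # Frequency weight: 1.0 at frequency 0, growing slowly with frequency.
    freq_weight = 1.0 + math.log10(1.0 + frequency)
    return score * date_weight * freq_weight


fresh_popular = combined_score(2.0, age_ms=0, frequency=999)
old_unpopular = combined_score(2.0, age_ms=MS_PER_YEAR, frequency=0)
```

In Solr terms this would roughly correspond to multiplying the score by `recip(ms(NOW,ptime),3.16e-11,1,1)` and by a log-dampened function of the frequency field; the exact weights are something to tune against real data.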

sling



--
View this message in context: 
http://lucene.472066.n3.nabble.com/a-function-query-of-time-frequency-and-score-tp4102531.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Doug Turnbull

Additional info on GC selection
http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#available_collectors

If response time is more important than overall throughput and garbage
collection pauses must be kept shorter than approximately one second, then
select the concurrent collector with -XX:+UseConcMarkSweepGC. If only one
or two processors are available, consider using incremental mode, described
below.

I'm not entirely certain of the implications of GC tuning for SolrCloud. I
imagine distributed searching is going to be as slow as the slowest core
being queried.

I'd also be curious as to the root-cause of any excess GC churn. It sounds
like you're doing a ton of random queries. This probably creates a lot of
evictions in your caches. There's nothing really worth caching, so the caches
fill up and empty frequently, causing a lot of heap activity. If you expect
to have high-load and a ton of turnover in queries, then tuning down cache
size might help minimize GC churn.
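A rough way to see why random queries defeat a cache is to simulate one. The sketch below models a plain LRU cache in Python; it is not Solr's cache implementation, and the cache size of 512 is an assumed value chosen for illustration. The 10,000-term vocabulary matches the test described in this thread.

```python
import random
from collections import OrderedDict


def simulate_hit_ratio(vocabulary_size, cache_size, queries, seed=0):
    """Return the hit ratio of an LRU cache under uniformly random queries."""
    rng = random.Random(seed)
    cache = OrderedDict()
    hits = 0
    for _ in range(queries):
        term = rng.randrange(vocabulary_size)
        if term in cache:
            hits += 1
            cache.move_to_end(term)        # mark as most recently used
        else:
            cache[term] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / queries


# 10,000 possible query terms, an assumed cache of 512 entries.
ratio = simulate_hit_ratio(vocabulary_size=10_000, cache_size=512, queries=50_000)

# For comparison: a vocabulary that fits entirely in the cache.
ratio_small_vocab = simulate_hit_ratio(vocabulary_size=100, cache_size=512, queries=10_000)
```

With uniformly random terms the hit ratio stays close to cache size divided by vocabulary size, so nearly every query inserts an entry and evicts another, which is the heap churn described above.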

Solr Meter is another great tool for your perf testing that can help get at
some of these caching issues. It gives you some higher-level stats about
cache eviction, etc.
https://code.google.com/p/solrmeter/

-Doug


--
Doug Turnbull
Search Big Data Architect
OpenSource Connections http://o19s.com

Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Dave Seltzer

Thanks Doug!

One thing I'm not clear on is how I would know whether this is in fact related
to garbage collection. If you're right, and the cluster is only as slow as its
slowest link, how do I determine that this is GC? Do I have to run the
profiler on all eight nodes?

Or is it a matter of turning on the correct logging and then watching and
waiting.
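If GC logging is turned on for each node (on HotSpot JVMs of this era, flags such as -verbose:gc, -XX:+PrintGCDetails and -XX:+PrintGCDateStamps), the pause times can be compared against the slow requests seen by the load test. The Python sketch below assumes both have already been parsed into simple tuples; the sample numbers and the helper names are hypothetical and only show the comparison itself.

```python
def overlaps_pause(timestamp, pauses):
    """True if the timestamp falls inside any (start, duration) pause window."""
    return any(start <= timestamp <= start + duration for start, duration in pauses)


def slow_fraction_in_gc(requests, pauses, slow_ms=1000):
    """Fraction of slow requests whose start time falls inside a GC pause."""
    slow = [ts for ts, latency_ms in requests if latency_ms >= slow_ms]
    if not slow:
        return 0.0
    return sum(overlaps_pause(ts, pauses) for ts in slow) / len(slow)


# Hypothetical data, in seconds since the start of the test.
pauses = [(150.0, 9.0), (300.0, 10.0)]   # (pause start, pause duration)
requests = [                             # (request start, latency in ms)
    (10.0, 280),
    (152.0, 11000),
    (200.0, 310),
    (305.0, 9500),
    (400.0, 1200),
]

fraction = slow_fraction_in_gc(requests, pauses)
```

A fraction near 1.0 on a given node would point at GC on that node; a fraction near 0.0 suggests looking elsewhere, for example at commits.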

Thanks!

-D


Re: SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4

2013-11-21 Thread RadhaJayalakshmi

Thanks Shawn for your response.
So, from your email, it seems that unique_key validation is handled
differently from other field validation.
But what I am not clear about is what the unique_key has to do with finding
the live server.
Because if there is any mismatch in the unique_key, it throws a
SolrServerException saying "No live servers found". But live servers
are sourced from the clusterstate in ZooKeeper, whereas I feel the unique key is
particular to a core/index.
So I am looking to understand the nature of this exception. Please explain how
unique_key and live servers are related.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrServerException-while-adding-an-invalid-UNIQUE-KEY-in-solr-4-4-tp4102346p4102533.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Shawn Heisey

On 11/21/2013 6:41 PM, Dave Seltzer wrote:
 In digging a little deeper and looking at the config I see that
 <nrtMode>true</nrtMode> is commented out.  I believe this is the default
 setting. So I don't know if NRT is enabled or not. Maybe just a red herring.

I had never seen this setting before.  The default is true.  SolrCloud
requires that it be set to true.  Looks like it's a new parameter in
4.5, added by SOLR-4909.  From what I can tell reading the issue,
turning it off effectively disables soft commits.

https://issues.apache.org/jira/browse/SOLR-4909

You've said that you are adding about 3 documents per second, but you
haven't said anything about how often you are doing commits.  Erick's
question basically boils down to this:  How quickly after indexing do
you expect the changes to be visible on a search, and how often are you
doing commits?

Generally speaking (and ignoring the fact that nrtMode now exists), NRT
is not something you enable, it's something you try to achieve, by using
soft commits quickly and often, and by adjusting the configuration to
make the commits go faster.

If you are trying to keep the interval between indexing and document
visibility down to less than a few seconds (especially if it's less than
one second), then you are trying to achieve NRT.

There's a lot of information on the following wiki page about
performance problems.  This specific link is to the last part of that
page, which deals with slow commits:

http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_commits

 I don't know what Garbage Collector we're using. In this test I'm running
 Solr 4.5.1 using Jetty from the example directory.

If you aren't using any tuning parameters beyond setting the max heap,
then you are using the default parallel collector.  It's a poor choice
for Solr unless your heap is very small.  At 6GB, yours isn't very
small.  It's not particularly huge either, but not small.
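
If it is unclear which collector a JVM ended up with, the standard java.lang.management API can report it. The following is a minimal sketch, not something from the original thread: the class name GcInfo is made up for illustration, and it reports on the JVM it runs in, so to inspect Solr itself the same MXBeans would have to be read over JMX (for example with jconsole). On HotSpot, names such as "PS Scavenge" and "PS MarkSweep" usually indicate the parallel collector, while "ParNew" and "ConcurrentMarkSweep" indicate CMS.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper (hypothetical name): lists the garbage collectors
// registered in the JVM that executes this code.
final class GcInfo {

    /** Returns the names of the collectors active in the current JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Print each collector with its cumulative collection count and time.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```

Watching the collection time grow between two readings taken a few minutes apart gives a rough idea of whether long pauses line up with the latency spikes.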

 The CPU on the 8 nodes all stay around 70% use during the test. The nodes
 have 28GB of RAM. Java is using about 6GB and the rest is being used by OS
 cache.

How big is your index?  If it's larger than about 30 GB, you probably
need more memory.  If it's much larger than about 40 GB, you definitely
need more memory.

 To perform the test we're running 200 concurrent threads in JMeter. The
 threads hit HAProxy which loadbalances the requests among the nodes. Each
 query is for a random word out of a list of about 10,000 words. Some of the
 queries have faceting turned on.

That's a pretty high query load.  If you want to get anywhere near top
performance out of it, you'll want to have enough memory to fit your
entire index into RAM.  You'll also need to reduce the load introduced
by indexing.  A large part of the load from indexing comes from commits.

 Because we're heavily loading the system the queries are returning quite
 slowly. For a simple search, the average response time was 300ms. The peak
 response time was 11,000ms. The spikes in latency seem to occur about every
 2.5 minutes.

I would bet that you're having one or both of the following issues:

1) Garbage collection issues from one or more of the following:
 a) Heap too small.
 b) Using the default GC instead of CMS with tuning.
2) General performance issues from one or more of the following:
 a) Not enough cache memory for your index size.
 b) Too-frequent commits.
 c) Commits taking a lot of time and resources due to cache warming.

With a high query and index load, any problems become magnified.

 I haven't spent that much time messing with SolrConfig, so most of the
 settings are the out-of-the-box defaults.

The defaults are very good for small to medium indexes and low to medium
query load.  If you have a big index and/or high query load, you'll
generally need to tune.

Thanks,
Shawn

Re: Best implementation for multi-price store?

2013-11-21 Thread Alejandro Marqués Rodríguez

Hi Robert,

That was the idea: dynamic fields, so that, as you said, it is easier to sort
and filter. Besides, with dynamic fields it would be easier to add new
stores, as I wouldn't have to modify the schema :)
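
The client-side logic this implies could look roughly like the sketch below. It is only an illustration under assumptions: the class and method names are made up, the field name "store" comes from the store:store1 filter mentioned in the quoted reply, and the per-store field names (store1_price, store1_units) assume a *_price / *_units dynamic field pattern in the schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper (hypothetical names): derives per-store field names
// and the matching Solr request parameters for a store-scoped search.
final class StoreQuery {

    /** Builds a per-store field name, e.g. ("store1", "price") -> "store1_price". */
    static String storeField(String storeId, String attribute) {
        return storeId + "_" + attribute;
    }

    /** Returns q/fq/sort parameters that filter on one store and sort by its own field. */
    static Map<String, String> params(String userQuery, String storeId,
                                      String attribute, boolean ascending) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", userQuery);
        p.put("fq", "store:" + storeId); // assumes a "store" field listing the stores of a product
        p.put("sort", storeField(storeId, attribute) + (ascending ? " asc" : " desc"));
        return p;
    }

    public static void main(String[] args) {
        // Cheapest first within store1, then most stock first within store2.
        System.out.println(params("laptop", "store1", "price", true));
        System.out.println(params("laptop", "store2", "units", false));
    }
}
```

Keeping the field-name convention in one place means a new store only needs new data, not new code or schema changes.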

Thanks for the answer!


2013/11/21 Petersen, Robert robert.peter...@mail.rakuten.com

 Hi,

 I'd go with (2) also but using dynamic fields so you don't have to define
 all the storeX_price fields in your schema but rather just one *_price
 field.  Then when you filter on store:store1 you'd know to sort with
 store1_price and so forth for units.  That should be pretty straightforward.

 Hope that helps,
 Robi

 -Original Message-
 From: Alejandro Marqués Rodríguez [mailto:
 amarq...@paradigmatecnologico.com]
 Sent: Thursday, November 21, 2013 1:36 AM
 To: solr-user@lucene.apache.org
 Subject: Best implementation for multi-price store?

 Hi,

 I've recently been asked to implement an application to search products from
 several stores, each store having different prices and stock for the same
 product.

 So I have products that have the usual fields (name, description, brand,
 etc) and also number of units and price for each store. I must be able to
 filter for a given store and order by stock or price for that store. The
 application should also allow increasing the number of stores, store-dependent
 fields, and products without much work.

 The numbers for the application are more or less 100 stores and 7M
 products.

 I've been thinking of some ways of defining the index structure but I
 don't know which one is better, as I think each one has its pros and cons.


 1. *Each product-store as a document:* Denormalizing the information so
    that for every product and store I have a different document. Pros are
    that I can filter and order without problems and that adding a new
    store-dependent field is very easy. Cons are that the index goes from
    7M documents to 700M and that most of the info is redundant, as most of
    the fields are repeated among stores.
 2. *Each field-store as a field:* For example, for price I would have
    store1_price, store2_price, and so on. Pros are that the index stays at
    7M documents, and I can still filter and sort by those fields. Cons are
    that I have to add some logic so that if I filter by one store I order
    by the associated price field, and that the number of fields grows as
    the number of store-dependent fields x the number of stores. I don't
    know if having more fields affects performance, but adding new
    store-dependent fields will increase the number of fields even more.
 3. *Join:* The first time I read about Solr joins I thought it was the way
    to go in this case, but after reading a bit more and doing some tests
    I'm not so sure about it... Maybe I've done it wrong, but I think it
    also denormalizes the info (so I would also have 700M documents), and
    besides I can't order or filter by store fields.


 I must say my preferred option is number 2, so I don't duplicate
 information, I keep a relatively small number of documents and I can filter
 and sort by the store fields. However, my main concern here is I don't know
 if having too many fields in a document will be harmful to performance.

 Which one do you think is the best approach for this application? Is there
 a better approach that I have missed?

 Thanks in advance



 --
 Alejandro Marqués Rodríguez

 Paradigma Tecnológico
 http://www.paradigmatecnologico.com
 Avenida de Europa, 26. Ática 5. 3ª Planta
 28224 Pozuelo de Alarcón
 Tel.: 91 352 59 42





Re: SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4

2013-11-21 Thread Shawn Heisey

On 11/21/2013 9:51 PM, RadhaJayalakshmi wrote:
 Thanks Shawn for your response.
 So, from your email, it seems that unique_key validation is handled
 differently from other field validation.
 But what I am not clear about is what the unique_key has to do with finding
 the live server.
 Because if there is any mismatch in the unique_key, it throws a
 SolrServerException saying "No live servers found". Live servers
 are sourced from the ZooKeeper cluster state, whereas I feel the unique key is
 particular to a core/index.
 So I am looking to understand the nature of this exception. Please explain how
 unique_key and live servers are related.

It's the HTTP error code, 500, which means internal server error.  SolrJ
interprets this to mean that there's something wrong with that server,
which is what the HTTP protocol specification says it must do.  That
makes it try the next server.  Because the problem is not actually a
server issue, the next server returns the same error.  This continues
until it's tried them all and gives up.

The validation for other fields returns a different error, one that
SolrJ interprets as a problem with the request, so it doesn't try other
servers.
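
To make the behaviour described above concrete, here is a simplified illustration of that kind of failover loop. It is not SolrJ's actual code: the Server interface, the class name, and the status handling are assumptions made for the example. A 5xx response is treated as a fault of that particular server, so the next one is tried; any other response is returned immediately.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration of client-side failover (hypothetical names,
// not the real SolrJ implementation).
final class FailoverSketch {

    /** Stand-in for one Solr node: returns the HTTP status for a request. */
    interface Server {
        int send(String request);
    }

    /** Tries each server in order; gives up only after all of them returned 5xx. */
    static int sendWithFailover(List<Server> servers, String request) {
        for (Server server : servers) {
            int status = server.send(request);
            if (status >= 500) {
                continue; // treated as a fault of this server: try the next one
            }
            return status; // success or a 4xx request error: no further attempts
        }
        throw new IllegalStateException("No live servers available to handle this request");
    }

    public static void main(String[] args) {
        // Every node rejects the bad unique key with 500, so every node is
        // tried and the caller ends up with a "no live servers" style error.
        Server rejectsKey = request -> 500;
        try {
            sendWithFailover(Arrays.asList(rejectsKey, rejectsKey, rejectsKey), "add doc");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a 400 response instead of 500, the loop stops at the first server, which matches the behaviour reported for the other fields.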

Strictly speaking, Solr probably should not return error 500 for unique
key validation issues, which makes this a minor bug.  The actual results
are correct, because the update fails and the application is notified.
If all possible exceptions are caught, then it all works correctly.

Thanks,
Shawn
