Re: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-25 Thread Jamie Johnson
To be clear, we are using payloads as a way to attach authorizations to
individual tokens within Solr.  The payloads are normal Solr payloads,
though we are not using floats; we are using the identity payload encoder
(org.apache.lucene.analysis.payloads.IdentityEncoder), which allows for
storing a byte[] of our choosing in the payload field.

This works great for text, but now that I'm indexing more than just text I
need a way to specify the payload on the other field types.  Does that make
more sense?
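
For context, a minimal sketch of the kind of field type we use on the text
side (the field/type names and delimiter here are illustrative, not our
exact config):

    <fieldType name="auth_text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- input like token|AUTH; the identity encoder keeps the payload bytes verbatim -->
        <filter class="solr.DelimitedPayloadTokenFilterFactory"
                delimiter="|" encoder="identity"/>
      </analyzer>
    </fieldType>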

On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson erickerick...@gmail.com
wrote:

 This really sounds like an XY problem. Or when you use
 "payload" it's not the Solr payload.

 So Solr Payloads are a float value that you can attach to
 individual terms to influence the scoring. Attaching the
 _same_ payload to all terms in a field is much the same
 thing as boosting on any matches in the field at query time
 or boosting on the field at index time (this latter assuming
 that different docs would have different boosts).

 So can you back up a bit and tell us what you're trying to
 accomplish? Maybe we can be sure we're both talking about
 the same thing ;)

 Best,
 Erick

 On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote:
  I would like to specify a particular payload for all tokens emitted from
 a
  tokenizer, but don't see a clear way to do this.  Ideally I could specify
  that something like the DelimitedPayloadTokenFilter be run on the entire
  field and then standard analysis be done on the rest of the field, so in
  the case that I had the following text
 
  "this is a test\Foo"
 
  I would like to create tokens "this", "is", "a", "test", each with a
 payload
  of "Foo".  From what I'm seeing, though, only "test" gets the payload.  Is
 there
  any way to accomplish this or will I need to implement a custom tokenizer?



Re: testing with EmbeddedSolrServer

2015-08-25 Thread Mikhail Khludnev
Hello,

I'm trying to guess what you are doing. It's not clear so far.
I found http://stackoverflow.com/questions/11951695/embedded-solr-dih
My conclusion: if you play with DIH and EmbeddedSolrServer, you'd better
avoid the third beast; you don't need to bother with tests.
I guess that main() exits while DIH runs in a background thread. You need
to loop the status command until the import is over, or add the synchronous=true
parameter to the full-import command; it should switch to synchronous mode:
https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DataImportHandler.java#L199
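
A rough SolrJ sketch of the synchronous variant (the handler path, params
and client variable are assumptions; exception handling elided):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("command", "full-import");  // kick off DIH
    params.set("synchronous", "true");     // block until the import finishes
    params.set("commit", "true");
    QueryRequest req = new QueryRequest(params);
    req.setPath("/dataimport");            // wherever DIH is registered
    solrClient.request(req);               // e.g. your EmbeddedSolrServer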

Take care


On Tue, Aug 25, 2015 at 4:41 PM, Moen Endre endre.m...@imr.no wrote:

 Is there an example of integration-testing with EmbeddedSolrServer that
 loads data from a DataImportHandler and then queries the data? I've tried
 doing this based on
 org.apache.solr.client.solrj.embedded.TestEmbeddedSolrServerConstructors.

 But no data is being imported.  Here is the test class I've tried:
 https://gist.github.com/emoen/5d0a28df91c4c1127238

 I've also tried writing a test by extending AbstractSolrTestCase, but
 haven't gotten it working. I've documented some of the log output here:
 http://stackoverflow.com/questions/32052642/solrcorestate-already-closed-with-unit-test-using-embeddedsolrserver-v-5-2-1

 Should I extend AbstractSolrTestCase or SolrTestCaseJ4 when writing tests?

 Cheers
 Endre




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


attribute based recommender with solr

2015-08-25 Thread ReconX92
Hey Guys,
I wanted to create a simple, attribute-based food recommender with Solr.

The User makes his choice concerning ingredients, cooking time, difficulty
and so on.
It is based on a SQL database where the recipes are stored.
So if, for example, the user likes tomatoes, then recipes with tomatoes
should be boosted and ranked higher.
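Something like an edismax boost query is what I imagine (purely
illustrative field names):

    q=*:*&defType=edismax&bq=ingredients:tomatoes^2.0&bq=difficulty:easy^1.5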
Sounds easy, but the sources I found are pretty shallow and theoretical.
They don't really help.

Maybe someone has done this before and is willing to help me :D I would also
be happy about some good sources; I have already been researching for hours.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/attribute-based-recommender-with-solr-tp4225186.html
Sent from the Solr - User mailing list archive at Nabble.com.


Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-25 Thread Jamie Johnson
I would like to specify a particular payload for all tokens emitted from a
tokenizer, but don't see a clear way to do this.  Ideally I could specify
that something like the DelimitedPayloadTokenFilter be run on the entire
field and then standard analysis be done on the rest of the field, so in
the case that I had the following text

"this is a test\Foo"

I would like to create tokens "this", "is", "a", "test", each with a payload
of "Foo".  From what I'm seeing, though, only "test" gets the payload.  Is there
any way to accomplish this or will I need to implement a custom tokenizer?


Re: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-25 Thread Erick Erickson
This really sounds like an XY problem. Or when you use
"payload" it's not the Solr payload.

So Solr Payloads are a float value that you can attach to
individual terms to influence the scoring. Attaching the
_same_ payload to all terms in a field is much the same
thing as boosting on any matches in the field at query time
or boosting on the field at index time (this latter assuming
that different docs would have different boosts).

So can you back up a bit and tell us what you're trying to
accomplish? Maybe we can be sure we're both talking about
the same thing ;)

Best,
Erick

On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote:
 I would like to specify a particular payload for all tokens emitted from a
 tokenizer, but don't see a clear way to do this.  Ideally I could specify
 that something like the DelimitedPayloadTokenFilter be run on the entire
 field and then standard analysis be done on the rest of the field, so in
 the case that I had the following text

 "this is a test\Foo"

 I would like to create tokens "this", "is", "a", "test", each with a payload
 of "Foo".  From what I'm seeing, though, only "test" gets the payload.  Is there
 any way to accomplish this or will I need to implement a custom tokenizer?


Re: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-25 Thread Erick Erickson
Oh My. What fun!

bq: I need a way to specify the payload on the other field types

Not to my knowledge. The payload mechanism is built on
the capability of having a filter in the analysis chain. And there's
no analysis chain with primitive types (string, numeric and the like).

Hmmm. Totally off the top of my head, but I wonder if you could
use a Binary type and customize all the reading to spoof
whatever primitive types you wanted while respecting your
auth tokens?

Best,
Erick


On Tue, Aug 25, 2015 at 10:37 AM, Jamie Johnson jej2...@gmail.com wrote:
 To be clear, we are using payloads as a way to attach authorizations to
 individual tokens within Solr.  The payloads are normal Solr payloads,
 though we are not using floats; we are using the identity payload encoder
 (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for
 storing a byte[] of our choosing into the payload field.

 This works great for text, but now that I'm indexing more than just text I
 need a way to specify the payload on the other field types.  Does that make
 more sense?

 On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 This really sounds like an XY problem. Or when you use
 "payload" it's not the Solr payload.

 So Solr Payloads are a float value that you can attach to
 individual terms to influence the scoring. Attaching the
 _same_ payload to all terms in a field is much the same
 thing as boosting on any matches in the field at query time
 or boosting on the field at index time (this latter assuming
 that different docs would have different boosts).

 So can you back up a bit and tell us what you're trying to
 accomplish? Maybe we can be sure we're both talking about
 the same thing ;)

 Best,
 Erick

 On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote:
  I would like to specify a particular payload for all tokens emitted from
 a
  tokenizer, but don't see a clear way to do this.  Ideally I could specify
  that something like the DelimitedPayloadTokenFilter be run on the entire
  field and then standard analysis be done on the rest of the field, so in
  the case that I had the following text
 
  "this is a test\Foo"
 
  I would like to create tokens "this", "is", "a", "test", each with a
 payload
  of "Foo".  From what I'm seeing, though, only "test" gets the payload.  Is
 there
  any way to accomplish this or will I need to implement a custom tokenizer?



Unknown query parser 'terms' with TermsComponent defined

2015-08-25 Thread P Williams
Hi,

We've encountered a strange situation, I'm hoping someone might be able to
shed some light. We're using Solr 4.9 deployed in Tomcat 7.

We build a query that has these params:

'params'=>{
  'fl'=>'id',
  'sort'=>'system_create_dtsi asc',
  'indent'=>'true',
  'start'=>'0',
  'q'=>'_query_:"{!raw f=has_model_ssim}Batch" AND ({!terms
f=id}ft849m81z)',
  'qt'=>'standard',
  'wt'=>'ruby',
  'rows'=>['1',
'1000']}},

And it responds with an error message:
'error'=>{

'msg'=>'Unknown query parser \'terms\'',
'code'=>400}}

The terms component is defined in solrconfig.xml:

  <searchComponent name="termsComponent" class="solr.TermsComponent" />

  <requestHandler name="/terms" class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="terms">true</bool>
    </lst>
    <arr name="components">
      <str>termsComponent</str>
    </arr>
  </requestHandler>

And the standard request handler is defined:
<requestHandler name="standard" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="defType">lucene</str>
  </lst>
</requestHandler>

In case it's useful, we have
<luceneMatchVersion>4.9</luceneMatchVersion>

Why would we be getting the Unknown query parser \'terms\' error?

Thanks,
Tricia


Re: how to prevent uuid-field changing in /update query?

2015-08-25 Thread Jack Krupansky
UUIDUpdateProcessorFactory - An update processor that adds a newly
generated UUID value to any document being added that does not already have
a value in the specified field.

See:
http://lucene.apache.org/solr/5_2_1/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html
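
A minimal chain sketch (the field name is illustrative); because the
processor only fills the field when it is absent, documents re-sent with
their existing UUID keep it:

    <updateRequestProcessorChain name="uuid" default="true">
      <processor class="solr.UUIDUpdateProcessorFactory">
        <str name="fieldName">my_uuid</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>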

-- Jack Krupansky

On Tue, Aug 25, 2015 at 4:22 AM, CrazyDiamond crazy_diam...@mail.ru wrote:

  I have a uuid field. It is not set as unique, but nevertheless I want it not
  to
  be changed every time I call /update. It might be because I added a
  request handler named /update which contains a uuid update chain. But if
  I do not do this I have no uuid at all. Maybe I can configure the uuid update chain
  to
  set the uuid only if it is blank?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Using copyField with dynamicField

2015-08-25 Thread Scott Dawson
Zach,
As an alternative to 'copyField', you might want to consider the
CloneFieldUpdateProcessorFactory:
http://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html

It supports specification of field names with regular expressions,
exclusion of specific fields that otherwise match the regex, etc.  Much
more flexible than copyField, in my opinion.
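
For example, a sketch that clones all *_text fields except one into text
(names illustrative):

    <updateRequestProcessorChain name="clone-text">
      <processor class="solr.CloneFieldUpdateProcessorFactory">
        <lst name="source">
          <str name="fieldRegex">.*_text</str>
          <lst name="exclude">
            <str name="fieldName">private_text</str>
          </lst>
        </lst>
        <str name="dest">text</str>
      </processor>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>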

Regards,
Scott

On Mon, Aug 24, 2015 at 10:39 PM, Erick Erickson erickerick...@gmail.com
wrote:

 What is reported in the Solr log? That's usually much more informative.

 Best,
 Erick

 On Mon, Aug 24, 2015 at 5:26 PM, Alexandre Rafalovitch
 arafa...@gmail.com wrote:
  It should work (at first glance). copyField does support wildcards.
 
  Do you have a field called "text"? Also, your field name and field
  type "text" have the same name; not sure that is the best idea.
 
  Regards,
 Alex.
  
  Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
  http://www.solr-start.com/
 
 
  On 24 August 2015 at 17:27, Zach Thompson z...@duckduckgo.com wrote:
  Hi All,
 
  Is it possible to use copyField with dynamicField?  I was trying to do
  the following,
 
  <dynamicField name="*_text" type="text" indexed="true" stored="true"/>
  <copyField source="*_text" dest="text" maxChars="100" />
 
  and getting a 400 error on trying to copy the first dynamic field.
  Without the copyField the fields seem to load ok.
 
  --
Zach Thompson
z...@duckduckgo.com
 
 



RE: User Authentication

2015-08-25 Thread Davis, Daniel (NIH/NLM) [C]
We use CAS as well, and are also not using ZooKeeper/SolrCloud.   We may move 
to SolrCloud after getting our current very-basic setup into production.
We'll definitely take a look at the rule-based authorization plugin and see how 
we can leverage that.

-Original Message-
From: LeZotte, Tom [mailto:tom.lezo...@vanderbilt.edu] 
Sent: Monday, August 24, 2015 4:37 PM
To: solr-user@lucene.apache.org
Subject: Re: User Authentication

Bosco,

We use CAS for user authentication, not sure if we have Kerberos working 
anywhere. Also we are not using ZooKeeper, because we are only running one 
server currently.

thanks

Tom LeZotte
Health I.T. - Senior Product Developer
(p) 615-875-8830






On Aug 24, 2015, at 3:12 PM, Don Bosco Durai bo...@apache.org wrote:

Just curious, is Kerberos an option for you? If so, mostly all your 3 use cases
will be addressed.

Bosco


On 8/24/15, 12:18 PM, Steven White swhite4...@gmail.com wrote:

Hi Noble,

Is everything in the link you provided applicable to Solr 5.2.1?

Thanks

Steve

On Mon, Aug 24, 2015 at 2:20 PM, Noble Paul noble.p...@gmail.com wrote:

did you manage to look at the reference guide?
https://cwiki.apache.org/confluence/display/solr/Securing+Solr

On Mon, Aug 24, 2015 at 9:23 PM, LeZotte, Tom tom.lezo...@vanderbilt.edu 
wrote:
Alex
I got a super secret release of Solr 5.3.1, wasn't supposed to say anything.

Yes, I'm running 5.2.1; I will check out the release notes for 5.3.

Was looking for three types of user authentication, I guess.
1. the Admin Console
2. User auth for each Core (and select and update) on a server.
3. HTML interface access (example: ajax-solr
https://github.com/evolvingweb/ajax-solr)

Thanks

Tom LeZotte
Health I.T. - Senior Product Developer
(p) 615-875-8830






On Aug 24, 2015, at 10:05 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Thanks for the email from the future. It is good to start to prepare for 5.3.1 
now that 5.3 is nearly out.

Joking aside (and assuming Solr 5.2.1), what exactly are you trying to achieve? 
Solr should not actually be exposed to the users directly. It should be hiding 
in a backend only visible to your middleware. If you are looking for a HTML 
interface that talks directly to Solr after authentication, that's not the 
right way to set it up.

That said, some security features are being rolled out and you should 
definitely check the release notes for the 5.3.

Regards,
 Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 24 August 2015 at 10:01, LeZotte, Tom tom.lezo...@vanderbilt.edu
wrote:
Hi Solr Community

I have been trying to add user authentication to our Solr 5.3.1 RedHat install. 
I've found some examples on user authentication on the Jetty side.
But they have failed.

Does anyone have a step-by-step example of authentication for the admin
screen? And a core?


Thanks

Tom LeZotte
Health I.T. - Senior Product Developer
(p) 615-875-8830










--
-
Noble Paul



Re: Please answer my question on StackOverflow ... Best approach to guarantee commits in SOLR

2015-08-25 Thread Jack Krupansky
You could also look at an integrated product such as DataStax Enterprise
which fully integrates the Cassandra database and Solr - you execute your
database transactions in Cassandra and then DSE Search automatically
indexes the data in the embedded version of Solr.

See:
http://www.datastax.com/products/datastax-enterprise-search

About the only downside is that it is a proprietary product and the
integration is not open source.


-- Jack Krupansky

On Tue, Aug 25, 2015 at 10:15 AM, Upayavira u...@odoko.co.uk wrote:



 On Tue, Aug 25, 2015, at 01:21 PM, Simer P wrote:
 
 http://stackoverflow.com/questions/32138845/what-is-the-best-approach-to-guarantee-commits-in-apache-solr
  .
 
   *Question:* How can I get guaranteed commits with Apache SOLR where
   persisting data to disk and visibility are both equally important?
 
   *Background:* We have a website which requires high-end search
   functionality for machine learning and also requires guaranteed commits
   for
   financial transactions. We just want to use SOLR as our only datastore to keep
   things simple and *do not* want to use another database on the side.
 
  I can't seem to find any answer to this question. The simplest solution
  for
  a financial transaction seems to be to periodically query SOLR for the
   record after it has been persisted, but this can have a longer wait time, or
   is
   there a better solution?
 
  Can anyone please suggest a solution for achieving guaranteed commits
   with SOLR?

 Consider whether you are trying to use the wrong tool for the job. Solr
 does not offer per transaction guarantees. It is heavily optimised
 around high read/low write situations (i.e. more reads than writes). If
 you commit to disk too often, the implementation will be very
 inefficient (it will create lots of segments that need to be merged, and
 caches will become ineffective).

 Also, when you issue a commit, it commits all pending documents,
 regardless of who posted them to Solr. These do not sound like things
 that suit your application.

 There remains the possibility (even if extremely uncommon/unlikely) that
 a transaction could be lost were a server to die/loose power in the few
 seconds between a post and a subsequent commit.

 Personally, I'd use a more traditional database for the data, then also
 post it to Solr for fast search/faceting/etc as needed.

 But then, perhaps there's more to your usecase than I have so far
 understood.

 Upayavira



Re: Please answer my question on StackOverflow ... Best approach to guarantee commits in SOLR

2015-08-25 Thread Upayavira


On Tue, Aug 25, 2015, at 01:21 PM, Simer P wrote:
 http://stackoverflow.com/questions/32138845/what-is-the-best-approach-to-guarantee-commits-in-apache-solr
 .
 
 *Question:* How can I get guaranteed commits with Apache SOLR where
 persisting data to disk and visibility are both equally important?
 
 *Background:* We have a website which requires high-end search
 functionality for machine learning and also requires guaranteed commits
 for
 financial transactions. We just want to use SOLR as our only datastore to keep
 things simple and *do not* want to use another database on the side.
 
 I can't seem to find any answer to this question. The simplest solution
 for
 a financial transaction seems to be to periodically query SOLR for the
 record after it has been persisted, but this can have a longer wait time, or
 is
 there a better solution?
 
 Can anyone please suggest a solution for achieving guaranteed commits
 with SOLR?

Consider whether you are trying to use the wrong tool for the job. Solr
does not offer per transaction guarantees. It is heavily optimised
around high read/low write situations (i.e. more reads than writes). If
you commit to disk too often, the implementation will be very
inefficient (it will create lots of segments that need to be merged, and
caches will become ineffective).

Also, when you issue a commit, it commits all pending documents,
regardless of who posted them to Solr. These do not sound like things
that suit your application.

There remains the possibility (even if extremely uncommon/unlikely) that
a transaction could be lost were a server to die/lose power in the few
seconds between a post and a subsequent commit. 

Personally, I'd use a more traditional database for the data, then also
post it to Solr for fast search/faceting/etc as needed.

But then, perhaps there's more to your usecase than I have so far
understood.

Upayavira


Re: how to index document with multiple words (phrases) and words permutation?

2015-08-25 Thread simon
What you want to do is basically named entity recognition. We have quite a
similar use case (medical/scientific documents, need to look for disease
names /drug names /MeSH terms, etc).

Take a look at David Smiley's Solr Text Tagger (
https://github.com/OpenSextant/SolrTextTagger ) which we've been using with
some success for this task.

best

-Simon

On Mon, Aug 24, 2015 at 2:13 PM, afrooz afr.rahm...@gmail.com wrote:

 Thanks Erick,
 I will explain the detailed scenario so you might give me a solution:
 I want to annotate a medical document based on only a medical dictionary. I
 don't need to annotate the non-medical words of the document at all.
 The medical dictionary contains terms which consist of multiple words, and
 these terms all together have a specific medical meaning. For example, "back
 pain": "back" and "pain" are two separate words but together they have
 another meaning. These terms might be used in different orders in a
 sentence but all with the same meaning. E.g. "breast cancer" or "cancer in
 breast" should be considered the same...
 We also have terms of more than 6 words.

 So the question is that I have a document with around 700 words and I need
 to annotate this document based on a medical terminology of 3 million
 records.
 Any idea how to do this?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-index-document-with-multiple-words-phrases-and-words-permutation-tp4224919p4224970.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Behavior of grouping on a field with same value spread across shards.

2015-08-25 Thread Erick Erickson
That's not really the case. Perhaps you're confusing
group.ngroups and group.facet with just grouping?

See the ref guide:
https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats

Best,
Erick

On Tue, Aug 25, 2015 at 4:51 AM, Modassar Ather modather1...@gmail.com wrote:
 Hi,

 As per my understanding, to group on a field all documents with the same
 value in the field have to be in the same shard.

 Can we group by a field where the documents with the same value in that
 field will be distributed across shards?
  Please let me know the limitations, unavailable features, or
  performance issues for such fields.

 Thanks,
 Modassar


Re: Query timeAllowed and its behavior.

2015-08-25 Thread Shawn Heisey
On 8/25/2015 3:18 AM, Modassar Ather wrote:
 Kindly help me understand the query time allowed attribute. The following
 is set in solrconfig.xml.
 <int name="timeAllowed">30</int>

 Does this setting stop the query from running after the timeAllowed is
 reached? If not is there a way to stop it as it will occupy resources in
 background for no benefit.

That is certainly the *goal* of timeAllowed ... but mostly it serves as
a way to try and offer a guarantee that a query will not take longer
than a certain amount of time, so your user application will receive a
response, which might be an error or negative response, within that
stated timeframe.  Multithreaded programming is tricky in the best
circumstances.  If you introduce the idea of killing threads into the
mix, it becomes REALLY complicated.  I would not be very surprised to
learn that parts of the query which run in parallel, such as the filter
queries, continue to run in the background and populate caches even if
the user query has been aborted because of timeAllowed.

You could open a feature request issue in Jira, but I suspect that
aborting *everything* for timeAllowed is a really hard problem that
nobody wants to tackle.  If you can figure out how to solve it, your
patch will be reviewed and possibly committed.

Thanks,
Shawn



Re: splitting shards on 4.7.2 with custom plugins

2015-08-25 Thread Anshum Gupta
Can you elaborate a bit more on the setup: what do the custom plugins do,
what error do you get? It seems like a classloader/classpath issue to me,
which doesn't really relate to shard splitting.


On Tue, Aug 25, 2015 at 7:59 PM, Jeff Courtade courtadej...@gmail.com
wrote:

 I am getting failures when trying to split shards on Solr 4.7.2 with
 custom plugins.

 It fails regularly; it cannot find the jar files for the plugins when creating
 the new cores/shards.

 Ideas?

 --
 Thanks,

 Jeff Courtade
 M: 240.507.6116




-- 
Anshum Gupta


Re: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-25 Thread Jamie Johnson
Looks like I have something basic working for Trie fields.  I am doing
exactly what I said in my previous email, so good news there.  I think this
is a big step as there are only a few field types left that I need to
support, those being date (should be similar to Trie) and spatial fields,
which at a glance look like they provide a way to supply the token stream
through an extension.  Definitely need to look more though.

All of this said though, is this really the right way to get payloads into
these types of fields?  Should a jira feature request be added for this?
On Aug 25, 2015 8:13 PM, Jamie Johnson jej2...@gmail.com wrote:

 Right, I had assumed (obviously here is my problem) that I'd be able to
 specify payloads for the field regardless of the field type.  Looking at
 TrieField that is certainly non-trivial.  After a bit of digging it appears
 that if I wanted to do something here I'd need to build a new TrieField,
 override createField and provide a Field that would return something like
 NumericTokenStream but also provide the payloads.  Like you said sounds
 interesting to say the least...

 Were payloads not really intended to be used for these types of fields
 from a Lucene perspective?


 On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 Well, you're going down a path that hasn't been trodden before ;).

 If you can treat your primitive types as text types you might get
 some traction, but that makes a lot of operations like numeric
 comparison difficult.

 H. another idea from left field. For single-valued types,
 what about a sidecar field that has the auth token? And even
 for a multiValued field, two parallel fields are guaranteed to
 maintain order so perhaps you could do something here. Yes,
 I'm waving my hands a LOT here.

 I suspect that trying to have a custom type that incorporates
 payloads for, say, trie fields will be interesting to say the least.
 Numeric types are packed to save storage etc. so it'll be
 an adventure..

 Best,
 Erick

 On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson jej2...@gmail.com wrote:
  We were originally using this approach, i.e. run things through the
  KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter.
 Again
  this works fine for text, though I had wanted to use the
 StandardTokenizer
  in the chain.  Is there an equivalent filter that does what the
  StandardTokenizer does?
 
  All of this said, this doesn't address the issue of the primitive field
  types, which at this point is the bigger issue.  Given this use case
 should
  there be another way to provide payloads?
 
  My current thinking is that I will need to provide custom
 implementations
  for all of the field types I would like to support payloads on, which
 will
  essentially be copies of the standard versions with some extra sugar
 to
  read/write the payloads (I don't see a way to wrap/delegate these at
 this
  point because AttributeSource has the attribute retrieval related
 methods
  as final so I can't simply wrap another tokenizer and return my added
  attributes + the wrapped attributes).  I know my use case is a bit
 strange,
  but I had not expected to need to do this given that Lucene/Solr
 supports
  payloads on these field types, they just aren't exposed.
 
  As always I appreciate any ideas if I'm barking up the wrong tree here.
 
  On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma 
 markus.jel...@openindex.io
  wrote:
 
  Well, if I remember correctly (I have no testing facility at hand)
  WordDelimiterFilter maintains payloads on emitted sub-terms. So if you
 use
  a KeywordTokenizer, input 'some text^PAYLOAD', and have a
  DelimitedPayloadFilter, the entire string gets a payload. You can then
  split that string up again into individual tokens. It is possible to
 abuse
  WordDelimiterFilter for this because it has a "types" parameter that you
 can
  use to split on whitespace if its input is not trimmed. Otherwise
 you
  can use any other character instead of a space in your input.
 
  This is a crazy idea, but it might work.
 
  -Original message-
   From:Jamie Johnson jej2...@gmail.com
   Sent: Tuesday 25th August 2015 19:37
   To: solr-user@lucene.apache.org
   Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
  
   To be clear, we are using payloads as a way to attach authorizations
 to
   individual tokens within Solr.  The payloads are normal Solr Payloads
   though we are not using floats, we are using the identity payload
 encoder
   (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows
 for
   storing a byte[] of our choosing into the payload field.
  
   This works great for text, but now that I'm indexing more than just
 text
  I
   need a way to specify the payload on the other field types.  Does
 that
  make
   more sense?
  
   On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
  
This really sounds like an XY problem. Or when you use

splitting shards on 4.7.2 with custom plugins

2015-08-25 Thread Jeff Courtade
I am getting failures when trying to split shards on Solr 4.7.2 with
custom plugins.

It fails regularly; it cannot find the jar files for the plugins when creating
the new cores/shards.

Ideas?

--
Thanks,

Jeff Courtade
M: 240.507.6116


Re: Solr performance is slow with just 1GB of data indexed

2015-08-25 Thread Toke Eskildsen
On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote:
 I'm currently trying out the Carrot2 Workbench, getting it to call Solr
 to see how they did the clustering. Although it still takes some time to do
 the clustering, the results of the clusters are much better than mine. I
 think it's probably due to different settings like fragSize and
 desiredClusterCountBase?

Either that or the carrot bundled with Solr is an older version.

 By the way, the link on the clustering example
 https://cwiki.apache.org/confluence/display/solr/Result is not working as
 it says 'Page Not Found'.

That is because it is too long for a single line. Try copy-pasting it:

https://cwiki.apache.org/confluence/display/solr/Result
+Clustering#ResultClustering-Configuration

- Toke Eskildsen, State and University Library, Denmark




Re: Behavior of grouping on a field with same value spread across shards.

2015-08-25 Thread Modassar Ather
Thanks Erick,

I saw the link. So is it that the grouping functionality works fine in
distributed search except for the two cases mentioned in the link?

Regards,
Modassar

On Tue, Aug 25, 2015 at 10:40 PM, Erick Erickson erickerick...@gmail.com
wrote:

 That's not really the case. Perhaps you're confusing
 group.ngroups and group.facet with just grouping?

 See the ref guide:

 https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats

 Best,
 Erick

 On Tue, Aug 25, 2015 at 4:51 AM, Modassar Ather modather1...@gmail.com
 wrote:
  Hi,
 
  As per my understanding, to group on a field all documents with the same
  value in the field have to be in the same shard.
 
  Can we group by a field where the documents with the same value in that
  field will be distributed across shards?
  Please let me know the limitations, unavailable features, or
  performance issues for such fields.
 
  Thanks,
  Modassar



Re: Solr performance is slow with just 1GB of data indexed

2015-08-25 Thread Toke Eskildsen
On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote:
 Would like to confirm: when I set rows=100, does it mean that it only builds
 the clusters based on the first 100 records that are returned by the search,
 and if I have 1000 records that match the search, all the remaining 900
 records will not be considered for clustering?

That is correct. It is not stated very clearly, but it follows from
reading the comments in the third example at
https://cwiki.apache.org/confluence/display/solr/Result
+Clustering#ResultClustering-Configuration
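
So, presumably, widening the pool is just a matter of raising rows on the
clustering request, at the cost of speed, e.g.:

    q=foo&rows=1000&clustering=true&clustering.results=true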

 Because if that is the case, the result of the clustering may not be so accurate,
 as there is a possibility that the first 100 records might have a large amount
 of similarities in the records, while the subsequent 900 records have
 differences that could have impact on the cluster result.

Such is the nature of on-the-fly clustering. The clustering aims to be
as representative of your search result as possible. Assigning more
weight to the higher scoring documents (in this case: All the weight, as
those beyond the top-100 are not even considered) does this.

If that does not fit your expectations, maybe you need something else?
Plain faceting perhaps? Or maybe enrichment of the documents with some
sort of entity extraction?

- Toke Eskildsen, State and University Library, Denmark




how to prevent uuid-field changing in /update query?

2015-08-25 Thread CrazyDiamond
I have a uuid field. It is not set as unique, but nevertheless I want it not to
be changed every time I call /update. It might be because I added a
request handler named /update which contains a uuid update chain. But if
I do not do this I have no uuid at all. Maybe I can configure the uuid update chain to
set the uuid only if it is blank?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: User Authentication

2015-08-25 Thread Don Bosco Durai
You might have to use 5.3 when it is publicly available. It supports Basic
Auth. But based on my understanding of the authentication/authorization
framework implemented in 5.2, you need to use SolrCloud/ZooKeeper for
configuring the plugins.

Noble, Anshum or Ishan can confirm it. They are the original authors of these
features.

Thanks

Bosco



On 8/24/15, 2:30 PM, Steven White swhite4...@gmail.com wrote:

For my project, Kerberos is not a requirement.  What I need is:

1) Basic Auth to Solr server (at all access levels)
2) SSL support

My setup is not using ZK, it's a single core.

Steve

On Mon, Aug 24, 2015 at 4:12 PM, Don Bosco Durai bo...@apache.org wrote:

 Just curious, is Kerberos an option for you? If so, mostly all your 3
use
 cases will be addressed.

 Bosco


 On 8/24/15, 12:18 PM, Steven White swhite4...@gmail.com wrote:

 Hi Noble,
 
 Is everything in the link you provided applicable to Solr 5.2.1?
 
 Thanks
 
 Steve
 
 On Mon, Aug 24, 2015 at 2:20 PM, Noble Paul noble.p...@gmail.com
wrote:
 
  did you manage to look at the reference guide?
  https://cwiki.apache.org/confluence/display/solr/Securing+Solr
 
  On Mon, Aug 24, 2015 at 9:23 PM, LeZotte, Tom
  tom.lezo...@vanderbilt.edu wrote:
   Alex
    I got a super secret release of Solr 5.3.1, wasn't supposed to say
  anything.
  
    Yes, I'm running 5.2.1; I will check out the release notes for 5.3.
  
   Was looking for three types of user authentication, I guess.
   1. the Admin Console
    2. User auth for each Core (and select and update) on a server.
   3. HTML interface access (example: ajax-solr
  https://github.com/evolvingweb/ajax-solr)
  
   Thanks
  
   Tom LeZotte
   Health I.T. - Senior Product Developer
   (p) 615-875-8830
  
  
  
  
  
  
    On Aug 24, 2015, at 10:05 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:
  
   Thanks for the email from the future. It is good to start to
prepare
   for 5.3.1 now that 5.3 is nearly out.
  
   Joking aside (and assuming Solr 5.2.1), what exactly are you
trying to
   achieve? Solr should not actually be exposed to the users
directly. It
   should be hiding in a backend only visible to your middleware. If
you
   are looking for a HTML interface that talks directly to Solr after
   authentication, that's not the right way to set it up.
  
   That said, some security features are being rolled out and you
should
   definitely check the release notes for the 5.3.
  
   Regards,
 Alex.
   
   Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
   http://www.solr-start.com/
  
  
   On 24 August 2015 at 10:01, LeZotte, Tom
tom.lezo...@vanderbilt.edu
  wrote:
   Hi Solr Community
  
   I have been trying to add user authentication to our Solr 5.3.1
RedHat
   install. I've found some examples on user authentication on the Jetty
 side.
  But they have failed.
  
    Does anyone have a step-by-step example of authentication for the
 admin
  screen? And a core?
  
  
   Thanks
  
   Tom LeZotte
   Health I.T. - Senior Product Developer
   (p) 615-875-8830
  
  
  
  
  
  
  
 
 
 
  --
  -
  Noble Paul
 







Re: Unknown query parser 'terms' with TermsComponent defined

2015-08-25 Thread Chris Hostetter

1) The "terms" query parser (TermsQParser) has nothing to do with the
TermsComponent (the former is for querying many distinct terms, the
latter is for requesting info about low-level terms in your index)

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

2) TermsQParser (which is what you are trying to use with the {!terms... 
query syntax) was not added to Solr until 4.10

3) based on your example query, i'm pretty sure what you want is the
TermQParser: "term" (singular, no "s") ...

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermQueryParser

{!term f=id}ft849m81z
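
Applied to your original query, that would presumably be:

    'q'=>'_query_:"{!raw f=has_model_ssim}Batch" AND ({!term f=id}ft849m81z)'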


: We've encountered a strange situation, I'm hoping someone might be able to
: shed some light. We're using Solr 4.9 deployed in Tomcat 7.
...
:   'q'='_query_:{!raw f=has_model_ssim}Batch AND ({!terms
f=id}ft849m81z)',
...
: 'msg'='Unknown query parser \'terms\'',
: 'code'=400}}

...

: The terms component is defined in solrconfig.xml:
: 
:   searchComponent name=termsComponent class=solr.TermsComponent /

-Hoss
http://www.lucidworks.com/


RE: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-25 Thread Markus Jelsma
Well, if I remember correctly (I have no testing facility at hand)
WordDelimiterFilter maintains payloads on emitted sub-terms. So if you use a
KeywordTokenizer, input 'some text^PAYLOAD', and have a DelimitedPayloadFilter,
the entire string gets a payload. You can then split that string up again into
individual tokens. It is possible to abuse WordDelimiterFilter for this because
it has a "types" parameter that you can use to split on whitespace if its
input is not trimmed. Otherwise you can use any other character instead of a
space in your input.

This is a crazy idea, but it might work. 
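
Untested, but as a field type the idea might look something like this
(delimiter and attributes illustrative):

    <fieldType name="payload_text" class="solr.TextField">
      <analyzer type="index">
        <!-- the whole value, spaces included, becomes one token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- attach the identity-encoded payload to that single token -->
        <filter class="solr.DelimitedPayloadTokenFilterFactory"
                delimiter="^" encoder="identity"/>
        <!-- split it back into words; the sub-terms should keep the payload -->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
      </analyzer>
    </fieldType>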
 
-Original message-
 From:Jamie Johnson jej2...@gmail.com
 Sent: Tuesday 25th August 2015 19:37
 To: solr-user@lucene.apache.org
 Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
 
 To be clear, we are using payloads as a way to attach authorizations to
 individual tokens within Solr.  The payloads are normal Solr payloads,
 though we are not using floats; we are using the identity payload encoder
 (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for
 storing a byte[] of our choosing into the payload field.
 
 This works great for text, but now that I'm indexing more than just text I
 need a way to specify the payload on the other field types.  Does that make
 more sense?
 
 On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  This really sounds like an XY problem. Or when you use
  "payload" it's not the Solr payload.
 
  So Solr Payloads are a float value that you can attach to
  individual terms to influence the scoring. Attaching the
  _same_ payload to all terms in a field is much the same
  thing as boosting on any matches in the field at query time
  or boosting on the field at index time (this latter assuming
  that different docs would have different boosts).
 
  So can you back up a bit and tell us what you're trying to
  accomplish? Maybe we can be sure we're both talking about
  the same thing ;)
 
  Best,
  Erick
 
  On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote:
   I would like to specify a particular payload for all tokens emitted from
  a
   tokenizer, but don't see a clear way to do this.  Ideally I could specify
   that something like the DelimitedPayloadTokenFilter be run on the entire
   field and then standard analysis be done on the rest of the field, so in
   the case that I had the following text
  
    "this is a test\Foo"
  
    I would like to create tokens "this", "is", "a", "test", each with a
   payload
    of "Foo".  From what I'm seeing, though, only "test" gets the payload.  Is
   there
    any way to accomplish this or will I need to implement a custom tokenizer?
 
 


Query timeAllowed and its behavior.

2015-08-25 Thread Modassar Ather
Hi,

Kindly help me understand the query time allowed attribute. The following
is set in solrconfig.xml.
<int name="timeAllowed">30</int>

Does this setting stop the query from running after the timeAllowed is
reached? If not is there a way to stop it as it will occupy resources in
background for no benefit.

Thanks,
Modassar


Re: Search opening hours

2015-08-25 Thread Yonik Seeley
On Tue, Aug 25, 2015 at 5:02 PM, O. Klein kl...@octoweb.nl wrote:
 I'm trying to find the best way to search for stores that are open NOW.

It's probably not the *best* way, but assuming it's currently 4:10pm,
you could do

+open:[* TO 1610] +close:[1610 TO *]

And to account for days of the week, have different fields for each day:
openM, closeM, openT, closeT, etc.  Not super elegant, but it seems to
get the job done.
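
E.g. (field and type names illustrative):

    <dynamicField name="open*"  type="int" indexed="true" stored="true"/>
    <dynamicField name="close*" type="int" indexed="true" stored="true"/>

and for a Wednesday at 4:10pm:

    fq=+openW:[* TO 1610] +closeW:[1610 TO *]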

-Yonik


Exact substring search with ngrams

2015-08-25 Thread Christian Ramseyer
Hi

I'm trying to build an index for technical documents that basically
works like grep, i.e. the user gives an arbitrary substring somewhere
in a line of a document and the exact matches will be returned. I
specifically want no stemming etc. and keep all whitespace, parentheses
etc. because they might be significant. The only normalization is that
the search should be case-insensitive.

I tried to achieve this by tokenizing on line breaks, and then building
trigrams of the individual lines:

<fieldType name="configtext_trigram" class="solr.TextField">

  <analyzer type="index">

    <tokenizer class="solr.PatternTokenizerFactory"
        pattern="\R" group="-1"/>

    <filter class="solr.NGramFilterFactory"
        minGramSize="3" maxGramSize="3"/>
    <filter class="solr.LowerCaseFilterFactory"/>

  </analyzer>

  <analyzer type="query">

    <tokenizer class="solr.NGramTokenizerFactory"
        minGramSize="3" maxGramSize="3"/>
    <filter class="solr.LowerCaseFilterFactory"/>

  </analyzer>
</fieldType>

Then in the search, I use the edismax parser with mm=100%, so given the
documents


{"id":"test1","content":"
encryption
10.0.100.22
description
"}

{"id":"test2","content":"
10.100.0.22
description
"}

and the query content:encryption, this will turn into

parsedquery_toString:

+((content:enc content:ncr content:cry content:ryp
content:ypt content:pti content:tio content:ion)~8),

and return only the first document. All fine and dandy. But I have a
problem with possible false positives. If the search is e.g.

content:.100.22

then the generated query will be

parsedquery_toString:
+((content:.10 content:100 content:00. content:0.2 content:.22)~5),

and because all of the tokens are also generated for document test2 within
a proximity of 5, both documents will wrongly be returned.

So somehow I'd need to express the query content:.10 content:100
content:00. content:0.2 content:.22 with *the tokens exactly in this
order and nothing in between*. Is this somehow possible, maybe by using
the termvectors/termpositions stuff? Or am I trying to do something
that's fundamentally impossible? Other good ideas how to achieve this
kind of behaviour?

Thanks
Christian





Re: Exact substring search with ngrams

2015-08-25 Thread Erick Erickson
Hmmm, this sounds like a nonsensical question, but what do you mean
by "arbitrary substring"?

Because if your substrings consist of whole _tokens_, then ngramming
is totally unnecessary (and gets in the way). Phrase queries with no slop
fulfill this requirement.

But let's assume you need to match within tokens, i.e. if the doc
contains "my dog has fleas", you need to match input like "as fle"; in this
case ngramming is an option.

You have substantially different index and query time chains. The result is that
the positions of all the grams at index time are the same (in the quick experiment
I tried, all were 1). But at query time, each gram had an incremented position.

I'd start by using the query time analysis chain for indexing also. Next, I'd
try enclosing multiple words in double quotes at query time and go from there.
What you have now is an anti-pattern in that having substantially
different index
and query time analysis chains is not something that's likely to be very
predictable unless you know _exactly_ what the consequences are.

The admin/analysis page is your friend; in this case check the
"verbose" checkbox
to see what I mean.
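
For instance, once both chains use the NGramTokenizer, a quoted query such
as content:".100.22" should parse into an ordered, zero-slop phrase of the
grams, something like:

    content:".10 100 00. 0.2 .22"

rather than the unordered boolean query above.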

Best,
Erick

On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer r...@networkz.ch wrote:
 Hi

 I'm trying to build an index for technical documents that basically
 works like grep, i.e. the user gives an arbitrary substring somewhere
 in a line of a document and the exact matches will be returned. I
 specifically want no stemming etc. and keep all whitespace, parentheses
 etc. because they might be significant. The only normalization is that
 the search should be case-insensitive.

 I tried to achieve this by tokenizing on line breaks, and then building
 trigrams of the individual lines:

 <fieldType name="configtext_trigram" class="solr.TextField">

 <analyzer type="index">

 <tokenizer class="solr.PatternTokenizerFactory"
 pattern="\R" group="-1"/>

 <filter class="solr.NGramFilterFactory"
 minGramSize="3" maxGramSize="3"/>
 <filter class="solr.LowerCaseFilterFactory"/>

 </analyzer>

 <analyzer type="query">

 <tokenizer class="solr.NGramTokenizerFactory"
 minGramSize="3" maxGramSize="3"/>
 <filter class="solr.LowerCaseFilterFactory"/>

 </analyzer>
 </fieldType>

 Then in the search, I use the edismax parser with mm=100%, so given the
 documents


 {"id":"test1","content":"
 encryption
 10.0.100.22
 description
 "}

 {"id":"test2","content":"
 10.100.0.22
 description
 "}

 and the query content:encryption, this will turn into

 parsedquery_toString:

 +((content:enc content:ncr content:cry content:ryp
 content:ypt content:pti content:tio content:ion)~8),

 and return only the first document. All fine and dandy. But I have a
 problem with possible false positives. If the search is e.g.

 content:.100.22

 then the generated query will be

 parsedquery_toString:
 +((content:.10 content:100 content:00. content:0.2 content:.22)~5),

 and because all of the tokens are also generated for document test2 within
 a proximity of 5, both documents will wrongly be returned.

 So somehow I'd need to express the query content:.10 content:100
 content:00. content:0.2 content:.22 with *the tokens exactly in this
 order and nothing in between*. Is this somehow possible, maybe by using
 the termvectors/termpositions stuff? Or am I trying to do something
 that's fundamentally impossible? Other good ideas how to achieve this
 kind of behaviour?

 Thanks
 Christian





Re: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-25 Thread Erick Erickson
Well, you're going down a path that hasn't been trodden before ;).

If you can treat your primitive types as text types you might get
some traction, but that makes a lot of operations like numeric
comparison difficult.

H. another idea from left field. For single-valued types,
what about a sidecar field that has the auth token? And even
for a multiValued field, two parallel fields are guaranteed to
maintain order so perhaps you could do something here. Yes,
I'm waving my hands a LOT here.
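
Something like this, purely as hand-waving (names illustrative):

    <field name="tags"      type="string" indexed="true" stored="true"
           multiValued="true"/>
    <field name="tags_auth" type="string" indexed="true" stored="true"
           multiValued="true"/>

where tags_auth[i] holds the auth token for tags[i], relying on multiValued
fields preserving insertion order.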

I suspect that trying to have a custom type that incorporates
payloads for, say, trie fields will be interesting to say the least.
Numeric types are packed to save storage etc. so it'll be
an adventure..

Best,
Erick

On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson jej2...@gmail.com wrote:
 We were originally using this approach, i.e. run things through the
 KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter.  Again
 this works fine for text, though I had wanted to use the StandardTokenizer
 in the chain.  Is there an equivalent filter that does what the
 StandardTokenizer does?

 All of this said, this doesn't address the issue of the primitive field
 types, which at this point is the bigger issue.  Given this use case, should
 there be another way to provide payloads?

 My current thinking is that I will need to provide custom implementations
 for all of the field types I would like to support payloads on, which will
 essentially be copies of the standard versions with some extra sugar to
 read/write the payloads (I don't see a way to wrap/delegate these at this
 point because AttributeSource has the attribute retrieval related methods
 as final so I can't simply wrap another tokenizer and return my added
 attributes + the wrapped attributes).  I know my use case is a bit strange,
 but I had not expected to need to do this given that Lucene/Solr supports
 payloads on these field types, they just aren't exposed.

 As always I appreciate any ideas if I'm barking up the wrong tree here.

 On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma markus.jel...@openindex.io
 wrote:

 Well, if I remember correctly (I have no testing facility at hand)
 WordDelimiterFilter maintains payloads on emitted sub-terms. So if you use
 a KeywordTokenizer, input 'some text^PAYLOAD', and have a
 DelimitedPayloadFilter, the entire string gets a payload. You can then
 split that string up again into individual tokens. It is possible to abuse
 WordDelimiterFilter for this because it has a "types" parameter that you can
 use to split on whitespace if its input is not trimmed. Otherwise you
 can use any other character instead of a space in your input.

 This is a crazy idea, but it might work.

 -Original message-
  From:Jamie Johnson jej2...@gmail.com
  Sent: Tuesday 25th August 2015 19:37
  To: solr-user@lucene.apache.org
  Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
 
  To be clear, we are using payloads as a way to attach authorizations to
   individual tokens within Solr.  The payloads are normal Solr payloads,
   though we are not using floats; we are using the identity payload encoder
  (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for
  storing a byte[] of our choosing into the payload field.
 
  This works great for text, but now that I'm indexing more than just text
 I
  need a way to specify the payload on the other field types.  Does that
 make
  more sense?
 
  On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
   This really sounds like an XY problem. Or when you use
    "payload" it's not the Solr payload.
  
   So Solr Payloads are a float value that you can attach to
   individual terms to influence the scoring. Attaching the
   _same_ payload to all terms in a field is much the same
   thing as boosting on any matches in the field at query time
   or boosting on the field at index time (this latter assuming
   that different docs would have different boosts).
  
   So can you back up a bit and tell us what you're trying to
    accomplish? Maybe we can be sure we're both talking about
   the same thing ;)
  
   Best,
   Erick
  
   On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com
 wrote:
I would like to specify a particular payload for all tokens emitted
 from
   a
tokenizer, but don't see a clear way to do this.  Ideally I could
 specify
that something like the DelimitedPayloadTokenFilter be run on the
 entire
field and then standard analysis be done on the rest of the field,
 so in
the case that I had the following text
   
     "this is a test\Foo"
   
 I would like to create tokens "this", "is", "a", "test", each with a
    payload
 of "Foo".  From what I'm seeing, though, only "test" gets the payload.  Is
    there
 any way to accomplish this or will I need to implement a custom
  tokenizer?
  
 



ANNOUNCE: Apache Solr Reference Guide for Solr 5.3 released

2015-08-25 Thread Cassandra Targett
The Lucene PMC is pleased to announce the release of the Solr Reference
Guide for Solr 5.3.

This 577 page PDF is the definitive guide for using Apache Solr and can be
downloaded from:

https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/

If you have


Re: Unknown query parser 'terms' with TermsComponent defined

2015-08-25 Thread P Williams
Thanks Hoss! It's obvious what the problem(s) are when you lay it all out
that way.

On Tue, Aug 25, 2015 at 12:14 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 1) The "terms" query parser (TermsQParser) has nothing to do with the
 TermsComponent (the former is for querying many distinct terms, the
 latter is for requesting info about low-level terms in your index)


 https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser
 https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

 2) TermsQParser (which is what you are trying to use with the {!terms...
 query syntax) was not added to Solr until 4.10

 3) based on your example query, i'm pretty sure what you want is the
 TermQParser: "term" (singular, no "s") ...


 https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermQueryParser

 {!term f=id}ft849m81z


 : We've encountered a strange situation, I'm hoping someone might be able
 to
 : shed some light. We're using Solr 4.9 deployed in Tomcat 7.
 ...
 :   'q'='_query_:{!raw f=has_model_ssim}Batch AND ({!terms
 f=id}ft849m81z)',
 ...
 : 'msg'='Unknown query parser \'terms\'',
 : 'code'=400}}

 ...

 : The terms component is defined in solrconfig.xml:
 :
 :   searchComponent name=termsComponent class=solr.TermsComponent /

 -Hoss
 http://www.lucidworks.com/



RE: Bot protection (CAPTCHA)

2015-08-25 Thread Davis, Daniel (NIH/NLM) [C]
 So, usually, the middleware is the answer, just like with a database.

With applications backed by database systems, there is usually an application
server tier, and then a database tier.  There may be a web server tier in
front of the application server tier.  The search engine and database belong
in the same tier.  Suppose your search needs the title and some other
information to be displayed with search results - store these in the search
engine.  Suppose your detailed pages need lots of additional fields - maybe
you can keep those in your database and retrieve them only as needed for
click-through.

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Tuesday, August 25, 2015 9:40 AM
To: solr-user solr-user@lucene.apache.org
Subject: Re: Bot protection (CAPTCHA)

The standard answer is that exposing the API is a REALLY bad idea. To start
with, you can issue delete commands through the API, and they can be
escaped in multiple different ways.

Plus, you have the admin UI there as well, to manipulate the cores and to
see their configuration files.

So, usually, the middleware is the answer, just like with a database.

The most recent (5.3!) version of Solr added some authentication, but that's
still not something you could use from a public web page, as that would imply
hard-coding a password.

You could possibly make index read-only, lock down filesystem, etc.
But that's a lot of effort and logistics.

Regards,
   Alex.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 August 2015 at 09:29, Dmitry Savenko d...@dsavenko.com wrote:
 Hello,

 I plan to expose the Solr search REST API to the world, so it can be
 called from my web page directly, without an additional server layer. I'm
 concerned about bots, so I plan to add a CAPTCHA to my page. Surely, I'd
 like to do it with as little effort as possible. Does Solr provide 
 CAPTCHA support out of the box or via some plugins? I've searched the 
 docs and haven't found any mentions of it.

 Or, maybe, exposing the API is an extremely bad idea, and I should 
 have a middle layer on the server side?

 Any help would be much appreciated!

 Best regards, Dmitry.


Search opening hours

2015-08-25 Thread O. Klein
I'm trying to find the best way to search for stores that are open NOW.

I have day of week, open and closing times.

I've seen some examples, but not an exact fit.

What is the best way to tackle this?

Thank you for any suggestions you have to offer.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query timeAllowed and its behavior.

2015-08-25 Thread Jonathon Marks (BLOOMBERG/ LONDON)
timeAllowed applies to the time taken by the collector in each shard 
(TimeLimitingCollector). Once timeAllowed is exceeded the collector terminates 
early, returning any partial results it has and freeing the resources it was 
using.
From Solr 5.0 timeAllowed also applies to the query expansion phase and 
SolrClient request retry.
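
For reference, timeAllowed (in milliseconds) can be set per request or in a
handler's defaults in solrconfig.xml, along these lines (a sketch):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <int name="timeAllowed">30</int>
    </lst>
  </requestHandler>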

From: solr-user@lucene.apache.org At: Aug 25 2015 10:18:07
Subject: Re: Query timeAllowed and its behavior.

Hi,

Kindly help me understand the query time allowed attribute. The following
is set in solrconfig.xml.
<int name="timeAllowed">30</int>

Does this setting stop the query from running after the timeAllowed is
reached? If not, is there a way to stop it, as it will otherwise occupy
resources in the background for no benefit?

Thanks,
Modassar




Re: Lucene/Solr 5.0 and custom FieldCache implementation

2015-08-25 Thread Mikhail Khludnev
On Tue, Aug 25, 2015 at 2:03 PM, Jamie Johnson jej2...@gmail.com wrote:

 Thanks Mikhail.  If I'm reading the SimpleFacets class correctly, it
 delegates to DocValuesFacets when the facet method is FC, which used to be
 FieldCache I believe.  DocValuesFacets either uses DocValues or builds them
 using the UninvertingReader.


Ah.. got it. Thanks for reminding me of these details. It seems like even
docValues=true doesn't help with your custom implementation.



 I am not seeing a clean extension point to add a custom UninvertingReader
 to Solr, would the only way be to copy the FacetComponent and SimpleFacets
 and modify as needed?

Sadly, yes. There is no proper extension point. Also, consider overriding
SolrIndexSearcher.wrapReader(SolrCore, DirectoryReader), where the
particular UninvertingReader is created; there you can pass in your own
one, which refers to a custom FieldCache.
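
A rough sketch of that hook (assuming Solr 5.x internals; the helper class
is hypothetical, and the stock UninvertingReader.wrap call is where a
custom, security-aware subclass would be substituted):

  import java.io.IOException;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.uninverting.UninvertingReader;
  import org.apache.solr.core.SolrCore;

  // Hypothetical replacement for the wrapping done in
  // SolrIndexSearcher.wrapReader(...): wrap the raw reader so that
  // uninversion can go through a security-aware cache instead of the
  // default FieldCache.
  public class SecurityAwareWrapper {
    public static DirectoryReader wrap(SolrCore core, DirectoryReader reader)
        throws IOException {
      // The stock call is shown here; substituting a custom subclass of
      // UninvertingReader at this point is where a security token could
      // be folded into the cache keys.
      return UninvertingReader.wrap(reader,
          core.getLatestSchema().getUninversionMap(reader));
    }
  }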


 On Aug 25, 2015 12:42 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

  Hello Jamie,
  I don't understand how it could choose DocValuesFacets (that happens when
  docValues=true) but then switch to UninvertingReader/FieldCache, which
  implies docValues=false. If you can provide more details it would be
  great.
  Besides that, I suppose you can only implement and inject your own
  UninvertingReader; I don't think there is an extension point for this.
  It's too specific a requirement.
 
  On Tue, Aug 25, 2015 at 3:50 AM, Jamie Johnson jej2...@gmail.com
 wrote:
 
   as mentioned in a previous email I have a need to provide security
  controls
   at the term level.  I know that Lucene/Solr doesn't support this so I
 had
   baked something onto a 4.x baseline that was sufficient for my use
 cases.
   I am now looking to move that implementation to 5.x and am running into
  an
   issue around faceting.  Previously we were able to provide a custom
 cache
   implementation that would create separate cache entries given a
  particular
   set of security controls, but in Solr 5 some faceting is delegated to
   DocValuesFacets which delegates to UninvertingReader in my case (we are
  not
   storing DocValues).  The issue I am running into is that before 5.x I
 had
   the ability to influence the FieldCache that was used at the Solr level
  to
   also include a security token into the key so each cache entry was
 scoped
   to a particular level.  With the current implementation the FieldCache
   seems to be an internal detail that I can't influence in any way.  Is
 this
   correct?  I had noticed this Jira ticket
   https://issues.apache.org/jira/browse/LUCENE-5427, is there any
 movement
   on
   this?  Is there another way to influence the information that is put
 into
   these caches?  As always thanks in advance for any suggestions.
  
   -Jamie
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Lucene/Solr 5.0 and custom FieldCache implementation

2015-08-25 Thread Jamie Johnson
I had seen this as well, if I over wrote this by extending
SolrIndexSearcher how do I have my extension used?  I didn't see a way that
could be plugged in.
On Aug 25, 2015 7:15 AM, Mikhail Khludnev mkhlud...@griddynamics.com
wrote:

 On Tue, Aug 25, 2015 at 2:03 PM, Jamie Johnson jej2...@gmail.com wrote:

  Thanks Mikhail.  If I'm reading the SimpleFacets class correctly, it
  delegates to DocValuesFacets when the facet method is FC, which used to be
  FieldCache I believe.  DocValuesFacets either uses DocValues or builds
  them using the UninvertingReader.
 

 Ah.. got it. Thanks for reminding me of these details. It seems like even
 docValues=true doesn't help with your custom implementation.


 
  I am not seeing a clean extension point to add a custom UninvertingReader
  to Solr, would the only way be to copy the FacetComponent and
 SimpleFacets
  and modify as needed?
 
 Sadly, yes. There is no proper extension point. Also, consider overriding
 SolrIndexSearcher.wrapReader(SolrCore, DirectoryReader), where the
 particular UninvertingReader is created; there you can pass in your own
 one, which refers to a custom FieldCache.


  On Aug 25, 2015 12:42 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 
  wrote:
 
   Hello Jamie,
   I don't understand how it could choose DocValuesFacets (that happens
   when docValues=true) but then switch to UninvertingReader/FieldCache,
   which implies docValues=false. If you can provide more details it would
   be great.
   Besides that, I suppose you can only implement and inject your own
   UninvertingReader; I don't think there is an extension point for this.
   It's too specific a requirement.
  
   On Tue, Aug 25, 2015 at 3:50 AM, Jamie Johnson jej2...@gmail.com
  wrote:
  
as mentioned in a previous email I have a need to provide security
   controls
at the term level.  I know that Lucene/Solr doesn't support this so I
  had
baked something onto a 4.x baseline that was sufficient for my use
  cases.
I am now looking to move that implementation to 5.x and am running
 into
   an
issue around faceting.  Previously we were able to provide a custom
  cache
implementation that would create separate cache entries given a
   particular
set of security controls, but in Solr 5 some faceting is delegated to
DocValuesFacets which delegates to UninvertingReader in my case (we
 are
   not
storing DocValues).  The issue I am running into is that before 5.x I
  had
the ability to influence the FieldCache that was used at the Solr
 level
   to
also include a security token into the key so each cache entry was
  scoped
to a particular level.  With the current implementation the
 FieldCache
seems to be an internal detail that I can't influence in any way.  Is
  this
correct?  I had noticed this Jira ticket
https://issues.apache.org/jira/browse/LUCENE-5427, is there any
  movement
on
this?  Is there another way to influence the information that is put
  into
these caches?  As always thanks in advance for any suggestions.
   
-Jamie
   
  
  
  
   --
   Sincerely yours
   Mikhail Khludnev
   Principal Engineer,
   Grid Dynamics
  
   http://www.griddynamics.com
   mkhlud...@griddynamics.com
  
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: how to prevent uuid-field changing in /update query?

2015-08-25 Thread Jamie Johnson
It sounds like you need to control when the uuid is and is not created,
just feels like you'd get better mileage doing this outside of solr
On Aug 25, 2015 7:49 AM, CrazyDiamond crazy_diam...@mail.ru wrote:

  Why not generate the uuid client side on the initial save and reuse this on
  updates?  I can't do this because I have delta-import queries which also
  should be able to assign a uuid when it is needed.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113p4225137.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: how to prevent uuid-field changing in /update query?

2015-08-25 Thread CrazyDiamond
It sounds like you need to control when the uuid is and is not created, 
just feels like you'd get better mileage doing this outside of solr 
Can I simply insert a condition (blank or not) in the uuid update-chain?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113p4225141.html
Sent from the Solr - User mailing list archive at Nabble.com.


Behavior of grouping on a field with same value spread across shards.

2015-08-25 Thread Modassar Ather
Hi,

As per my understanding, to group on a field, all documents with the same
value in that field have to be in the same shard.

Can we group by a field where the documents with the same value in that
field are distributed across shards?
Please let me know the limitations, unavailable features, or performance
issues for such fields.

Thanks,
Modassar


Re: how to prevent uuid-field changing in /update query?

2015-08-25 Thread Jamie Johnson
Why not generate the uuid client side on the initial save and reuse this on
updates?
On Aug 25, 2015 4:22 AM, CrazyDiamond crazy_diam...@mail.ru wrote:

 I have a uuid field. It is not set as unique, but nevertheless I want it
 not to be changed every time I call /update. It might be because I added a
 requesthandler with the name /update which contains the uuid update chain.
 But if I don't do this I have no uuid at all. Maybe I can configure the
 uuid update-chain to set the uuid only if it is blank?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Query timeAllowed and its behavior.

2015-08-25 Thread Modassar Ather
Thanks for your response Jonathon.

Please correct me if I am wrong on the following points.
   - The query actually ceases to run once timeAllowed is reached and
releases all the resources.
   - Query expansion is stopped and the query is terminated from execution,
releasing all the resources.

Thanks,
Modassar

On Tue, Aug 25, 2015 at 4:46 PM, Jonathon Marks (BLOOMBERG/ LONDON) 
jmark...@bloomberg.net wrote:

 timeAllowed applies to the time taken by the collector in each shard
 (TimeLimitingCollector). Once timeAllowed is exceeded the collector
 terminates early, returning any partial results it has and freeing the
 resources it was using.
 From Solr 5.0 timeAllowed also applies to the query expansion phase and
 SolrClient request retry.

 From: solr-user@lucene.apache.org At: Aug 25 2015 10:18:07
 Subject: Re: Query timeAllowed and its behavior.

 Hi,

 Kindly help me understand the query time allowed attribute. The following
 is set in solrconfig.xml.
 <int name="timeAllowed">30</int>

 Does this setting stop the query from running after the timeAllowed is
 reached? If not, is there a way to stop it, as it will otherwise occupy
 resources in the background for no benefit?

 Thanks,
 Modassar





Re: how to prevent uuid-field changing in /update query?

2015-08-25 Thread CrazyDiamond
Why not generate the uuid client side on the initial save and reuse this on 
updates?  I can't do this because I have delta-import queries which also
should be able to assign a uuid when it is needed.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113p4225137.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to prevent uuid-field changing in /update query?

2015-08-25 Thread Jamie Johnson
I am honestly not familiar enough to say.  Best to try it
On Aug 25, 2015 7:59 AM, CrazyDiamond crazy_diam...@mail.ru wrote:

 It sounds like you need to control when the uuid is and is not created,
 just feels like you'd get better mileage doing this outside of solr
  Can I simply insert a condition (blank or not) in the uuid update-chain?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113p4225141.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Lucene/Solr 5.0 and custom FieldCache implementation

2015-08-25 Thread Jamie Johnson
Thanks Mikhail.  If I'm reading the SimpleFacets class correctly, it
delegates to DocValuesFacets when the facet method is FC, which used to be
FieldCache I believe.  DocValuesFacets either uses DocValues or builds them
using the UninvertingReader.

I am not seeing a clean extension point to add a custom UninvertingReader
to Solr, would the only way be to copy the FacetComponent and SimpleFacets
and modify as needed?
On Aug 25, 2015 12:42 AM, Mikhail Khludnev mkhlud...@griddynamics.com
wrote:

 Hello Jamie,
 I don't understand how it could choose DocValuesFacets (that happens when
 docValues=true) but then switch to UninvertingReader/FieldCache, which
 implies docValues=false. If you can provide more details it would be
 great.
 Besides that, I suppose you can only implement and inject your own
 UninvertingReader; I don't think there is an extension point for this.
 It's too specific a requirement.

 On Tue, Aug 25, 2015 at 3:50 AM, Jamie Johnson jej2...@gmail.com wrote:

  as mentioned in a previous email I have a need to provide security
 controls
  at the term level.  I know that Lucene/Solr doesn't support this so I had
  baked something onto a 4.x baseline that was sufficient for my use cases.
  I am now looking to move that implementation to 5.x and am running into
 an
  issue around faceting.  Previously we were able to provide a custom cache
  implementation that would create separate cache entries given a
 particular
  set of security controls, but in Solr 5 some faceting is delegated to
  DocValuesFacets which delegates to UninvertingReader in my case (we are
 not
  storing DocValues).  The issue I am running into is that before 5.x I had
  the ability to influence the FieldCache that was used at the Solr level
 to
  also include a security token into the key so each cache entry was scoped
  to a particular level.  With the current implementation the FieldCache
  seems to be an internal detail that I can't influence in any way.  Is this
  correct?  I had noticed this Jira ticket
  https://issues.apache.org/jira/browse/LUCENE-5427, is there any movement
  on
  this?  Is there another way to influence the information that is put into
  these caches?  As always thanks in advance for any suggestions.
 
  -Jamie
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-25 Thread Jamie Johnson
Right, I had assumed (obviously here is my problem) that I'd be able to
specify payloads for the field regardless of the field type.  Looking at
TrieField that is certainly non-trivial.  After a bit of digging it appears
that if I wanted to do something here I'd need to build a new TrieField,
override createField and provide a Field that would return something like
NumericTokenStream but also provide the payloads.  Like you said sounds
interesting to say the least...
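
A rough sketch of that direction (assuming Solr 5.x signatures; the class
name is hypothetical and the hard part, a payload-bearing numeric token
stream, is elided):

  import org.apache.lucene.index.StorableField;
  import org.apache.solr.schema.SchemaField;
  import org.apache.solr.schema.TrieField;

  public class PayloadTrieField extends TrieField {
    @Override
    public StorableField createField(SchemaField field, Object value,
                                     float boost) {
      // Let TrieField produce the packed numeric field as usual...
      StorableField f = super.createField(field, value, boost);
      // ...then wrap/replace it with a Field whose TokenStream emits the
      // same numeric tokens plus a PayloadAttribute (elided).
      return f;
    }
  }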

Were payloads not really intended to be used for these types of fields from
a Lucene perspective?


On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Well, you're going down a path that hasn't been trodden before ;).

 If you can treat your primitive types as text types you might get
 some traction, but that makes a lot of operations like numeric
 comparison difficult.

 Hmmm, another idea from left field. For single-valued types,
 what about a sidecar field that has the auth token? And even
 for a multiValued field, two parallel fields are guaranteed to
 maintain order so perhaps you could do something here. Yes,
 I'm waving my hands a LOT here.

 I suspect that trying to have a custom type that incorporates
 payloads for, say, trie fields will be interesting to say the least.
 Numeric types are packed to save storage etc. so it'll be
 an adventure..

 Best,
 Erick

 On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson jej2...@gmail.com wrote:
  We were originally using this approach, i.e. run things through the
  KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter.  Again
  this works fine for text, though I had wanted to use the
 StandardTokenizer
  in the chain.  Is there an equivalent filter that does what the
  StandardTokenizer does?
 
  All of this said this doesn't address the issue of the primitive field
  types, which at this point is the bigger issue.  Given this use case
 should
  there be another way to provide payloads?
 
  My current thinking is that I will need to provide custom implementations
  for all of the field types I would like to support payloads on which will
  essentially be copies of the standard versions with some extra sugar to
  read/write the payloads (I don't see a way to wrap/delegate these at this
  point because AttributeSource has the attribute retrieval related methods
  as final so I can't simply wrap another tokenizer and return my added
  attributes + the wrapped attributes).  I know my use case is a bit
 strange,
  but I had not expected to need to do this given that Lucene/Solr supports
  payloads on these field types, they just aren't exposed.
 
  As always I appreciate any ideas if I'm barking up the wrong tree here.
 
  On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma 
 markus.jel...@openindex.io
  wrote:
 
  Well, if i remember correctly (i have no testing facility at hand)
  WordDelimiterFilter maintains payloads on emitted sub terms. So if you
 use
  a KeywordTokenizer, input 'some text^PAYLOAD', and have a
  DelimitedPayloadFilter, the entire string gets a payload. You can then
  split that string up again in individual tokens. It is possible to abuse
  WordDelimiterFilter for it because it has a types parameter that you can
  use to split it on whitespace if its input is not trimmed. Otherwise you
  can use any other character instead of a space as your input.
 
  This is a crazy idea, but it might work.
 
  -Original message-
   From:Jamie Johnson jej2...@gmail.com
   Sent: Tuesday 25th August 2015 19:37
   To: solr-user@lucene.apache.org
   Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
  
   To be clear, we are using payloads as a way to attach authorizations
 to
   individual tokens within Solr.  The payloads are normal Solr Payloads
   though we are not using floats, we are using the identity payload
 encoder
   (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for
   storing a byte[] of our choosing into the payload field.
  
   This works great for text, but now that I'm indexing more than just
 text
  I
   need a way to specify the payload on the other field types.  Does that
  make
   more sense?
  
   On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
  
This really sounds like an XY problem. Or when you use
payload it's not the Solr payload.
   
So Solr Payloads are a float value that you can attach to
individual terms to influence the scoring. Attaching the
_same_ payload to all terms in a field is much the same
thing as boosting on any matches in the field at query time
or boosting on the field at index time (this latter assuming
that different docs would have different boosts).
   
So can you back up a bit and tell us what you're trying to
accomplish maybe we can be sure we're both talking about
the same thing ;)
   
Best,
Erick
   
On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com
  wrote:
   

Re: how to prevent uuid-field changing in /update query?

2015-08-25 Thread Chris Hostetter


: updates?  I can't do this because I have delta-import queries which also
: should be able to assign a uuid when it is needed

You really need to give us a full and complete picture of what exactly you 
are currently doing, what's working, what's not working, and when it's not 
working, what it is doing and how that is different from what you expect.


example: you mentioned you have a requesthandler with the name /update which 
contains a uuid update chain (presumably you mean the processor) but you 
haven't shown us your configs, or any of your logs, so we can't see how 
exactly it's configured, or if/how it's being used.


If UUIDUpdateProcessorFactory is in place, then it should only generate a 
new UUID if the document doesn't already have one -- if you are using DIH 
to add documents to the index, and the uuid you are using/generating 
isn't also the uniqueKey field, then the UUIDUpdateProcessorFactory 
doesn't have any way of magically knowing when a new document is 
actually a replacement for an old document.
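
For reference, such a chain typically looks something like this in
solrconfig.xml (a sketch; the chain name and uuid field name are
illustrative):

  <updateRequestProcessorChain name="uuid">
    <processor class="solr.UUIDUpdateProcessorFactory">
      <str name="fieldName">uuid</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>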


(If you are using Atomic Updates, then registering 
UUIDUpdateProcessorFactory *after* the DistributedUpdateProcessorFactory 
can help -- but that doesn't sound like it's relevant if you are using DIH 
delta updates)




Please review this page and give us *all* the details about your 
current setup, your goal, and the specific problem you are facing...


https://wiki.apache.org/solr/UsingMailingLists



-Hoss
http://www.lucidworks.com/

RE: Spellcheck / Suggestions : Append custom dictionary to SOLR default index

2015-08-25 Thread Dyer, James
Max,

If you know the entire list of words you want to spellcheck against, you can 
use FileBasedSpellChecker.  See 
http://wiki.apache.org/solr/FileBasedSpellChecker .

If, however, you have a field you want to spellcheck against but also want 
additional words added, consider using a copy of the field for spellcheck 
purposes, and then index the additional terms to that field.   You may be able 
to accomplish this easily, for instance, by using index-time synonyms in the 
analysis chain for the spellcheck field.  Or you could just append them to any 
document (more than once if you want to boost the term frequency).

Keep in mind that while this will work fine for regular word-by-word spell 
suggestions, collations are not going to work well with these approaches.
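
For the first option, a minimal definition looks roughly like this (a
sketch following the wiki page above; the dictionary file name is
illustrative):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">file</str>
      <str name="classname">solr.FileBasedSpellChecker</str>
      <str name="sourceLocation">spellings.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <str name="spellcheckIndexDir">./spellcheckerFile</str>
    </lst>
  </searchComponent>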

James Dyer
Ingram Content Group

-Original Message-
From: Max Chadwick [mailto:mpchadw...@gmail.com] 
Sent: Monday, August 24, 2015 9:43 PM
To: solr-user@lucene.apache.org
Subject: Spellcheck / Suggestions : Append custom dictionary to SOLR default 
index

Is there a way to append a set of words to the out-of-the-box solr index when
using the spellcheck / suggestions feature?


Re: Search opening hours

2015-08-25 Thread Alexandre Rafalovitch
Have you seen:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3c1354991310424-4025359.p...@n3.nabble.com%3E
https://wiki.apache.org/solr/SpatialForTimeDurations
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
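
For the simple cases, a plain range-filter approach can also work: one
document per store per weekday, with minutes-since-midnight fields (the
field names here are illustrative):

  fq=day_of_week:2 AND open_min:[* TO 630] AND close_min:[630 TO *]

where 630 is "now" (10:30) in minutes since midnight. The spatial approach
in the links above handles overnight and multi-day spans more gracefully.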

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 August 2015 at 17:02, O. Klein kl...@octoweb.nl wrote:
 I'm trying to find the best way to search for stores that are open NOW.

 I have day of week, open and closing times.

 I've seen some examples, but not an exact fit.

 What is the best way to tackle this?

 Thank you for any suggestions you have to offer.

 View this message in context: 
 http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-25 Thread Jamie Johnson
We were originally using this approach, i.e. run things through the
KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter.  Again
this works fine for text, though I had wanted to use the StandardTokenizer
in the chain.  Is there an equivalent filter that does what the
StandardTokenizer does?
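
A chain like that might look roughly like this in schema.xml (a sketch;
the type name and delimiter are assumptions, and WordDelimiterFilter's
types parameter, mentioned below, is the knob for controlling the split):

  <fieldType name="text_payload" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.DelimitedPayloadTokenFilterFactory"
              encoder="identity" delimiter="|" />
      <filter class="solr.WordDelimiterFilterFactory" />
    </analyzer>
  </fieldType>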

All of this said this doesn't address the issue of the primitive field
types, which at this point is the bigger issue.  Given this use case should
there be another way to provide payloads?

My current thinking is that I will need to provide custom implementations
for all of the field types I would like to support payloads on which will
essentially be copies of the standard versions with some extra sugar to
read/write the payloads (I don't see a way to wrap/delegate these at this
point because AttributeSource has the attribute retrieval related methods
as final so I can't simply wrap another tokenizer and return my added
attributes + the wrapped attributes).  I know my use case is a bit strange,
but I had not expected to need to do this given that Lucene/Solr supports
payloads on these field types, they just aren't exposed.

As always I appreciate any ideas if I'm barking up the wrong tree here.

On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Well, if i remember correctly (i have no testing facility at hand)
 WordDelimiterFilter maintains payloads on emitted sub terms. So if you use
 a KeywordTokenizer, input 'some text^PAYLOAD', and have a
 DelimitedPayloadFilter, the entire string gets a payload. You can then
 split that string up again in individual tokens. It is possible to abuse
 WordDelimiterFilter for it because it has a types parameter that you can
 use to split it on whitespace if its input is not trimmed. Otherwise you
 can use any other character instead of a space as your input.

 This is a crazy idea, but it might work.

 -Original message-
  From:Jamie Johnson jej2...@gmail.com
  Sent: Tuesday 25th August 2015 19:37
  To: solr-user@lucene.apache.org
  Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
 
  To be clear, we are using payloads as a way to attach authorizations to
  individual tokens within Solr.  The payloads are normal Solr Payloads
  though we are not using floats, we are using the identity payload encoder
  (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for
  storing a byte[] of our choosing into the payload field.
 
  This works great for text, but now that I'm indexing more than just text
 I
  need a way to specify the payload on the other field types.  Does that
 make
  more sense?
 
  On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
   This really sounds like an XY problem. Or when you use
   payload it's not the Solr payload.
  
   So Solr Payloads are a float value that you can attach to
   individual terms to influence the scoring. Attaching the
   _same_ payload to all terms in a field is much the same
   thing as boosting on any matches in the field at query time
   or boosting on the field at index time (this latter assuming
   that different docs would have different boosts).
  
   So can you back up a bit and tell us what you're trying to
   accomplish maybe we can be sure we're both talking about
   the same thing ;)
  
   Best,
   Erick
  
   On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com
 wrote:
I would like to specify a particular payload for all tokens emitted
 from
   a
tokenizer, but don't see a clear way to do this.  Ideally I could
 specify
that something like the DelimitedPayloadTokenFilter be run on the
 entire
field and then standard analysis be done on the rest of the field,
 so in
the case that I had the following text
   
this is a test\Foo
   
I would like to create tokens this, is, a, test each with a
   payload
of Foo.  From what I'm seeing though only test gets the payload.  Is
   there
anyway to accomplish this or will I need to implement a custom
 tokenizer?
  
 



Please answer my question on StackOverflow ... Best approach to guarantee commits in SOLR

2015-08-25 Thread Simer P
http://stackoverflow.com/questions/32138845/what-is-the-best-approach-to-guarantee-commits-in-apache-solr
.

*Question:* How can I get guaranteed commits with Apache SOLR, where
persisting data to disk and visibility are both equally important?

*Background:* We have a website which requires high-end search
functionality for machine learning and also requires guaranteed commits for
financial transactions. We want SOLR as our only datastore to keep
things simple and *do not* want to use another database on the side.

I can't seem to find any answer to this question. The simplest solution for
a financial transaction seems to be to periodically query SOLR for the
record after it has been persisted, but this can mean a longer wait time.
Or is there a better solution?

Can anyone please suggest a solution for achieving guaranteed commits
with SOLR?


Re: Performance gain with setting !cache=false in the query for complex queries

2015-08-25 Thread wwang525
Hi Erick,

Up to now, all the tests were based on randomly generated requests. 

In reality, many requests will get executed more than twice since this is to
support the advertising project. On the other hand, new queries could be
generated daily. So some of the filter queries will be used frequently for a
period of time, and will not be used afterwards. 
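
(For reference, caching can be controlled per filter with the cache local
param, e.g.:

  fq={!cache=false}price:[100 TO 200]

so filters that are unlikely to be reused skip the filterCache, while
frequent ones still use it. The field and range are illustrative.)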

I will take your advice to analyze the real queries once the project is in
production.

Thank you very much!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Performance-gain-with-setting-cache-false-in-the-query-for-complex-queries-tp4224931p4225147.html
Sent from the Solr - User mailing list archive at Nabble.com.


Bot protection (CAPTCHA)

2015-08-25 Thread Dmitry Savenko
Hello,

I plan to expose Solr search REST API to the world, so it can be called
from my web page directly, without additional server layer. I'm
concerned about bots, so I plan to add CAPTCHA to my page. Surely, I'd
like to do it with as little effort as possible. Does Solr provide
CAPTCHA support out of the box or via some plugins? I've searched the
docs and haven't found any mentions of it.

Or, maybe, exposing the API is an extremely bad idea, and I should have
a middle layer on the server side?

Any help would be much appreciated!

Best regards, Dmitry.


Re: Bot protection (CAPTCHA)

2015-08-25 Thread Alexandre Rafalovitch
The standard answer is that exposing the API is a REALLY bad idea. To
start with, you can issue the delete commands through the API. And
they can be escaped in multiple different ways.

Plus, you have the admin UI there as well, to manipulate the cores and
to see the configuration files for them.

So, usually, the middleware is the answer, just like with a database.

The most recent (5.3!) version of Solr added some authentication, but
that's still not something you could use from a public web page, as
that would imply hard-coding the password.

You could possibly make index read-only, lock down filesystem, etc.
But that's a lot of effort and logistics.

Regards,
   Alex.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 August 2015 at 09:29, Dmitry Savenko d...@dsavenko.com wrote:
 Hello,

 I plan to expose Solr search REST API to the world, so it can be called
 from my web page directly, without additional server layer. I'm
 concerned about bots, so I plan to add CAPTCHA to my page. Surely, I'd
 like to do it with as little effort as possible. Does Solr provide
 CAPTCHA support out of the box or via some plugins? I've searched the
 docs and haven't found any mentions of it.

 Or, maybe, exposing the API is an extremely bad idea, and I should have
 a middle layer on the server side?

 Any help would be much appreciated!

 Best regards, Dmitry.


testing with EmbeddedSolrServer

2015-08-25 Thread Moen Endre
Is there an example of integration testing with EmbeddedSolrServer that loads 
data from a data import handler and then queries the data? I've tried doing 
this based on 
org.apache.solr.client.solrj.embedded.TestEmbeddedSolrServerConstructors.

But no data is being imported.  Here is the test class I've tried: 
https://gist.github.com/emoen/5d0a28df91c4c1127238

I've also tried writing a test by extending AbstractSolrTestCase - but haven't 
got this working. I've documented some of the log output here: 
http://stackoverflow.com/questions/32052642/solrcorestate-already-closed-with-unit-test-using-embeddedsolrserver-v-5-2-1

Should I extend AbstractSolrTestCase or SolrTestCaseJ4 when writing tests?

Cheers
Endre


CloudSolrClient does not distribute suggest.build=true

2015-08-25 Thread Arcadius Ahouansou
When using the new Suggester component (with AnalyzingInfixSuggester) in
Solr trunk with solrj, the suggest.build command seems to be executed only
on one of the solr cloud nodes.

I had to add shards.qt=/suggest and
shards=host1:port2/solr/mycollection,host2:port2/solr/mycollection... to
distribute the build command on all nodes.

Given that we are using SolrCloud, I would have expected the build command
to behave like a cloud update and be sent to all nodes without the need to
specify shards and shards.qt.
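
A sketch of the full workaround request (wrapped for readability; hosts and
ports are placeholders):

  /solr/mycollection/suggest?suggest=true&suggest.build=true
    &shards.qt=/suggest
    &shards=host1:port1/solr/mycollection,host2:port2/solr/mycollection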

Thanks.

Arcadius.


Re: Solr performance is slow with just 1GB of data indexed

2015-08-25 Thread Zheng Lin Edwin Yeo
Hi Toke,

Thank you for your reply.

I'm currently trying out the Carrot2 Workbench and getting it to call Solr
to see how they did the clustering. Although it still takes some time to do
the clustering, the results of the clusters are much better than mine. I
think it's probably due to different settings, like fragSize and
desiredClusterCountBase?

By the way, the link on the clustering example
https://cwiki.apache.org/confluence/display/solr/Result is not working as
it says 'Page Not Found'.

Regards,
Edwin


On 25 August 2015 at 15:29, Toke Eskildsen t...@statsbiblioteket.dk wrote:

 On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote:
  Would like to confirm: when I set rows=100, does it mean that it only
  builds the cluster based on the first 100 records that are returned by
  the search,
  and if I have 1000 records that match the search, all the remaining 900
  records will not be considered for clustering?

 That is correct. It is not stated very clearly, but it follows from
 reading the comments in the third example at
 https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration

  If that is the case, the result of the cluster may not be so accurate, as
  there is a possibility that the first 100 records have a large amount
  of similarity, while the subsequent 900 records have differences that
  could have an impact on the cluster result.

 Such is the nature of on-the-fly clustering. The clustering aims to be
 as representative of your search result as possible. Assigning more
 weight to the higher-scoring documents (in this case all the weight, as
 those beyond the top-100 are not even considered) does this.

 If that does not fit your expectations, maybe you need something else?
 Plain faceting perhaps? Or maybe enrichment of the documents with some
 sort of entity extraction?
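 
 (If the goal is to let more of the result set influence the clusters,
 raising rows on the clustering request is the usual lever, at a latency
 cost. A sketch, with illustrative field names:
 
   /select?q=...&rows=500&clustering=true&carrot.title=title&carrot.snippet=content
 )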

 - Toke Eskildsen, State and University Library, Denmark