Re: Solr 8.0 query length limit

2021-02-18 Thread Thomas Corthals
You can send big queries as a POST request instead of a GET request.
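
For example, something along these lines (the core name and keywords are placeholders):

curl http://localhost:8983/solr/yourcore/select --data-urlencode 'q=keyword1 OR keyword2 OR keyword3' -d rows=10

curl's --data-urlencode/-d options turn the request into a POST with the parameters in the body, so the request URI stays short and the 414 goes away.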

On Thu, 18 Feb 2021 at 11:38, Anuj Bhargava wrote:

> Solr 8.0 query length limit
>
> We are having an issue where, when queries are too big, we get no result. And if
> we remove a few keywords we get the result.
>
> Error we get - error 414 (Request-URI Too Long)
>
>
> Have made the following changes in jetty.xml, still the same error
>
> * name="solr.jetty.output.buffer.size" default="32768" />*
> * name="solr.jetty.output.aggregation.size" default="32768" />*
> * name="solr.jetty.request.header.size" default="65536" />*
> * name="solr.jetty.response.header.size" default="32768" />*
> * name="solr.jetty.send.server.version" default="false" />*
> * name="solr.jetty.send.date.header" default="false" />*
> * name="solr.jetty.header.cache.size" default="1024" />*
> * name="solr.jetty.delayDispatchUntilContent" default="false"/>*
>


Wrong HTTP status for HEAD request

2021-01-27 Thread Thomas Corthals
Hi,

In Solr 8.6.1, a GET request or a HEAD request for a non-existing term in a
managed resource (stopword or synonym) returns an HTTP status "404 Not
Found".

$ curl -i "
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english/foobar;
| head -n 1
HTTP/1.1 404 Not Found

$ curl -I "
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english/foobar;
| head -n 1
HTTP/1.1 404 Not Found

In Solr 8.7.0, the same GET request still returns "404 Not Found", but the
HEAD request now returns "200 OK" as if the term actually exists.

$ curl -i "
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english/foobar;
| head -n 1
HTTP/1.1 404 Not Found

$ curl -I "
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english/foobar;
| head -n 1
HTTP/1.1 200 OK

I presume that's a bug?

Thomas


JVM Memory Issue with Solr 8.7.0

2020-11-06 Thread Thomas Heldmann
Dear All,

After the release of Solr 8.7.0 I want to test the new version on my
notebook. It has the following specifications: Windows 10 64-bit, 16 GB
RAM, Amazon Corretto 11 64-bit, 50 GB free disk space. I downloaded
solr-8.7.0.zip and unzipped it into a local folder. In order to start
Solr in cloud mode and to use the Blob Store API, I start it with the
following command:

C:\Users\...\SolrCloud\solr-8.7.0\bin>solr start -cloud
-Denable.runtime.lib=true

So far everything works fine, I am able to access the Solr GUI via
http://localhost:8983/solr and the JVM Memory usage is about 200 MB.

Since the configset, which I want to load to Solr, requires a big jar
file with synonym files and commons-lang-2.6.jar, I created a folder
C:\Users\...\SolrCloud\solr-8.7.0\server\solr\lib into which I copied these
two jar files. Now I uploaded the configset to ZooKeeper using the
following command:

solr zk upconfig -d ... -z localhost:9983 -n ...

Now I create the collection via the Solr GUI. In earlier Solr versions,
JVM memory usage increased for a few seconds after creating the
collection and then decreased, and no Java errors occurred. But with
Solr 8.7.0, Solr uses the entire JVM Memory which it has by default (512
MB), the browser hangs up, my notebook becomes extremely slow and in the
Windows command line I am getting a java.lang.OutOfMemoryError. My first
thought was that 512 MB JVM Memory might be too little, so I stopped
Solr, activated the "set SOLR_JAVA_MEM" line in the bin\solr.in.cmd
file, set -Xmx to 1024m and restarted Solr. But Solr again claimed the
entire JVM memory. I increased -Xmx further, but that did not
help either.
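
For reference, the relevant line in bin\solr.in.cmd looks roughly like this once uncommented (the values are just what I tried):

set SOLR_JAVA_MEM=-Xms512m -Xmx1024m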

From the CHANGES.txt I learned that the Circuit Breaker Infrastructure and a
JVM heap usage memory tracking circuit breaker implementation were
introduced with Solr 8.7.0. I am not using a Circuit Breaker in my
solrconfig.xml. Is it possible that the issue described above is because
I am not using a Circuit Breaker? If this is not the case, has there
anything else changed from Solr 8.6.3 to Solr 8.7.0 that might cause
this issue? Or is there a problem with Solr and Windows 10 or Amazon
Corretto?

As I already said, the procedure described above worked well for the
Solr versions since Solr 6.6.1, without java.lang.OutOfMemoryError after
creating the collection.

Best regards,
Thomas Heldmann


Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Thomas Corthals
Hi Steve

I have a real-world use case. We don't apply a synonym filter at index
time, but we do apply a managed synonym filter at query time. This allows
content managers to add new synonyms (or remove existing ones) "on the fly"
without having to reindex any documents.
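
A rough sketch of such a field type (the field type name, the managed resource name and the surrounding filters here are only illustrative):

<fieldType name="text_managed_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ManagedSynonymGraphFilterFactory" managed="english"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>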

Thomas

On Thu, 10 Sep 2020 at 17:29, Dunham-Wilkie, Mike CITZ:EX <
mike.dunham-wil...@gov.bc.ca> wrote:

> Hi Steven,
>
> I can think of one case.  If we have an index of database table or column
> names, e.g., words like 'THIS_IS_A_TABLE_NAME', we may want to split the
> name at the underscores when indexing (as well as keep the original), since
> the individual parts might be significant and meaningful.  When querying,
> though, if the searcher types in THIS_IS_A_TABLE_NAME then they are likely
> looking for the whole string, so we wouldn't want to split it apart.
>
> There also seems to be a debate on whether the SYNONYM filter should be
> included on indexing, on querying, or on both.  Google "solr synonyms index
> vs query"
>
> Mike
>
> -Original Message-
> From: Steven White 
> Sent: September 10, 2020 8:19 AM
> To: solr-user@lucene.apache.org
> Subject: Why use a different analyzer for "index" and "query"?
>
> [EXTERNAL] This email came from an external source. Only open attachments
> or links that you are expecting from a known sender.
>
>
> Hi everyone,
>
> In Solr's schema, I have come across field types that use a different
> logic for "index" than for "query".  To be clear, I"m talking about this
> block:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     ...
>   </analyzer>
>   <analyzer type="query">
>     ...
>   </analyzer>
> </fieldType>
>
> Why would one want to not use the same logic for both and simply use:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     ...
>   </analyzer>
> </fieldType>
>
> What are real word use cases to use a different analyzer for index and
> query?
>
> Thanks,
>
> Steve
>


Rule-Based permissions for cores

2020-08-31 Thread Thomas Corthals
Hi,

I'm trying to configure the Rule-Based Authorization Plugin in Solr 8.4.0
in standalone mode. My goal is to limit a user's access to one or more
designated cores. My security.json looks like this:

{
  "authentication":{
    "blockUnknown":true,
    "class":"solr.BasicAuthPlugin",
    "credentials":{
      "solr":"...",
      "user1":"...",
      "user2":"..."},
    "realm":"Solr",
    "forwardCredentials":false,
    "":{"v":0}},
  "authorization":{
    "class":"solr.RuleBasedAuthorizationPlugin",
    "permissions":[
      {
        "name":"security-edit",
        "role":"admin",
        "index":1},
      {
        "name":"read",
        "collection":"core1",
        "role":"role1",
        "index":2},
      {
        "name":"read",
        "collection":"core2",
        "role":"role2",
        "index":3},
      {
        "name":"all",
        "role":"admin",
        "index":4}],
    "user-role":{
      "solr":"admin",
      "user1":"role1",
      "user2":"role2"},
    "":{"v":0}}}

With this setup, I'm unable to read from any of the cores with either user.
If I "delete-permission":4 both users can read from either core, not just
"their" core.

I have tried custom permissions like this to no avail:
{"name": "access-core1", "collection": "core1", "role": "role1"},
{"name": "access-core2", "collection": "core2", "role": "role2"},
{"name": "all", "role": "admin"}

Is it possible to do this for cores? Or am I out of luck because I'm not
using collections?

Regards

Thomas


Re: SynonymFilterFactory deprecated, documentation and search

2020-07-30 Thread Thomas Corthals
Do keep this paragraph from the docs in mind when switching from a
non-graph to a graph filter:

If you use this filter during indexing, you must follow it with a Flatten
Graph Filter to squash tokens on top of one another like the Synonym
Filter, because the indexer can’t directly consume a graph. To get fully
correct positional queries when your synonym replacements are multiple
tokens, you should instead apply synonyms using this filter at query time.
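
For example, an index-time analyzer using the graph filter needs the flatten step right after it (a sketch; synonyms.txt is assumed to exist in the configset):

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
  <filter class="solr.FlattenGraphFilterFactory"/>
</analyzer>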

Regards,

Thomas

On Thu, 30 Jul 2020 at 10:17, Colvin Cowie wrote:

> That does seem like an unhelpful example to have, though.
>
> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#synonym-filter
> does clearly state that it is deprecated in favour of
> SynonymGraphFilterFactory .
> Deprecated classes will (should) continue to work, but are likely to be
> removed at some point, e.g. the next major release. IIRC (might be wrong
> though) you can simply replace SynonymFilterFactory with
> SynonymGraphFilterFactory,
> and it should just work in most cases, but do test it.
>
> On Thu, 30 Jul 2020 at 07:52, Jayadevan Maymala 
> wrote:
>
> > Hi all,
> >
> > We have been using SynonymFilterFactory with Solr 7.3. It seems to be
> > working,
> > Going through the documentation for 8.6, I noticed that it was
> deprecated a
> > long time ago, probably before 7.3
> > The documentation at this url, for version 8.6 -
> >
> >
> https://lucene.apache.org/solr/guide/8_6/field-type-definitions-and-properties.html
> > does give <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> as an
> > example.
> > Two doubts -
> > Does a deprecated class continue working?
> > Shouldn't the documentation be updated to modify the example?
> >
> > A request - if the documentation at the url mentioned above has a search,
> > that will really help. I could find only a Page Title lookup.
> >
> > Regards,
> > Jayadevan
> >
>


Tokenizing managed synonyms

2020-07-06 Thread Thomas Corthals
Hi,

Is it possible to specify a Tokenizer Factory on a Managed Synonym Graph
Filter? I would like to use a Standard Tokenizer or Keyword Tokenizer on
some fields.

Best,

Thomas


Re: Solr Float/Double multivalues fields

2020-07-03 Thread Thomas Corthals
On Fri, 3 Jul 2020 at 14:11, Bram Van Dam wrote:

> On 03/07/2020 09:50, Thomas Corthals wrote:
> > I think this should go in the ref guide. If your product depends on this
> > behaviour, you want reassurance that it isn't going to change in the next
> > release. Not everyone will go looking through the javadoc to see if this
> is
> > implied.
>
> This is in the ref guide. Section DocValues. Here's the quote:
>
> DocValues are only available for specific field types. The types chosen
> determine the underlying Lucene
> docValue type that will be used. The available Solr field types are:
> • StrField, and UUIDField:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the SORTED type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
> Entries are kept in sorted order and
> duplicates are removed.
> • BoolField:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the SORTED type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
> Entries are kept in sorted order and
> duplicates are removed.
> • Any *PointField Numeric or Date fields, EnumFieldType, and
> CurrencyFieldType:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the NUMERIC type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_NUMERIC type.
> Entries are kept in sorted order
> and duplicates are kept.
> • Any of the deprecated Trie* Numeric or Date fields, EnumField and
> CurrencyField:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the NUMERIC type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
> Entries are kept in sorted order and
> duplicates are removed.
> These Lucene types are related to how the values are sorted and stored.
>

Great for docValues. But I couldn't find anything similar for multiValued
in the field type pages of the ref guide (unless I totally missed it
of course). It doesn't have to be as elaborate, as long as it's clear and
doesn't leave users wondering or assuming.


Re: Solr Float/Double multivalues fields

2020-07-03 Thread Thomas Corthals
I think this should go in the ref guide. If your product depends on this
behaviour, you want reassurance that it isn't going to change in the next
release. Not everyone will go looking through the javadoc to see if this is
implied.

Typically it'll either be something like "are always returned in insertion
order" or "are currently returned in insertion order, but your code
shouldn't rely on this behaviour because it can change in future releases".
That's usually sufficient to make an informed decision on how to handle
returned values.

If it's different for docValues, that's even more reason to state it
clearly in the ref guide to avoid confusion.

Best,
Thomas

On Thu, 2 Jul 2020 at 20:37, Erick Erickson wrote:

> This is true _unless_ you fetch from docValues. docValues are SORTED_SETs,
> so the results will be both ordered and deduplicated if you return them
> as part of the field list.
>
> Don’t really think it needs to go into the ref guide, it’s just inherent
> in storing
> any kind of value. You wouldn’t expect multiple text entries in a
> multiValued
> field to be rearranged when returning the stored values either.
>
> Best,
> Erick
>
> > On Jul 2, 2020, at 2:21 PM, Vincenzo D'Amore  wrote:
> >
> > Thanks, and genuinely asking: is there written somewhere in the
> > documentation too? If no, could anyone suggest to me which doc page
> should
> > I try to update?
> >
> > On Thu, Jul 2, 2020 at 8:08 PM Colvin Cowie 
> > wrote:
> >
> >> The order of values within a multivalued field should match the
> insertion
> >> order. -- we certainly rely on that in our product.
> >>
> >> Order is guaranteed to be maintained for values in a multi-valued field.
> >>>
> >>
> >>
> https://lucene.472066.n3.nabble.com/order-question-on-solr-multi-value-field-tp4027695p4028057.html
> >>
> >> On Thu, 2 Jul 2020 at 18:52, Vincenzo D'Amore 
> wrote:
> >>
> >>> Hi all,
> >>>
> >>> simple question: Solr float/double multivalue fields preserve the order
> >> of
> >>> inserted values?
> >>>
> >>> Best regards,
> >>> Vincenzo
> >>>
> >>> --
> >>> Vincenzo D'Amore
> >>>
> >>
> >
> >
> > --
> > Vincenzo D'Amore
>
>


Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Thomas Corthals
Since "overseer" is also problematic, I'd like to propose "orchestrator" as
an alternative.

Thomas

On Fri, 19 Jun 2020 at 04:34, Walter Underwood wrote:

> We don’t get to decide whether “master” is a problem. The rest of the world
> has already decided that it is a problem.
>
> Our task is to replace the terms “master” and “slave” in Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 18, 2020, at 6:50 PM, Rahul Goswami 
> wrote:
> >
> > I agree with Phill, Noble and Ilan above. The problematic term is "slave"
> > (not master) which I am all for changing if it causes less regression
> than
> > removing BOTH master and slave. Since some people have pointed out Github
> > changing the "master" terminology, in my personal opinion, it was not a
> > measured response to addressing the bigger problem we are all trying to
> > tackle. There is no concept of a "slave" branch, and "master" by itself
> is
> > a pretty generic term (Is someone having "mastery" over a skill a bad
> > thing?). I fear all it would end up achieving in the end with Github is a
> > mess of broken build scripts at best.
> > So +1 on "slave" being the problematic term IMO, not "master".
> >
> > On Thu, Jun 18, 2020 at 8:19 PM Phill Campbell
> >  wrote:
> >
> >> Master - Worker
> >> Master - Peon
> >> Master - Helper
> >> Master - Servant
> >>
> >> The term that is not wanted is “slave’. The term “master” is not a
> problem
> >> IMO.
> >>
> >>> On Jun 18, 2020, at 3:59 PM, Jan Høydahl 
> wrote:
> >>>
> >>> I support Mike Drob and Trey Grainger. We shuold re-use the
> >> leader/replica
> >>> terminology from Cloud. Even if you hand-configure a master/slave
> cluster
> >>> and orchestrate what doc goes to which node/shard, and hand-code your
> >> shards
> >>> parameter, you will still have a cluster where you’d send updates to
> the
> >> leader of
> >>> each shard and the replicas would replicate the index from the leader.
> >>>
> >>> Let’s instead find a new good name for the cluster type. Standalone
> kind
> >> of works
> >>> for me, but I see it can be confused with single-node. We have also
> >> discussed
> >>> replacing SolrCloud (which is a terrible name) with something more
> >> descriptive.
> >>>
> >>> Today: SolrCloud vs Master/slave
> >>> Alt A: SolrCloud vs Standalone
> >>> Alt B: SolrCloud vs Legacy
> >>> Alt C: Clustered vs Independent
> >>> Alt D: Clustered vs Manual mode
> >>>
> >>> Jan
> >>>
> >>>> On 18 Jun 2020, at 15:53, Mike Drob wrote:
> >>>>
> >>>> I personally think that using Solr cloud terminology for this would be
> >> fine
> >>>> with leader/follower. The leader is the one that accepts updates,
> >> followers
> >>>> cascade the updates somehow. The presence of ZK or election doesn’t
> >> really
> >>>> change this detail.
> >>>>
> >>>> However, if folks feel that it’s confusing, then I can’t tell them
> that
> >>>> they’re not confused. Especially when they’re working with others who
> >> have
> >>>> less Solr experience than we do and are less familiar with the
> >> intricacies.
> >>>>
> >>>> Primary/Replica seems acceptable. Coordinator instead of Overseer
> seems
> >>>> acceptable.
> >>>>
> >>>> Would love to see this in 9.0!
> >>>>
> >>>> Mike
> >>>>
> >>>> On Thu, Jun 18, 2020 at 8:25 AM John Gallagher
> >>>>  wrote:
> >>>>
> >>>>> While on the topic of renaming roles, I'd like to propose finding a
> >> better
> >>>>> term than "overseer" which has historical slavery connotations as
> well.
> >>>>> Director, perhaps?
> >>>>>
> >>>>>
> >>>>> John Gallagher
> >>>>>
> >>>>> On Thu, Jun 18, 2020 at 8:48 AM Jason Gerlowski <
> gerlowsk...@gmail.com
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> +1 to rename master/slave, and +1 to choosing terminology distinct
> >>

Re: Order of spellcheck suggestions

2020-06-16 Thread Thomas Corthals
Can anybody shed some light on this? If not, I'm going to report it as a
bug in JIRA.

Thomas

On Sat, 13 Jun 2020 at 13:37, Thomas Corthals wrote:

> Hi
>
> I'm seeing different ordering on the spellcheck suggestions in cloud mode
> when using spellcheck.extendedResults=false vs.
> spellcheck.extendedResults=true.
>
> Solr 8.5.2 in cloud mode with 2 nodes, 1 collection with numShards = 2 &
> replicationFactor = 1, techproducts configset and example data:
>
> $ curl 'http://localhost:8983/solr/techproducts/spell?q=power%20cort&spellcheck.extendedResults=false'
>
> "suggestion":["cord", "corp", "card"]}],
>
> $ curl 'http://localhost:8983/solr/techproducts/spell?q=power%20cort&spellcheck.extendedResults=true'
>
> "suggestion":[{ "word":"corp", "freq":2}, { "word":"cord", "freq":1}, {
> "word":"card", "freq":4}]}],
>
> The correct order should be "corp" (LD: 1, freq: 2), "cord" (LD: 1, freq:
> 1) , "card" (LD: 2, freq: 4). In standalone mode, I get "corp", "cord",
> "card" with extendedResults true or false.
>
> The results are the same for the /spell and /browse request handlers in
> that configset. I've put all combinations side by side in this spreadsheet:
> https://docs.google.com/spreadsheets/d/1ym44TlbomXMCeoYpi_eOBmv6-mZHCZ0nhsVDB_dDavM/edit?usp=sharing
>
> Is it something in the configuration? Or a bug?
>
> Thomas
>


Order of spellcheck suggestions

2020-06-13 Thread Thomas Corthals
Hi

I'm seeing different ordering on the spellcheck suggestions in cloud mode
when using spellcheck.extendedResults=false vs.
spellcheck.extendedResults=true.

Solr 8.5.2 in cloud mode with 2 nodes, 1 collection with numShards = 2 &
replicationFactor = 1, techproducts configset and example data:

$ curl 'http://localhost:8983/solr/techproducts/spell?q=power%20cort&spellcheck.extendedResults=false'

"suggestion":["cord", "corp", "card"]}],

$ curl 'http://localhost:8983/solr/techproducts/spell?q=power%20cort&spellcheck.extendedResults=true'

"suggestion":[{ "word":"corp", "freq":2}, { "word":"cord", "freq":1}, {
"word":"card", "freq":4}]}],

The correct order should be "corp" (LD: 1, freq: 2), "cord" (LD: 1, freq:
1) , "card" (LD: 2, freq: 4). In standalone mode, I get "corp", "cord",
"card" with extendedResults true or false.

The results are the same for the /spell and /browse request handlers in
that configset. I've put all combinations side by side in this spreadsheet:
https://docs.google.com/spreadsheets/d/1ym44TlbomXMCeoYpi_eOBmv6-mZHCZ0nhsVDB_dDavM/edit?usp=sharing

Is it something in the configuration? Or a bug?

Thomas


Re: Fw: TolerantUpdateProcessorFactory not functioning

2020-06-09 Thread Thomas Corthals
If your XML or JSON can't be parsed, your content never makes it to the
update chain.

It looks like you're trying to index non-UTF-8 data. You can set the
encoding of your XML in the Content-Type header of your POST request.

-H 'Content-Type: text/xml; charset=GB18030'

JSON only allows UTF-8, UTF-16 or UTF-32.
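
For example (assuming your data.xml really is GB18030-encoded; adjust the charset to whatever your file actually uses):

curl 'http://localhost:7070/solr/mycore/update?update.chain=tolerant-chain&maxErrors=100' -H 'Content-Type: text/xml; charset=GB18030' --data-binary @data.xml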

Best,

Thomas

On Tue, 9 Jun 2020 at 07:11, Hup Chen wrote:

> Any idea?
> I still won't be able to get TolerantUpdateProcessorFactory working, solr
> exited at any error without any tolerance, any suggestions will be
> appreciated.
> curl "
> http://localhost:7070/solr/mycore/update?update.chain=tolerant-chain=100;
> -d @data.xml
>
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
>
> <lst name="responseHeader">
>   <arr name="errors"/>
>   <int name="maxErrors">100</int>
>   <int name="status">400</int>
>   <int name="QTime">1</int>
> </lst>
> <lst name="error">
>   <lst name="metadata">
>     <str name="error-class">org.apache.solr.common.SolrException</str>
>     <str name="root-error-class">com.ctc.wstx.exc.WstxEOFException</str>
>   </lst>
>   <str name="msg">Unexpected EOF; was expecting a close tag for element
> &lt;field&gt;
>  at [row,col {unknown-source}]: [1,8191]</str>
>   <int name="code">400</int>
> </lst>
> </response>
>
>
> 
> From: Hup Chen
> Sent: Friday, May 29, 2020 7:29 PM
> To: solr-user@lucene.apache.org 
> Subject: TolerantUpdateProcessorFactory not functioning
>
> Hi,
>
> My solr indexing did not tolerate a bad record but simply exited, even though I have
> configured TolerantUpdateProcessorFactory in solrconfig.xml.
> Please advise how I could get TolerantUpdateProcessorFactory to
> work.
>
> solrconfig.xml:
>
>  <updateRequestProcessorChain name="tolerant-chain">
>    <processor class="solr.TolerantUpdateProcessorFactory">
>      <int name="maxErrors">100</int>
>    </processor>
>    ...
>  </updateRequestProcessorChain>
>
> restarted solr before indexing:
> service solr stop
> service solr start
>
> curl "
> http://localhost:7070/solr/mycore/update?update.chain=tolerant-chain=100;
> -d @test.json
>
> The first record is a bad record in test.json, the rest were not indexed.
>
> {
>   "responseHeader":{
> "errors":[{
> "type":"ADD",
> "id":"0007264097",
> "message":"ERROR: [doc=0007264097] Error adding field
> 'usedshipping'='' msg=empty String"}],
> "maxErrors":100,
> "status":400,
> "QTime":0},
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","org.apache.solr.common.SolrException"],
> "msg":"Cannot parse provided JSON: Expected key,value separator ':':
> char=\",position=1240 AFTER='isbn\":\"4032171203\", \"sku\":\"\",
> \"title\":\"ãã³ãã¡ã¡ããã³ã \"author\"' BEFORE=':\"Sachiko
> OÃtomo\", ãã, \"ima'",
> "code":400}}
>
>


Atomic updates with add-distinct in Solr 7 cloud

2020-06-08 Thread Thomas Corthals
Hi

I'm trying to do atomic updates with an 'add-distinct' modifier in a Solr 7
cloud. It seems to behave like an 'add' and I end up with duplicate values in
my multiValued field. This only happens with multiple values for the field
in an update (cat:{"add-distinct":["a","b","d"]} exhibits this
problem, cat:{"add-distinct":"a"} doesn't). When running the same update
request with a single core, or a Solr 8 cloud, I get the expected result.

This is a minimal test case with Solr 7.7.3 in cloud mode, 2 nodes, a
collection with shard count 1 and replicationFactor 2, using the
techproducts configset.

$ curl -X POST -H 'Content-Type: text/json' '
http://localhost:8983/solr/techproducts/update?commit=true' --data-binary
'[{"id":123,cat:["a","b","c"]}]'
{
  "responseHeader":{
"rf":2,
"status":0,
"QTime":75}}

$ curl -X POST -H 'Content-Type: text/json' '
http://localhost:8983/solr/techproducts/update?commit=true' --data-binary
'[{"id":123,cat:{"add-distinct":["a","b","d"]}}]'
{
  "responseHeader":{
"rf":2,
"status":0,
"QTime":81}}

$ curl '
http://localhost:8983/solr/techproducts/select?q=id%3A123=true'
{
  "response":{"numFound":1,"start":0,"docs":[
  {
"id":"123",
"cat":["a",
  "b",
  "c",
  "a",
  "b",
  "d"],
"_version_":1668919799351083008}]
  }}

Is this a known issue or am I missing something here?

Kind regards

Thomas Corthals


Create a core from scratch through the API

2020-03-25 Thread Thomas Mortagne
Hi everyone,

I'm currently testing with Solr Standalone 8.1.1.

I have the following need: through the API Solr standalone create an
empty core and then use the schema API to add what I need. Similar to
create a sql database and then create tables (except that I need only
one table in my case) and columns. The fact that there is a Solr
schema API led me to think that creating such an empty core would
be easy but I actually can't find a proper way to do it.
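
For reference, the obvious call (the core name here is just a placeholder):

curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore"

fails unless the core's instance directory and configuration files already exist on disk.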

I cannot put each core's configuration files in the right filesystem
location before creating it through the API; I want the minimal
configuration files to be generated (or copied from a pre-existing
location on the server at worst).

I was hoping that configsets would be the answer, but it seems that
those are not read-only config templates as I expected but the actual
configuration files directly used by the core. This means any call to
the schema API unfortunately modifies those files, which modifies the
schema for all cores created from the same configset. If the configset
is not writable then Solr complains that it's not writable and saves
modifications in memory. I could not find any property to tell Solr to
copy the configset instead of using it directly, but maybe I missed it.

Is this use case not possible with Solr Standalone, or did I miss
something obvious? Is my version too old, and is there something for this in a
more recent version?

Thanks,
-- 
Thomas Mortagne


Javadocs are not linkable

2020-02-27 Thread Thomas Scheffler
Hi,

I recently noticed that the SOLR javadocs hosted by lucene are not linkable as 
the „package-list“ file is not downloadable. Is this on purpose?

$ curl https://lucene.apache.org/solr/8_4_0/solr-solrj/package-list


<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://lucene.apache.org/solr/8_4_0/solr-solrj/package-list/">here</a>.</p>
</body></html>



It’s the same issue with older versions. My maven build fails with:

MavenReportException: Error while generating Javadoc:
[ERROR] Exit code: 1 - javadoc: error - Error fetching URL: 
https://lucene.apache.org/solr/8_3_0/solr-solrj/

kind regards

Thomas

Re: Reindex Required for Merge Policy Changes?

2020-02-25 Thread Zimmermann, Thomas
Thanks so much Erick. Sounds like this should be a perfect approach to helping 
resolve our current issue.

On 2/24/20, 6:48 PM, "Erick Erickson"  wrote:

Thomas:
Yes, upgrading to 7.5+ will automagically take advantage of the 
improvements, eventually... No, you don’t have to reindex.

The “eventually” part. As you add, and particularly replace, existing 
documents, TMP will make decisions based on the new policy. If you’ve optimized 
in the past and have a very large segment (I.e. > 5G), it’ll be rewritten when 
the number of deleted docs exceeds the threshold; I don’t remember what the 
exact number is. Point is it’ll recover from having an over-large segment over 
time and _eventually_ the largest segment will be < 5G.

Absent a previous optimize making a large segment, I’d just consider 
optimizing after you’ve upgraded. The TMP revisions respect the max segment 
size, so that should purge all deleted documents from your index without 
creating a too-large one. Thereafter the number of deleted docs should remain < 
about 33%. It only really approaches that percentage when you’re updating lots 
of existing docs.

Finally, expungeDeletes is less expensive than optimize because it doesn’t 
rewrite segments with fewer than 10% deleted docs, so that's an alternative to optimizing 
after upgrading.
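
For reference, expungeDeletes can be requested as part of a commit, e.g. (the core name is a placeholder):

curl 'http://localhost:8983/solr/mycore/update?commit=true&expungeDeletes=true'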


Best,
Erick

> On Feb 24, 2020, at 5:42 PM, Zimmermann, Thomas 
 wrote:
> 
> Hi Folks –
> 
> Few questions before I tackled an upgrade here. Looking to go from 7.4 to 
7.7.2 to take advantage of the improved Tiered Merge Policy and segment cleanup 
– we are dealing with some high (45%) deleted doc counts in a few cores. Would 
simply upgrading Solr and setting the cores to use Lucene 7.7.2 take advantage 
of these features? Would I need to reindex to get existing segments merged more 
efficiently? Does it depend on the size of my current segments vs the 
configuration of the merge policy or would upgrading simply allow solr to do 
its own thing help mitigate this issue?
> 
> Also – I noticed the 7.5+ defaults to the Autoscaling for replication, 
and 8.0 defaults to legacy. Would I potentially need to make changes to my 
existing configs to ensure they stay on Legacy replication?
> 
> Thanks much!
> TZ
> 
> 
> 




Reindex Required for Merge Policy Changes?

2020-02-24 Thread Zimmermann, Thomas
Hi Folks –

Few questions before I tackled an upgrade here. Looking to go from 7.4 to 7.7.2 
to take advantage of the improved Tiered Merge Policy and segment cleanup – we 
are dealing with some high (45%) deleted doc counts in a few cores. Would 
simply upgrading Solr and setting the cores to use Lucene 7.7.2 take advantage 
of these features? Would I need to reindex to get existing segments merged more 
efficiently? Does it depend on the size of my current segments vs the 
configuration of the merge policy or would upgrading simply allow solr to do 
its own thing help mitigate this issue?

Also – I noticed the 7.5+ defaults to the Autoscaling for replication, and 8.0 
defaults to legacy. Would I potentially need to make changes to my existing 
configs to ensure they stay on Legacy replication?

Thanks much!
TZ





Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Thomas Corthals
Hi

I've run into an issue with creating a Managed Stopwords list that has the
same name as a previously deleted list. Going through the same flow with
Managed Synonyms doesn't result in this unexpected behaviour. Am I missing
something or did I discover a bug in Solr?

On a newly started solr with the techproducts core:

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist

The second PUT request results in a status 500 with error
msg "java.util.LinkedHashMap cannot be cast to java.util.List".

Similar requests for synonyms work fine, no matter how many times I repeat
the CREATE/DELETE/RELOAD cycle:

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap

Reloading after creating the Stopwords list but not after deleting it works
without error too on a fresh techproducts core (you'll have to remove the
directory from disk and create the core again after running the previous
commands).

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist

And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
can be completed twice. (Again, on a freshly created techproducts core.)
Only the third attempt to create a list results in an error. Synonyms can
still be created and deleted repeatedly after this.

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist

The same successes/errors occur when running each cycle against a different
core if the cores share the same configset.

Any ideas on what might be going wrong?


Re: sample_techproducts tutorial (8.1 guide) has wrong collectioname?

2019-06-27 Thread Thomas Egense
Thank you,
 I will fix the image to have the correct collection name. It was confusing
to show a collection overview image with a different name
than the one you see when following the tutorial.
/Thomas

On Thu, Jun 27, 2019 at 3:45 PM Alexandre Rafalovitch 
wrote:

> Actually, the tutorial does say "Here’s the first place where we’ll
> deviate from the default options." and the result name should be
> techproducts.
>
> It is the image that is no longer correct and needs to be updated. And
> perhaps the text should be made clearer.
>
> A pull request with updated image (and matching JIRA) would be most
> welcome. As would any comments on the tutorial sequence in general, as
> we haven't touched it for quite a while. In fact, if somebody wanted
> to flesh out the whole tutorial sequence to be more in line with the
> recent Solr features.
>
> Regards,
>Alex.
>
> On Thu, 27 Jun 2019 at 07:42, Thomas Egense 
> wrote:
> >
> > Solr 8.1 tutorial:
> > https://lucene.apache.org/solr/guide/8_1/solr-tutorial.html
> >
> > Following the guide to where you have created the collection and checking
> > the
> > admin page, you get the same picture as shown in
> > "Figure 1. SolrCloud Diagram"
> > (collectionname = gettingstarted) <---
> >
> > Next step is indexing the tech-products samples:
> > solr-8.1.0:$ bin/post -c techproducts example/exampledocs/*
> >
> > But this fails, since the collectionname is "gettingstarted"
> >
> > Instead you have to index with
> > bin/post -c gettingstarted example/exampledocs/*
> >
> > In earlier tutorials the collection name  was indeed "techproducts", so
> it
> > is
> > the collection name that has changed.
> >
> > Is it just me doing something wrong? It is hard to believe such an obvious
> > error
> > has not been corrected yet? It seems the 7.1 tutorial has the same error.
> >
> > /Thomas Egense
>


sample_techproducts tutorial (8.1 guide) has wrong collectioname?

2019-06-27 Thread Thomas Egense
Solr 8.1 tutorial:
https://lucene.apache.org/solr/guide/8_1/solr-tutorial.html

Following the guide to where you have created the collection and checking
the
admin page, you get the same picture as shown in
"Figure 1. SolrCloud Diagram"
(collectionname = gettingstarted) <---

Next step is indexing the tech-products samples:
solr-8.1.0:$ bin/post -c techproducts example/exampledocs/*

But this fails, since the collectionname is "gettingstarted"

Instead you have to index with
bin/post -c gettingstarted example/exampledocs/*

In earlier tutorials the collection name  was indeed "techproducts", so it
is
the collection name that has changed.

Is it just me doing something wrong? It is hard to believe such an obvious
error
has not been corrected yet? It seems the 7.1 tutorial has the same error.

/Thomas Egense


Re: Inconsistent debugQuery score with multiplicative boost

2019-01-16 Thread Thomas Aglassinger
On 04.01.19, 09:11, "Thomas Aglassinger"  wrote:

>  When debugging a query using multiplicative boost based on the product() 
> function I noticed that the score computed in the explain section is correct 
> while the score in the actual result is wrong.

We digged into this further and seem to have found the culprit. 

The last working version is Solr 7.2.1. Using git bisect we found out that the 
issue got introduced with LUCENE-8099 (a refactoring). There's two changes that 
break the scoring in different ways:

LUCENE-8099: Deprecate CustomScoreQuery, BoostedQuery, BoostingQuery
LUCENE-8099: Replace BoostQParserPlugin.boostQuery() with 
FunctionScoreQuery.boostByValue()

Reverting parts of these changes to the previous version based on a deprecated 
class (which LUCENE-8099 cleaned up) seems to fix the issue.

We created a Solr issue to document our current findings and changes: 
https://issues.apache.org/jira/browse/SOLR-13126

It contains a patch for our experimental fix (which currently is in a rough 
state) and a test case that can reproduce the issue starting with Solr 7.3 up 
to the current master.

A proper fix of course would not revert to deprecated classes again but fix 
whatever went wrong during LUCENE-8099. 

Hopefully someone with a deeper understanding of the mechanics behind this can 
take a look.

Best regards, Thomas.




Re: Questions for SynonymGraphFilter and WordDelimiterGraphFilter

2019-01-07 Thread Thomas Aglassinger
Hi Wei,

here's a fairly simple field type we currently use in a project that seems to 
do the job with graph synonyms. Maybe this helps as a starting point for you:














As you can see we use the same filters for both indexing and query, so this 
might have some impact on positional queries but so far it seems negligible for 
the short synonyms we use in practice. Also there is no need for the 
FlattenGraphFilter.

The WhitespaceTokenizerFactory ensures that you can define synonyms with 
hyphens like mac-book -> macbook.

Best regards, Thomas.


On 05.01.19, 02:11, "Wei"  wrote:

Hello,

We are upgrading to Solr 7.6.0 and noticed that SynonymFilter and
WordDelimiterFilter have been deprecated. Solr doc recommends to use
SynonymGraphFilter and WordDelimiterGraphFilter instead 
I guess the StopFilter mess up the SynonymGraphFilter output? Not sure
if  it's a solr defect or there is a guideline that StopFilter should
not be put after graph filters.

Thanks in advance for you input.


Thanks,

Wei




Inconsistent debugQuery score with multiplicative boost

2019-01-04 Thread Thomas Aglassinger
Hi!

When debugging a query using multiplicative boost based on the product() 
function I noticed that the score computed in the explain section is correct 
while the score in the actual result is wrong.

As an example here’s a simple query that boosts a field name_text_de 
(containing German product names). The term “Netzteil” boost to 200% and “Sony” 
boosts to 300%. A name that contains both terms would be boosted to 600%. If a 
term does not match, a default pseudo boost of 1 is used (multiplicative 
identity). The params of the responseHeader in the query result are:

"q":"{!boost b=$ymb}(+{!lucene v=$yq})",
"ymb":"product(query({!v=\"name_text_de\\:Netzteil\\^=2.0\"},1),query({!v=\"name_text_de\\:Sony\\^=3.0\"},1))",
"yq":"*:*",

The parsed query of the ymb parameter translates to:

FunctionScoreQuery(FunctionScoreQuery(+*:*, scored by 
boost(product(query((ConstantScore(name_text_de:netzteil))^2.0,def=1.0),query((ConstantScore(name_text_de:sony))^3.0,def=1.0)

For a product that contains both terms, the score in the result and explain 
section correctly yields 6.0:

"name_text_de":"Original Sony Vaio Netzteil",
"score":6.0,

6.0 = product of:
  1.0 = boost
  6.0 = product of:
1.0 = *:*
6.0 = 
product(query((ConstantScore(name_text_de:netzteil))^2.0,def=1.0)=2.0,query((ConstantScore(name_text_de:sony))^3.0,def=1.0)=3.0)

However, for a product with only “Netzteil” in the name, the result score 
wrongly is 1.0 while the explain score correctly is 2.0:

"name_text_de":"GS-Netzteil 20W schwarz",
"score":1.0,

2.0 = product of:
  1.0 = boost
  2.0 = product of:
1.0 = *:*
2.0 = 
product(query((ConstantScore(name_text_de:netzteil))^2.0,def=1.0)=2.0,query((ConstantScore(name_text_de:sony))^3.0,def=1.0)=1.0)

(Note: the filter chain splits words on hyphen so the “GS-“ in front of the 
“Netzteil” should not be an issue.)

Here’s the complete filter chain for the text_de field type:

[analyzer chain for the text_de field type]
Interestingly if I simplify the query to only boost on “Netzteil”, the score in 
both the result and explain section are correctly 2.0.

I reproduced this with a local Solr 7.5.0 server (no sharding, no replica) on 
Mac OS X 10.14.1.

I found mention of a somewhat similar situation with BooleanQuery, which was 
considered a bug and fixed in 2016: 
https://issues.apache.org/jira/browse/LUCENE-7132

So my questions are:

1. Is there something wrong in my query that prevents the “Netzteil”-only 
product to get a score of 2.0?
2. Shouldn’t the score in the result and the explain section always be the same?

Best regards,
Thomas


Expression Evaluation

2018-12-06 Thread Thomas L. Redman
I suspect nobody wants to broach this topic, this has to have come up before, 
but I can not find an authoritative answer. How does the Standard Query Parser 
evaluate boolean expressions? I have three fields, content, status and 
source_name. The expression

content:bement AND status:relevant

yields 111 documents. The expression

source_name:Web

yields 78050168 documents. However, the expression

content:bement AND status:relevant OR source_name:Web

yields 111 documents. Can anybody describe the order of operation, operator 
priorities used in evaluating the above expression? It looks to me as if it 
takes the intersection of content:bement and status:relevant, then limits 
successive set operators to that set. Is that true? So any additional “OR” 
expressions will have no effect?
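
For reference, the two explicitly parenthesized readings I can imagine (same fields, just added parentheses):

(content:bement AND status:relevant) OR source_name:Web
content:bement AND (status:relevant OR source_name:Web)

The observed 111 documents clearly doesn't match the first form, which should return at least the 78 million Web documents.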

Re: Documentation on SolrJ

2018-12-01 Thread Thomas L. Redman
Hi Jason. You Solr folks are really on top of things, I thank you Cassandra and 
Shawn for all the excellent support. 

Short story, I can wait. I am building a 1.0 version of a new tool to query our 
very complex and large (100M docs) datastore, not to find individual documents, 
but to find subsets of the data suitable for end users (mostly Social Science 
researchers). As soon as we get to 7.6/8.0, I will work toward a 1.1 release to 
include the improved grouping, nested faceting and so on.  To know this is even 
in the pipe makes my day. 

You guys are in need of more documentation. I hope I’m not hurting any 
feelings, that is not my intention. Solr is a top shelf product, and I would 
not be one to minimize all the hard work. I think I agree with you Jason, some 
additions to the existing tutorial to cover more complex query capabilities 
would probably do the trick. I don’t think you need 600 pages like the Solr Ref 
Guide document. This will make more sense to do when we get to the 8.0 release 
(or the next release including JSON API support). I retire next year, may have 
some free time to build a more extensive query exemplar and document that. Is 
there a formal procedure I need to adhere to if I want to contribute?



> On Nov 30, 2018, at 10:40 AM, Jason Gerlowski  wrote:
> 
> Hi Thomas,
> 
> I recently added a first pass at JSON faceting support to SolrJ.  The
> main classes are "JsonQueryRequest" and "DirectJsonQueryRequest" and
> live in the package "org.apache.solr.client.solrj.request.json"
> (https://github.com/apache/lucene-solr/tree/master/solr/solrj/src/java/org/apache/solr/client/solrj/request/json).
> I've also added examples of how to use this code on the "JSON
> Faceting" page in the Solr ref guide.  Unfortunately, since this is a
> recent addition it hasn't been released yet.  These classes will be in
> the next 7x release (if there is one), or in 8.0 when that arrives.
> This probably isn't super helpful for you.
> 
> Without this code, you have a few options:
> 
> 1. If the facet requests you'd like to make are relatively
> structured/similar, you can subclass QueryRequest and override
> getContentWriter().  "ContentWriters" are the abstraction SolrJ is
> using to write out the request body.  So you can trivially implement
> getContentWriter to wrap a hardcoded string with some templated
> variables. If interested, also checkout
> "RequestWriter.StringPayloadContentWriter".  This'll be sufficient for
> very cookie cutter facet requests, where maybe only a few parameters
> change but nothing else.
> 2. If hardcoding a string JSON body is too inflexible, the JSON
> faceting API is "just query params" like everything else.  You can
> build your facet request and attach it to the request as a SolrParams
> entry.  Doing this wouldn't be the most fun code to write, but it's
> always possible.
> 3. You can copy-paste the unreleased JSON faceting helper classes I
> mentioned above into your codebase.  They're not released in SolrJ but
> you can still use them by copying them locally and using those copies
> until you're able to use a SolrJ that contains these classes.  If you
> go this route, please let me or someone else in the community know
> your thoughts.  Their being unreleased makes them a bit more of a pain
> to use, but it also gives us an opportunity to iterate and improve
> them before a release comes and ties us to the existing (maybe awful)
> interfaces.
> 
>> It would be wonderful if a document of this caliber was provided solely for 
>> SolrJ in the form of a tutorial.
> We definitely need more "SolrJ Examples" coverage, though I'm not sure
> the best way to expose/structure that.  Solr has a *ton* of API
> surface area, and SolrJ is responsible for covering all of it.  Even
> if I imagine a SolrJ version of the standard "Getting Started"
> tutorial which shows users how to create a collection, index docs, do
> a query, and do a faceting request...that'd only cover a fraction of
> what's out there.  It might be easier to scale our SolrJ examples by
> integrating them into the pages we already have for individual APIs
> instead.  I'm all for a SolrJ tutorial, or SolrJ Cookbook sort of
> thing if you like those ideas better though, and would also volunteer
> to help edit or review things in that area.
> 
> Sorry, this got a little long.  But hope that helps.
> 
> Best,
> 
> Jason
> On Fri, Nov 30, 2018 at 11:31 AM Cassandra Targett
>  wrote:
>> 
>> Support for the JSON Facet API in SolrJ was very recently committed via 
>> https://issues.apache.org/jira/browse/SOLR-12965 
>> <https://issues.apache.org/jira/browse/SOLR-12965>. This missed the cut-off 
>> for 7.6 

Re: Documentation on SolrJ

2018-11-30 Thread Thomas L. Redman
Hi Shawn, thanks for the prompt reply!

> On Nov 29, 2018, at 4:55 PM, Shawn Heisey  wrote:
> 
> On 11/29/2018 2:01 PM, Thomas L. Redman wrote:
>> Hi! I am wanting to do nested facets/Grouping/Expand-Collapse using SolrJ, 
>> and I can find no API for that. I see I can add a pivot field, I guess to a 
>> query in general, but that doesn’t seem to work at all, I get an NPE. The 
>> documentation on SolrJ is sorely lacking, the documentation I have found is 
>> less than a readme. Are there any books that provide a good treatise on 
>> SolrJ specifically? Does SolrJ support these more advanced features?
> 
> I don't have any specific details for that use case.

Check out page 498 of the PDF, that includes a brief but powerful discussion of 
the JSON Facet API. For just one example, I am interested in faceting a nominal 
field within a date range bucket. Example: I want to facet publication_date 
field into YEAR buckets, and within each YEAR bucket, facet on author to get 
the most prolific authors in that year, AND to also facet genre with the same 
bucket to find out how much scifi, adventure and so on was produced that year. 
From what I am seeing, beyond pivots (and pivots won’t support this specific use 
case), I don’t see that this capability is supported by the SolrJ API, but this is a 
hugely powerful feature, and needs to be supported.
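
Roughly what I have in mind, expressed as a JSON Facet API request body (the date bounds and limits here are only illustrative):

{
  "query": "*:*",
  "facet": {
    "per_year": {
      "type": "range",
      "field": "publication_date",
      "start": "2000-01-01T00:00:00Z",
      "end": "2020-01-01T00:00:00Z",
      "gap": "+1YEAR",
      "facet": {
        "top_authors": { "type": "terms", "field": "author", "limit": 10 },
        "genres": { "type": "terms", "field": "genre" }
      }
    }
  }
}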

Furthermore, I want to be able to support a vast range of facets within a 
single query, perhaps including some collapse and expand, groupings and so on.

> 
> If you share the code that gives you NPE, somebody might be able to help you 
> get it working.

I haven’t looked in to this enough to drop it in somebody elses' lap at this 
point, I suspect I am not using the API correctly. And since this won’t allow 
what I want, I’m not too worried about it.

> 
> The best place to find documentation for SolrJ is actually SolrJ itself -- 
> the javadocs.  Much of that can be accessed pretty easily if you are using an 
> IDE to do your development.  Here is a link to the top level of the SolrJ 
> javadocs:
> 
> https://lucene.apache.org/solr/7_5_0/solr-solrj/index.html 
> <https://lucene.apache.org/solr/7_5_0/solr-solrj/index.html>

The JavaDocs are limited. I surmise from tracing the code a bit though that I 
need to rely less on methods provided directly by SolrQuery, and add parameters 
using methods of the superclasses more frequently. Those superclass methods simply 
add key-value pairs. Still not sure this will allow me the flexibility I 
need, particularly if the JSON Facet API is not supported.

> 
> There's some documentation here, in the official reference guide:
> 
> https://lucene.apache.org/solr/guide/7_5/using-solrj.html 
> <https://lucene.apache.org/solr/guide/7_5/using-solrj.html>

This is an excellent document. It would be wonderful if a document of this 
caliber was provided solely for SolrJ in the form of a tutorial. The existing 
online tutorial says nothing about how to do anything beyond a simple query. I 
notice in this document most of the examples of how to issue queries, for 
example, use curl to issue the query. Simply put, this is not a practical approach 
for the typical user. That being the case, people need to build real UIs around 
applications that hide the intricacies of the search API. I would rather not 
build my own API, since SolrJ is already in place, and seems quite powerful. I 
have been using it for a few years, but really just to do queries.

I might be interested in contributing to such a document, provided it is 
sufficiently succinct. I find myself quite busy these days. But I think I would 
really have to ramp up my understanding of SolrJ to be of any use. Is there any 
such document in the works, or any interested parties? I am NOT a good writer, 
I would need somebody to review my work for both accuracy and grammar.

Also, is the JSON API supported by SolrJ, or is there any plan to support it?

Documentation on SolrJ

2018-11-29 Thread Thomas L. Redman
Hi! I am wanting to do nested facets/Grouping/Expand-Collapse using SolrJ, and 
I can find no API for that. I see I can add a pivot field, I guess to a query 
in general, but that doesn’t seem to work at all, I get an NPE. The 
documentation on SolrJ is sorely lacking, the documentation I have found is 
less than a readme. Are there any books that provide a good treatise on SolrJ 
specifically? Does SolrJ support these more advanced features?

Re: CloudSolrClient produces tons of CLUSTERSTATUS commands against single server in Cloud

2018-11-06 Thread Zimmermann, Thomas
Hi Shawn,

We're equally impressed by how well the server is handling it. We're using
Sematext for monitoring and the load on the box has been steady under 1
and not entering a swap state memory wise.

We are 100% certain the traffic is coming from the 3 web hosts running
this code. We have put some custom logging in place that logs all requests
to an access style log and stores that data in kibana/logstash. In
logstash we are able to confirm that all these requests (~40million in the
last 12 hours) are coming from our web front ends directly to a single box
in the cluster.

Our client code is on separate servers from our solr servers, and zk has
its own boxes as well.

Here's a scrubbed pastebin of our cluster status response from that machine
that is getting all the traffic; I pulled this via browser on my local
machine.
https://pastebin.com/42haKVME

We can attempt to update the SolrJ dependency on our lower env and see if
that fixes the problem if you think that's a good course of action, but we
are also in the midst of switching over to HTTP Client to resolve the
production issues we are seeing ASAP, so I can't promise a timeline. If
you think there's a chance that will fix this, we could of course give it
a quick go.


-TZ



On 11/6/18, 12:35 PM, "Shawn Heisey"  wrote:

>On 11/6/2018 10:12 AM, Zimmermann, Thomas wrote:
>> Shawn -
>>
>> Server performance is fine and request time are great. We are tolerating
>> the level of traffic, but the server that is taking all the hits is
>> obviously performing a bit slower than the others. Response times are
>> under 5MS avg for queries on all servers, which is within our perf
>> thresholds.
>
>I was asking specifically about the clusterstatus requests -- whether
>the response looks complete if you manually execute the same request and
>whether it returns quickly.  And I'd like to see the solr.log where
>these are happening.
>
>Knowing that requests in general are performing well is good info,
>although I have no idea how that is possible on the node that is getting
>over a thousand clusterstatus requests per second.  I would expect that
>node to be essentially dead under that much load.  Since it's apparently
>handling it fine ... that's really impressive.
>
>> We are running 7.4 on the client and server side, moving to 7.5 was
>> troublesome for us so we are holding off for the time being.
>
>I was hoping you could just upgrade the SolrJ client, which would
>involve either replacing the solrj jar or bumping the version number in
>the config for a dependency manager (things like ivy, maven, gradle,
>etc).  A 7.5 client should be pretty safe against 7.4 servers.  The
>client would be newer than the server and very close to the same
>version, which is the general recommendation for CloudSolrClient when
>the two versions cannot be identical for some reason.
>
>Are you absolutely sure that those requests are coming from the program
>with CloudSolrClient?  To find out, you'll need to enable the request
>log in jetty.xml (it just needs to be un-commented) and restart the
>server.  The source address is not logged in solr.log.  It's very
>important to be absolutely sure where the requests are coming from.  If
>you're running the client code on the same machine as one of your Solr
>servers, it will be difficult to be sure about the source, so I would
>definitely suggest running the client code on a completely different
>machine, so the source addresses in the request log are useful.
>
>Thanks,
>Shawn
>



Re: CloudSolrClient produces tons of CLUSTERSTATUS commands against single server in Cloud

2018-11-06 Thread Zimmermann, Thomas
I should mention I'm also hanging out in the Solr IRC Channel today under
the nick "apatheticnow" if anyone wants to follow up in real time during
business hours EST.

On 11/6/18, 11:39 AM, "Shawn Heisey"  wrote:

>On 11/6/2018 9:06 AM, Zimmermann, Thomas wrote:
>> For example - 75k request per minute going to this one box, and 3.5k
>>RPM to all other nodes in the cloud.
>>
>> All of those extra requests on the one box are
>>"/solr/admin/collections?collection=collectionName=CLUSTERSTATUS
>>t=javabin=2"
>
>That sounds like either a bug or some kind of problem in your setup.
>Over a thousand requests per second will overwhelm a single Solr node,
>even if the info can be satisfied entirely from memory and doesn't
>require complex calculations or large-scale data retrieval like a
>regular query does.
>
>If you manually execute that request, do you get a response, and does it
>return quickly or take a significant amount of time?  If the request
>itself has problems, maybe CloudSolrClient is repeating it frequently
>because it's not getting the info it's after.  Can you share the full
>log entry from solr.log for one of those requests?
>
>I try to keep an eye on things with CloudSolrClient, but I have very
>limited experience with it.  I cannot imagine that the behavior you're
>seeing is normal.  It sounds very wrong to me.
>
>Since I do not know all that much about how CloudSolrClient's background
>threads work, I cannot say for sure whether it's a bug or a problem with
>your setup.  Can you try upgrading the Solr jars in your client app to
>7.5.0 and see if that makes any difference?  What version of Solr are
>you running on the server side?
>
>> Our plan right now is to roll back to the basic HTTP client and pass
>>all traffic through our load balancer, but would like to understand if
>>this is an expected interaction for the Cloud Client, a misconfiguration
>>on our end, or a bug
>
>At least you have that as an option!  Some people might not be able to
>do that.
>
>Thanks,
>Shawn
>



Re: CloudSolrClient produces tons of CLUSTERSTATUS commands against single server in Cloud

2018-11-06 Thread Zimmermann, Thomas
Erik - 

This box did have all the leaders for the dozen or so collections we have
when the cloud spun up. We were able to force the leaders for other cores
onto other nodes using the apis, but did not see this traffic load migrate
to the new hosts when leadership changed. All nodes are NRT. The requests
are 99% queries to load content on the web front ends, a few intermittent
updates with comments, new content creation, etc.

Jason - 

1. We are instantiating the cloud client with our VIP load balancer URL (see
the sketch after this list). We ran into a memory leak issue when passing in
ZK server addresses that forced this path.
2. No, we did not tweak any cache TTLs.
3. This codebase interacts with three collections in our cloud, and we are
seeing CLUSTERSTATUS checks for all 3.
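
For reference, a minimal sketch of the two construction paths mentioned in (1),
assuming SolrJ 7.4's builders; the hostnames and the collection name below are
placeholders, not real values:

// Minimal sketch (SolrJ 7.4.x); hostnames and collection name are placeholders.
import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class ClientSetupSketch {
  public static void main(String[] args) throws Exception {
    // Option A: hand the builder Solr URLs (e.g. a VIP/load balancer endpoint).
    // Note: built this way, the client fetches cluster state over HTTP from the
    // given URLs, which can show up as CLUSTERSTATUS requests to those nodes.
    try (CloudSolrClient viaUrl = new CloudSolrClient.Builder(
        Collections.singletonList("http://solr-vip.example.com:8983/solr")).build()) {
      viaUrl.setDefaultCollection("collectionName");
      System.out.println(viaUrl.query(new SolrQuery("*:*")).getResults().getNumFound());
    }

    // Option B: hand the builder ZooKeeper hosts and let it watch cluster state itself.
    try (CloudSolrClient viaZk = new CloudSolrClient.Builder(
        Collections.singletonList("zk1.example.com:2181"), Optional.empty()).build()) {
      viaZk.setDefaultCollection("collectionName");
      System.out.println(viaZk.query(new SolrQuery("*:*")).getResults().getNumFound());
    }
  }
}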

Shawn -

Server performance is fine and request times are great. We are tolerating
the level of traffic, but the server that is taking all the hits is
obviously performing a bit slower than the others. Response times are
under 5MS avg for queries on all servers, which is within our perf
thresholds.

We are running 7.4 on the client and server side, moving to 7.5 was
troublesome for us so we are holding off for the time being.

Thanks,
TZ



On 11/6/18, 11:39 AM, "Shawn Heisey"  wrote:

>On 11/6/2018 9:06 AM, Zimmermann, Thomas wrote:
>> For example - 75k request per minute going to this one box, and 3.5k
>>RPM to all other nodes in the cloud.
>>
>> All of those extra requests on the one box are
>>"/solr/admin/collections?collection=collectionName=CLUSTERSTATUS
>>t=javabin=2"
>
>That sounds like either a bug or some kind of problem in your setup.
>Over a thousand requests per second will overwhelm a single Solr node,
>even if the info can be satisfied entirely from memory and doesn't
>require complex calculations or large-scale data retrieval like a
>regular query does.
>
>If you manually execute that request, do you get a response, and does it
>return quickly or take a significant amount of time?  If the request
>itself has problems, maybe CloudSolrClient is repeating it frequently
>because it's not getting the info it's after.  Can you share the full
>log entry from solr.log for one of those requests?
>
>I try to keep an eye on things with CloudSolrClient, but I have very
>limited experience with it.  I cannot imagine that the behavior you're
>seeing is normal.  It sounds very wrong to me.
>
>Since I do not know all that much about how CloudSolrClient's background
>threads work, I cannot say for sure whether it's a bug or a problem with
>your setup.  Can you try upgrading the Solr jars in your client app to
>7.5.0 and see if that makes any difference?  What version of Solr are
>you running on the server side?
>
>> Our plan right now is to roll back to the basic HTTP client and pass
>>all traffic through our load balancer, but would like to understand if
>>this is an expected interaction for the Cloud Client, a misconfiguration
>>on our end, or a bug
>
>At least you have that as an option!  Some people might not be able to
>do that.
>
>Thanks,
>Shawn
>



CloudSolrClient produces tons of CLUSTERSTATUS commands against single server in Cloud

2018-11-06 Thread Zimmermann, Thomas
Question about CloudSolrClient and CLUSTERSTATUS. We just deployed a 3 server 
ZK cluster and a 5 node solr cluster using the CloudSolrClient in Solr 7.4.

We're seeing a TON of traffic going to one server with just cluster status 
commands. Every single query seems to be hitting this box for status, but the 
rest of the query load is divided evenly amongst the servers. Is this an 
expected interaction in this client?

For example - 75k request per minute going to this one box, and 3.5k RPM to all 
other nodes in the cloud.

All of those extra requests on the one box are 
"/solr/admin/collections?collection=collectionName=CLUSTERSTATUS=javabin=2"

Our plan right now is to roll back to the basic HTTP client and pass all 
traffic through our load balancer, but would like to understand if this is an 
expected interaction for the Cloud Client, a misconfiguration on our end, or a 
bug


Data Import Handler with Solr Source behind Load Balancer

2018-09-11 Thread Zimmermann, Thomas
We have a Solr v7 Instance sourcing data from a Data Import Handler with a Solr 
data source running Solr v4. When it hits a single server in that instance 
directly, all documents are read and written correctly to the v7. When we hit 
the load balancer DNS entry, the resulting data import handler json states that 
it read all the documents and skipped none, and all looks fine, but the result 
set is missing ~20% of the documents in the v7 core. This has happened multiple 
time on multiple environments.

Any thoughts on whether this might be a bug in the underlying DIH code? I'll 
also pass it along to the server admins on our side for input.


Re: Spring Content Error in Plugin

2018-08-28 Thread Zimmermann, Thomas
In case anyone else runs into this, I tracked it down. I had to force Maven to
explicitly include all of its dependent jars in the plugin jar using the
assembly plugin in the pom like so:

<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>2.5.3</version>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>

Cheers!

TZ

From: Tom <tzimmerm...@techtarget.com>
Date: Monday, August 27, 2018 at 11:32 PM
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Subject: Spring Content Error in Plugin

Hi,

We have a custom java plugin that leverages the UpdateRequestProcessorFactory 
to push data to multiple cores when a single core is written to. We are 
building the plugin with maven, deploying it to /solr/lib and sourcing the jar 
via a lib directive in our solr config. It currently works correctly in our 
Solr 5.x cluster.

In Solr 7 when attempting to create the core, the plugin is failing with the 
long stack trace later in this post, but it seems to boil down to Solr not
finding the Spring Context jar (Caused by: java.lang.ClassNotFoundException:
org.springframework.context.ConfigurableApplicationContext).

The jar is imported in the file, and Maven has a dependency to bring it in. The
Maven build works perfectly; the dependency is resolved and the jar is generated.

Any ideas on a starting point for tracking this down? I've dug through a bunch
of Stack Overflow posts with the same issue, but none directly tied to Solr, and
had no luck.

Thanks!

POM

<dependency>
  <groupId>org.springframework</groupId>
  <artifactId>org.springframework.context</artifactId>
  <version>3.2.2.RELEASE</version>
</dependency>


Error

ERROR - 2018-08-28 03:15:54.253; [c:vignette_de s:shard1 r:core_node5 
x:vignette_de_shard1_replica_n2] org.apache.solr.handler.RequestHandlerBase; 
org.apache.solr.common.SolrException: Error CREATEing SolrCore 
'vignette_de_shard1_replica_n2': Unable to create core 
[vignette_de_shard1_replica_n2] Caused by: 
org.springframework.context.ConfigurableApplicationContext

at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1084)

at 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:94)

at 
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:380)

at 
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:395)

at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180)

at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)

at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)

at 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)

at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)

at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)

at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)

at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)

at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)

at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)

at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)

at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)

at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)

at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)

at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)

at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)

at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)

at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)

at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)

at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)

at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)

at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)

at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)

at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)

at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)

at org.eclipse.jetty.server.Server.handle(Server.java:531)

at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)

at 

Spring Content Error in Plugin

2018-08-27 Thread Zimmermann, Thomas
Hi,

We have a custom java plugin that leverages the UpdateRequestProcessorFactory 
to push data to multiple cores when a single core is written to. We are 
building the plugin with maven, deploying it to /solr/lib and sourcing the jar 
via a lib directive in our solr config. It currently works correctly in our 
Solr 5.x cluster.
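
For readers unfamiliar with the pattern, a rough skeleton of this kind of factory
follows (Solr 7.x API); the class name and the forwarding step are made up for
illustration and are not the actual plugin:

// Rough skeleton of an UpdateRequestProcessorFactory (Solr 7.x API).
// Class name and forwarding details are illustrative placeholders.
package com.example.solr;

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class MultiCoreForwardingProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        // ... forward a copy of "doc" to the other cores here (e.g. via a SolrClient) ...
        super.processAdd(cmd); // continue the normal chain for the core being written
      }
    };
  }
}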

In Solr 7 when attempting to create the core, the plugin is failing with the 
long stack trace later in this post, but it seems to boil down to Solr not
finding the Spring Context jar (Caused by: java.lang.ClassNotFoundException:
org.springframework.context.ConfigurableApplicationContext).

The jar is imported in the file, and Maven has a dependency to bring it in. The
Maven build works perfectly; the dependency is resolved and the jar is generated.

Any ideas on a starting point for tracking this down? I've dug through a bunch
of Stack Overflow posts with the same issue, but none directly tied to Solr, and
had no luck.

Thanks!

POM

<dependency>
  <groupId>org.springframework</groupId>
  <artifactId>org.springframework.context</artifactId>
  <version>3.2.2.RELEASE</version>
</dependency>


Error

ERROR - 2018-08-28 03:15:54.253; [c:vignette_de s:shard1 r:core_node5 
x:vignette_de_shard1_replica_n2] org.apache.solr.handler.RequestHandlerBase; 
org.apache.solr.common.SolrException: Error CREATEing SolrCore 
'vignette_de_shard1_replica_n2': Unable to create core 
[vignette_de_shard1_replica_n2] Caused by: 
org.springframework.context.ConfigurableApplicationContext

at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1084)

at 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:94)

at 
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:380)

at 
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:395)

at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180)

at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)

at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)

at 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)

at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)

at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)

at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)

at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)

at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)

at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)

at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)

at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)

at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)

at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)

at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)

at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)

at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)

at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)

at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)

at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)

at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)

at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)

at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)

at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)

at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)

at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)

at org.eclipse.jetty.server.Server.handle(Server.java:531)

at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)

at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)

at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)

at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)

at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)

at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)

at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)

at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)

at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)

at 

Re: Copyto with DIH Interpreting string as MultiValued field on copy

2018-08-18 Thread Zimmermann, Thomas
Makes total sense. Thanks to both of you for the clarification!

On 8/18/18, 8:03 AM, "Alexandre Rafalovitch"  wrote:

>And part of the issue is that SolrEntityProcessor does not take individual
>field definitions. So that part is ignored and instead just 'fl' mapping
>is
>used as Shawn explained.
>
>So you could also remap authorText in that definition to an ignored field.
>See
>https://github.com/apache/lucene-solr/blob/master/solr/example/example-DIH
>/solr/solr/conf/solr-data-config.xml
>
>Regards,
>Alex
>
>On Fri, Aug 17, 2018, 11:50 PM Shawn Heisey,  wrote:
>
>> On 8/17/2018 6:15 PM, Zimmermann, Thomas wrote:
>> > I'm trying to track down an odd issue I'm seeing when using the
>> SolrEntityProcessor to seed some test data from a solr 4.x cluster to a
>> solr 7.x cluster. It seems like strings are being interpreted as
>> multivalued when passed from a string field to a text field via the
>>copyTo
>> directive. Any clever ideas how to resolve this?
>>
>> What's happening is deceptively simple.
>>
>> In the source system, you're copying from author to authorText.  Both
>> fields are stored.  So if you have "Jeff Hartley" in author, you also
>> have "Jeff Hartley" in authorText. So what's happening is that when the
>> destination system imports from the source system, it gets "Jeff
>> Hartley" in both fields, and then copyField says "put a copy of what's
>> in author into authorText" ... and suddenly there are two copies of
>> "Jeff Hartley" in authorText.
>>
>> There are two ways to deal with this:
>>
>> 1) In the query you're doing with SolrEntityProcessor, add an "fl"
>> parameter and list all the fields *except* authorText and any other
>> field where this same problem is happening.
>>
>> 2) Remove the copyField from the schema until after the import from the
>> source server is done.
>>
>> Thanks,
>> Shawn
>>
>>



Copyto with DIH Interpreting string as MultiValued field on copy

2018-08-17 Thread Zimmermann, Thomas
Hi,

I’m trying to track down an odd issue I’m seeing when using the 
SolrEntityProcessor to seed some test data from a solr 4.x cluster to a solr 
7.x cluster. It seems like strings are being interpreted as multivalued when 
passed from a string field to a text field via the copyTo directive. Any clever 
ideas how to resolve this?

Schema:


Fields and CopyTo








Text fieldtype declaration:








































DIH Config:





http://cluster.solr.eng.techtarget.com/solr/vignette "

query="*:*"

fl="*,orig_version_l:_version_">



















Error:


org.apache.solr.common.SolrException: ERROR: 
[doc=d751e434c69b6210VgnVCM100d01c80aRCRD] Error adding field 
'author'='Jeff Hartley' msg=Multiple values encountered for non multiValued 
copy field authorText: Jeff Hartley

at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:203) 
~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:101)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.DirectUpdateHandler2.updateDocument(DirectUpdateHandler2.java:980)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:971)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:348)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:284)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:234)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:950)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1168)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:633)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at 
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:80) 
~[?:?]

at 
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:258)
 ~[?:?]

at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:527)
 ~[?:?]

at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
 ~[?:?]

at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330) 
~[?:?]

at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233) 
~[?:?]

at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
 ~[?:?]

at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483) 
~[?:?]

at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
 ~[?:?]

at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]

Caused by: org.apache.solr.common.SolrException: Multiple values encountered 
for non multiValued copy field authorText: Jeff Hartley

at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:180) 
~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]

... 22 more



Re: Is Running the Same Filters on Index and Query Redundant?

2018-08-15 Thread Zimmermann, Thomas
Hi Andrea,

Thanks so much. I wasn't thinking in the correct perspective on the query
portion of the analyzer, but your explanation makes perfect sense. In my
head I imagine the result set of the query being transformed by the
filters, but in actuality the filter is being applied to the query itself
before processing. This makes sense on my end and I think it answer my
questions. 

Excellent point on the HTML strip factory. I'll evaluate our use cases.

This was all brought about by switching from the deprecated synonym and
word delimiter factories to the new graph based factories, where we
stopped filtering on insert for those and switched to filtering on query
based on recommendations from the Solr Doc.

Thanks,
TZ

On 8/15/18, 3:17 PM, "Andrea Gazzarini"  wrote:

>Hi Thomas,
>as you know, the two analyzers play in a different moment, with a
>different input and a different goal for the corresponding output:
>
>  * index analyzer: input is a field value, output is used for building
>the index
>  * query analyzer: input is a (user) query string, output is used for
>building a (Solr) query
>
>At index time a term dictionary is built, and a retrieval time the
>output query tries to find a match in that dictionary. I wouldn't call
>it "redundancy" because even if the filter is the same, it is applied to
>a different input and it has a different goal.
>
>Some filters must be present both at index at query time because
>otherwise you won't find any match: if you put a lowercase filter only
>on the index side, queries with uppercase chars won't find any match.
>Some others don't (one example is the SynonymGraphFilter you've used
>only at query time). In general, everything depends on your needs and
>it's perfectly valid to have symmetric (index analyzer = query analyzer)
>and asymmetric text analysis (index analyzer != query analyzer).
>
>Without knowing your context is very hard to guess if there's something
>wrong in the configuration. What is the part of the analyzers you think
>is redundant?
>
>On top of that: in your chain the HTMLStripCharFilterFactory applied at
>query time is something unusual, because while it makes perfectly sense
>at index time (where I guess you index some HTML source), at query time
>I can't imagine a scenario where the user inputs queries containing HTML
>tags.
>
>Best,
>Andrea
>
>On 15/08/18 20:43, Zimmermann, Thomas wrote:
>> Hi,
>>
>> We have the text field below configured on fields that are both stored
>>and indexed. It seems to me that applying the same filters on both index
>>and query would be redundant, and perhaps a waste of processing on the
>>retrieval side if the filter work was already done on the index side. Is
>>this a fair statement to make? Should I only be applying filters on one
>>end of the transaction?
>>
>> Thanks,
>> TZ
>>
>>
>> >positionIncrementGap="100">
>>
>>
>>
>>  
>>
>>  
>>
>>  >words="stopwords.txt" />
>>
>>  
>>
>>  >language="English" protected="protwords.txt"/>
>>
>>  
>>
>>
>>
>>
>>
>>  
>>
>>  
>>
>>  >synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>
>>  >words="stopwords.txt" />
>>
>>  >generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>
>>  
>>
>>  >language="English" protected="protwords.txt"/>
>>
>>  
>>
>>
>>
>>  
>>
>>
>>
>



Is Running the Same Filters on Index and Query Redundant?

2018-08-15 Thread Zimmermann, Thomas
Hi,

We have the text field below configured on fields that are both stored and 
indexed. It seems to me that applying the same filters on both index and query 
would be redundant, and perhaps a waste of processing on the retrieval side if 
the filter work was already done on the index side. Is this a fair statement to 
make? Should I only be applying filters on one end of the transaction?

Thanks,
TZ


   

  













  

  

















  






Atomic update deletes deduplication signature

2018-08-09 Thread Thomas Eckart

Hello,

I am having trouble when doing atomic updates in combination with 
SignatureUpdateProcessorFactory (on Solr 7.2). Normal commits of new 
documents work as expected and generate a valid signature:


curl "$URL/update?commit=true" -H 'Content-type:application/json' -d 
'{"add":{"doc":{"id": "TEST_ID1", "description": "description", 
"country": "country"}}}' && curl "$URL/select?q=id:TEST_ID1"


"response":{"numFound":1,"start":0,"docs":[
{
   "id":"TEST_ID1",
   "description":["description"],
   "country":["country"],
   "_signature":"e577e465b9099ba8",  <-- valid signature
   "_version_":1608322850016460800}]
}}

However, when updating a field (that is not used for generating the 
signature) the signature is replaced by "":


curl "$URL/update?commit=true" -H 'Content-type:application/json' -d 
'{"add":{"doc":{"id": "TEST_ID1", "country": {"set": "country2"' && 
curl "$URL/select?q=id:TEST_ID1"


"response":{"numFound":1,"start":0,"docs":[
{
   "id":"TEST_ID1",
   "description":["description"],
   "country":["country2"],
   "_signature":"",  <-- broken signature
   "_version_":1608322857485467648}]
}}

This looks a lot like the second problem mentioned in an old Solr JIRA 
issue ([1]). Unfortunately, there is no relevant response in the 
discussion there.

Any ideas how to fix this?

Thank you,
Thomas


solrconfig.xml:

[...]
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">_signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">description</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
[...]



[1] https://issues.apache.org/jira/browse/SOLR-4016


Re: Memory Leak in 7.3 to 7.4

2018-08-02 Thread Thomas Scheffler
Hi,

my final verdict is that the culprit is the upgrade to Tika 1.17. If I downgrade the
libraries just for Tika back to 1.16 and keep the rest of Solr 7.4.0, the heap usage
after about 85 % of the index process, with a manual trigger of the garbage collector,
is about 60-70 MB (that low!).

My problem now is that we have several setups that trigger this reliably, but
there is no simple test case that „fails“ if Tika 1.17 or 1.18 is used. I also
do not know if the error is inside Tika or inside the glue code that makes Tika
usable in Solr.

Should I file an issue for this?

kind regards,

Thomas


> On 02.08.2018 at 12:06, Thomas Scheffler
> wrote:
> 
> Hi,
> 
> we noticed a memory leak in a rather small setup. 40.000 metadata documents 
> with nearly as much files that have „literal.*“ fields with it. While 7.2.1 
> has brought some tika issues (due to a beta version) the real problems 
> started to appear with version 7.3.0 which are currently unresolved in 7.4.0. 
> Memory consumption is out-of-roof. Where previously 512MB heap was enough, 
> now 6G aren’t enough to index all files.
> I am now to a point where I can track this down to the libraries in 
> solr-7.4.0/contrib/extraction/lib/. If I replace them all by the libraries 
> shipped with 7.2.1 the problem disappears. As most files are PDF documents I 
> tried updating pdfbox to 2.0.11 and tika to 1.18 with no solution to the 
> problem. I will next try to downgrade these single libraries back to 2.0.6 
> and 1.16 to see if these are the source of the memory leak.
> 
> In the mean time I would like to know if anybody else experienced the same 
> problems?
> 
> kind regards,
> 
> Thomas




signature.asc
Description: Message signed with OpenPGP


Re: Memory Leak in 7.3 to 7.4

2018-08-02 Thread Thomas Scheffler
Hi,

Solr ships with a script that handles OOM errors and produces log files
for every case, with content like this:

Running OOM killer script for process 9015 for Solr on port 28080
Killed process 9015

This script works ;-)

kind regards

Thomas



> On 02.08.2018 at 12:28, Vincenzo D'Amore wrote:
> 
> Not clear if you had experienced an OOM error.
> 
> On Thu, Aug 2, 2018 at 12:06 PM Thomas Scheffler <
> thomas.scheff...@uni-jena.de> wrote:
> 
>> Hi,
>> 
>> we noticed a memory leak in a rather small setup. 40.000 metadata
>> documents with nearly as much files that have „literal.*“ fields with it.
>> While 7.2.1 has brought some tika issues (due to a beta version) the real
>> problems started to appear with version 7.3.0 which are currently
>> unresolved in 7.4.0. Memory consumption is out-of-roof. Where previously
>> 512MB heap was enough, now 6G aren’t enough to index all files.
>> I am now to a point where I can track this down to the libraries in
>> solr-7.4.0/contrib/extraction/lib/. If I replace them all by the libraries
>> shipped with 7.2.1 the problem disappears. As most files are PDF documents
>> I tried updating pdfbox to 2.0.11 and tika to 1.18 with no solution to the
>> problem. I will next try to downgrade these single libraries back to 2.0.6
>> and 1.16 to see if these are the source of the memory leak.
>> 
>> In the mean time I would like to know if anybody else experienced the same
>> problems?
>> 
>> kind regards,
>> 
>> Thomas
>> 
> 
> 
> --
> Vincenzo D'Amore




signature.asc
Description: Message signed with OpenPGP


Memory Leak in 7.3 to 7.4

2018-08-02 Thread Thomas Scheffler
Hi,

we noticed a memory leak in a rather small setup: 40.000 metadata documents
with nearly as many files that have „literal.*“ fields with them. While 7.2.1 has
brought some tika issues (due to a beta version) the real problems started to 
appear with version 7.3.0 which are currently unresolved in 7.4.0. Memory 
consumption is out-of-roof. Where previously 512MB heap was enough, now 6G 
aren’t enough to index all files.
I am now to a point where I can track this down to the libraries in 
solr-7.4.0/contrib/extraction/lib/. If I replace them all by the libraries 
shipped with 7.2.1 the problem disappears. As most files are PDF documents I 
tried updating pdfbox to 2.0.11 and tika to 1.18 with no solution to the 
problem. I will next try to downgrade these single libraries back to 2.0.6 and 
1.16 to see if these are the source of the memory leak.

In the mean time I would like to know if anybody else experienced the same 
problems?

kind regards,

Thomas


signature.asc
Description: Message signed with OpenPGP


Preferred PHP Client Library

2018-07-16 Thread Zimmermann, Thomas
Hi,

We're in the midst of our first major Solr upgrade in years and are trying to 
run some cleanup across all of our client codebases. We're currently using the 
standard PHP Solr Extension when communicating with our cluster from our 
Wordpress installs. http://php.net/manual/en/book.solr.php

Few questions.

Should we have any concerns about communicating with a Solr 7 cloud from that 
client?
Is anyone using another client they prefer? If so what are the benefits of 
switching to it?

Thanks!
TZ


Re: 7.3 appears to leak

2018-07-16 Thread Thomas Scheffler
Hi,

we noticed the same problems here in a rather small setup. 40.000 metadata 
documents with nearly as many files that have „literal.*“ fields with them. While
7.2.1 has brought some tika issues the real problems started to appear with 
version 7.3.0 which are currently unresolved in 7.4.0. Memory consumption is 
out-of-roof. Where previously 512MB heap was enough, now 6G aren’t enough to 
index all files.

kind regards,

Thomas

> On 04.07.2018 at 15:03, Markus Jelsma wrote:
> 
> Hello Andrey,
> 
> I didn't think of that! I will try it when i have the courage again, probably 
> next week or so.
> 
> Many thanks,
> Markus
> 
> 
> -Original message-
>> From:Kydryavtsev Andrey 
>> Sent: Wednesday 4th July 2018 14:48
>> To: solr-user@lucene.apache.org
>> Subject: Re: 7.3 appears to leak
>> 
>> If it is not possible to find a resource leak by code analysis and there is 
>> no better ideas, I can suggest a brute force approach:
>> - Clone Solr's sources from appropriate branch 
>> https://github.com/apache/lucene-solr/tree/branch_7_3
>> - Log every searcher's holder increment/decrement operation in a way to 
>> catch every caller name (use Thread.currentThread().getStackTrace() or 
>> something) 
>> https://github.com/apache/lucene-solr/blob/branch_7_3/solr/core/src/java/org/apache/solr/util/RefCounted.java
>> - Build custom artefacts and upload them on prod
>> - After memory leak happened - analyse logs to see what part of 
>> functionality doesn't decrement searcher after counter was incremented. If 
>> searchers are leaked - there should be such code I guess.
>> 
>> This is not something someone would like to do, but it is what it is.
>> 
>> 
>> 
>> Thank you,
>> 
>> Andrey Kudryavtsev
>> 
>> 
>> 03.07.2018, 14:26, "Markus Jelsma" :
>>> Hello Erick,
>>> 
>>> Even the silliest ideas may help us, but unfortunately this is not the 
>>> case. All our Solr nodes run binaries from the same source from our central 
>>> build server, with the same libraries thanks to provisioning. Only schema 
>>> and config are different, but the  directive is the same all over.
>>> 
>>> Are there any other ideas, speculations, whatever, on why only our main 
>>> text collection leaks a SolrIndexSearcher instance on commit since 7.3.0 
>>> and every version up?
>>> 
>>> Many thanks?
>>> Markus
>>> 
>>> -Original message-
>>>>  From:Erick Erickson 
>>>>  Sent: Friday 29th June 2018 19:34
>>>>  To: solr-user 
>>>>  Subject: Re: 7.3 appears to leak
>>>> 
>>>>  This is truly puzzling then, I'm clueless. It's hard to imagine this
>>>>  is lurking out there and nobody else notices, but you've eliminated
>>>>  the custom code. And this is also very peculiar:
>>>> 
>>>>  * it occurs only in our main text search collection, all other
>>>>  collections are unaffected;
>>>>  * despite what i said earlier, it is so far unreproducible outside
>>>>  production, even when mimicking production as good as we can;
>>>> 
>>>>  Here's a tedious idea. Restart Solr with the -v option, I _think_ that
>>>>  shows you each and every jar file Solr loads. Is it "somehow" possible
>>>>  that your main collection is loading some jar from somewhere that's
>>>>  different than you expect? 'cause silly ideas like this are all I can
>>>>  come up with.
>>>> 
>>>>  Erick
>>>> 
>>>>  On Fri, Jun 29, 2018 at 9:56 AM, Markus Jelsma
>>>>   wrote:
>>>>  > Hello Erick,
>>>>  >
>>>>  > The custom search handler doesn't interact with SolrIndexSearcher, this 
>>>> is really all it does:
>>>>  >
>>>>  >   public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse 
>>>> rsp) throws Exception {
>>>>  > super.handleRequestBody(req, rsp);
>>>>  >
>>>>  > if (rsp.getToLog().get("hits") instanceof Integer) {
>>>>  >   rsp.addHttpHeader("X-Solr-Hits", 
>>>> String.valueOf((Integer)rsp.getToLog().get("hits")));
>>>>  > }
>>>>  > if (rsp.getToLog().get("hits") instanceof Long) {
>>>>  >   rsp.addHttpHeader("X-Solr-Hits", 
>>>> String.valueOf((Long)rsp.getToLog().get("hits")));
>>>>  > }
&

Re: Managed Schemas and Version Control

2018-07-02 Thread Zimmermann, Thomas
Thanks all! I think we will maintain our current approach of hand editing
the configs in git and implement something at the shell level to automate
the process of running upconfig and performing a core reload.



Override a single value in a Config Set

2018-07-02 Thread Zimmermann, Thomas
Hi,

We have several cores with identical configurations with the sole exception 
being the language of their document sets. I'd like to leverage Config Sets to 
manage the going forward, but ran into two issues I'm struggling to solve 
conceptually.

Sample Cores:
our_documents
our_documents_de
our_documents_es
our_documents_fr

The two values I'd like to override are:

Set a default field value for a field called "language" to the language of the 
core, ex = "de" on a german core.
Override some text field analyzers to use the correct language
Override index specific language files like stopwords.txt

All of our config files live in SVN and pushed out to staging/prod envs via 
zkcli, so we want to avoid API dependent settings on our prod servers. We 
always want our configs in SVN and don't want to rely on the API to manage 
production settings in a way that we can't change via redeploying our code.

Any thoughts on if this is feasible? Should I just stick with independent core 
configs?

Thanks,
TZ





Managed Schemas and Version Control

2018-06-29 Thread Zimmermann, Thomas
Hi,

We're transitioning from Solr 4.10 to 7.x and working through our options 
around managing our schemas. Currently we manage our schema files in a git 
repository, make changes to the xml files, and then push them out to our 
zookeeper cluster via the zkcli and the upconfig command like:

/apps/solr/bin/zkcli.sh -cmd upconfig -zkhost host.com:9580 -collection core 
-confname core -confdir /apps/solr/cores/core/conf/ -solrhome /apps/solr/

This allows us to deploy schema changes without restarting the cluster, while 
maintaining version control. It looks like we could do the exact same process 
using Solr 7 and the solr control script like

bin/solr zk upconfig -z 111.222.333.444:2181 -n mynewconfig -d 
/path/to/configset

Now of course we'd like to improve this process if possible, since manually
pushing schema files to the ZK server and reloading the cores is a bit
command-line intensive. Does anyone have any guidance or experience here leveraging
the managed schema API to make updates to a schema in production while maintaining
a version-controlled copy of the schema? I'd considered using the API to make
changes to our schemas and then saving off the generated schema to git, or
saving off a script that creates the schema via the managed API and keeping that
in git, but I'm not sure if that is any easier or just adds complexity.
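
For illustration, a rough sketch of what a scripted change through the Schema API
can look like from SolrJ 7.x; it assumes the managed (mutable) schema factory, and
the field name, base URL, and collection name are made-up examples. The explicit
reload mirrors the post-upconfig reload step and may be redundant after a Schema
API change:

// Rough sketch (SolrJ 7.x); assumes the managed schema. Names/URLs are placeholders.
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class SchemaChangeSketch {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Describe the field change in code; this source file is what lives in git.
      Map<String, Object> field = new LinkedHashMap<>();
      field.put("name", "example_field_s");
      field.put("type", "string");
      field.put("stored", true);

      // Apply the change through the Schema API against one collection.
      new SchemaRequest.AddField(field).process(client, "core");

      // Explicit reload, mirroring the current post-upconfig step (may be redundant here).
      CollectionAdminRequest.reloadCollection("core").process(client);
    }
  }
}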

Any thoughts or experience appreciated.

Thanks,
TZ


Re: Solr 7.4 and Zookeeper 3.4.12

2018-06-29 Thread Zimmermann, Thomas
Thanks Shawn - I misspoke when I said recommendation, should have said
"packaged with". I appreciate the feedback and the quick updates to the
Jira issue. We'll plan to proceed with 3.4.12 when we go live.

-TZ

On 6/29/18, 11:38 AM, "Shawn Heisey"  wrote:

>On 6/28/2018 8:39 PM, Zimmermann, Thomas wrote:
>> I was wondering if there was a reason Solr 7.4 is still recommending ZK
>>3.4.11 as the major version in the official changelog vs shipping with
>>3.4.12 despite the known regression in 3.4.11. Are there any known
>>issues with running 7.4 alongside ZK 3.4.12. We are beginning a major
>>Solr upgrade project (4.10 to 7.4) and want to stand up the most recent
>>supported versions of both ZK/Solr as part of the process.
>
>That is NOT a recommendation.
>
>The mention of ZK 3.4.11 in Solr's CHANGES.txt file is simply the
>version of ZK that Solr ships with.  ZK is included with Solr mostly for
>the client functionality.  The regression is in the server code, and
>unless you run the embedded ZK server, which is not recommended for
>production, the ZK library that ships with Solr will not experience the
>regression.
>
>I am not aware of anywhere in Solr or its reference guide that makes a
>recommendation about a specific version of ZK.  The reference guide does
>mention version 3.4.11, but that's only because that's the version that
>Solr includes.  The version number in the documentation source code is
>dynamic and will always match the specific version that Solr includes.
>
>The compatibility goals of the ZK project indicate that you can run any
>3.4.x or 3.5.x version of ZK on the server side and be compatible with
>the ZK 3.4.x client that's in Solr.
>
>Look for "Backward Compatibility" on this page:
>
>https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement
>
>We have an issue for upgrading the version of ZK in Solr to 3.4.12.  I
>have uploaded a new patch on that issue to try and clear up any
>confusion about what version of ZK is recommended for use with Solr:
>
>https://issues.apache.org/jira/browse/SOLR-12346
>
>Thanks,
>Shawn
>



Solr 7.4 and Zookeeper 3.4.12

2018-06-28 Thread Zimmermann, Thomas
Hi,

I was wondering if there was a reason Solr 7.4 is still recommending ZK 3.4.11 
as the major version in the official changelog vs shipping with 3.4.12 despite 
the known regression in 3.4.11. Are there any known issues with running 7.4 
alongside ZK 3.4.12. We are beginning a major Solr upgrade project (4.10 to 
7.4) and want to stand up the most recent supported versions of both ZK/Solr as 
part of the process.

Thanks,
TZ


delta-update alternative on filechanges when using FileListEntityProcessor

2018-05-27 Thread Thomas Lustig
I configured a DataImportHandler using a FileListEntityProcessor to import
files from a folder.
This setup works really great, but I do not know how I should handle changes
on the filesystem (e.g. files added, deleted, ...).
Should I always do a "full-import"? As far as I read, "delta-import" is only
supported by SqlEntityProcessor.
Is there a best practice that is recommended?
Thanks in advance for helping me

Br
Tom


Re: Is it possible to index documents without storing their content?

2018-05-24 Thread Thomas Lustig
Thanks Emir for the great answer :)

Br

Tom

2018-05-23 10:16 GMT+02:00 Emir Arnautović <emir.arnauto...@sematext.com>:

> Hi Tom,
> Yes it is possible - see field options: https://lucene.apache.org/
> solr/guide/6_6/defining-fields.html#DefiningFields-
> OptionalFieldTypeOverrideProperties <https://lucene.apache.org/
> solr/guide/6_6/defining-fields.html#DefiningFields-
> OptionalFieldTypeOverrideProperties>. There is stored option.
> If you are asking about actual documents in original format, it is not
> even recommended to be stored in Solr.
> If you are asking if someone will be able to reconstruct document from
> Solr even if it is not stored then answer is it depends on how you index,
> one might be able to partially reconstruct it.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 23 May 2018, at 06:46, Thomas Lustig <tm.lus...@gmail.com> wrote:
> >
> > dear community,
> >
> > Is it possible to index documents (e.g. pdf, word,...)  for
> fulltextsearch
> > without storing their content(payload) inside Solr server?
> >
> > Thanking you in advance for your help
> >
> > BR
> >
> > Tom
>
>


simple enrich uploaded binary documents with sha256 hashes

2018-05-24 Thread Thomas Lustig
dear community,

I would like to automatically add a SHA-256 file hash to a document field
after a binary file is posted to an ExtractingRequestHandler.
First I thought that the ExtractingRequestHandler had such a feature, but
so far I did not find a configuration option.
It was mentioned that I should implement my own Update Request Processor
to calculate the hash and add it to a field.
The SignatureUpdateProcessor seemed to be an out-of-the-box option, but it
only supports MD5 and also does not access the raw binary stream.

The important thing is that I do need the binary stream of the uploaded
file to calculate a correct hash value (e.g. MD5, SHA-256, ...).
Is it also possible to arrange this with a ScriptUpdateProcessor and
JavaScript?
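
For comparison, a rough client-side sketch (SolrJ assumed; the field name
file_sha256, the path, and the core URL are made up, and the field would have to
exist in the schema) that hashes the bytes before the upload and passes the value
as a literal to the extract handler:

// Rough sketch (SolrJ): hash the file client-side, send it as a literal.* field.
import java.io.File;
import java.nio.file.Files;
import java.security.MessageDigest;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithHashSketch {
  public static void main(String[] args) throws Exception {
    File file = new File("/path/to/document.pdf");

    // SHA-256 over the exact bytes that will be posted.
    byte[] digest = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file.toPath()));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }

    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
      ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
      req.addFile(file, "application/pdf");
      req.setParam("literal.id", file.getName());
      req.setParam("literal.file_sha256", hex.toString());
      req.setParam("commit", "true");
      req.process(client);
    }
  }
}

This obviously does not help when other clients post files directly, which is why
a server-side hook would still be the nicer answer.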

thanks in advance for any help

Tom


Is it possible to index documents without storing their content?

2018-05-22 Thread Thomas Lustig
dear community,

Is it possible to index documents (e.g. PDF, Word, ...) for full-text search
without storing their content (payload) inside the Solr server?

Thanking you in advance for your help

BR

Tom


DocValuesField fails if bytes > 32k in solr 7.2.1

2018-03-15 Thread Minu Theresa Thomas
Hello Team,

I am using solr 7.2.1. I am getting an exception while indexing saying that
"DocValuesField  is too large, must be <= 32766, retry?"

This is my field in my managed schema.




When I checked this Lucene ticket -
https://issues.apache.org/jira/browse/LUCENE-4583 - it says it was fixed a long
time ago.

Can someone please let me know how I can get this fixed?

Thanks and Regards,
Minu


How to check if a solr core is down and is ready for a solr re-start

2017-08-31 Thread Minu Theresa Thomas
Hello Team,

I have had a few experiences where restarting a Solr node is the only option when
a core goes down. I am trying to automate the restart of a Solr server when
a core goes down or a replica is unresponsive over a period of time.

I have a script to check if the cores/replicas associated with a node are
up. I have two approaches. One is to get the cores from the Solr CLUSTERSTATUS
API and do a PING on each core; if at least one core on the node doesn't
respond to the ping, then mark that node down and do a restart after a few retries.
The second is to get the cores from the Solr CLUSTERSTATUS API along with their
status; if the status is down, then mark that node down and do a restart
after a few retries.
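
As a point of reference, a rough sketch of the first approach in SolrJ 7.x (the
base URL and core names are placeholders, the /admin/ping handler is assumed to
be enabled on the cores, and the retry/restart handling is left out):

// Rough sketch (SolrJ 7.x): ping each core on one node; names/URLs are placeholders.
import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.SolrPing;

public class NodeHealthCheckSketch {
  public static void main(String[] args) {
    String nodeBaseUrl = "http://solr-node1:8983/solr";
    List<String> cores = Arrays.asList("core1_shard1_replica_n1", "core2_shard2_replica_n3");

    boolean nodeHealthy = true;
    try (HttpSolrClient client = new HttpSolrClient.Builder(nodeBaseUrl).build()) {
      for (String core : cores) {
        try {
          // Status 0 means the ping handler answered OK.
          if (new SolrPing().process(client, core).getStatus() != 0) {
            nodeHealthy = false;
          }
        } catch (Exception e) {
          // An unreachable or erroring core counts as unhealthy.
          nodeHealthy = false;
        }
      }
    } catch (java.io.IOException e) {
      nodeHealthy = false;
    }
    System.out.println(nodeHealthy ? "node looks healthy" : "mark node down; restart after retries");
  }
}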

Which is the best/recommended approach to check if a core associated
with a node is down and the node is ready for a Solr service restart?

Thanks!


Re: Clustering on copy fields

2017-07-26 Thread Thomas Krebs
This is understood.

My question is: I have a keep words filter on field2. field2 is used for 
clustering.
Will the clustering algorithm use „some data“ or the result of applying
the keep words filter to „some data“?

Cheers,
Thomas


> On 26.07.2017 at 01:36, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> copyFields are completely independent. The _raw_ data is passed to both. IOW,
> 
> 
> sending
> <field name="field1">some data</field>
>
> is equivalent to this with no copyField:
> <field name="field1">some data</field>
> <field name="field2">some data</field>
> Best,
> Erick
> 
> 
> On Tue, Jul 25, 2017 at 11:28 AM, Thomas Krebs <thkr...@gmx.de> wrote:
>> I have defined a copied field on which I would like to use clustering. I 
>> understood that the destination field will store the full content despite 
>> the filter chain I defined.
>> 
>> Now, I have a keep word filter defined on the copied field.
>> 
>> If I run clustering on the copied field will it use the result of the filter 
>> chain, i.e. the tokens passed through the keep word filter or will it run on 
>> the full content?



Clustering on copy fields

2017-07-25 Thread Thomas Krebs
I have defined a copied field on which I would like to use clustering. I 
understood that the destination field will store the full content despite the 
filter chain I defined.

Now, I have a keep word filter defined on the copied field.

If I run clustering on the copied field will it use the result of the filter 
chain, i.e. the tokens passed through the keep word filter or will it run on 
the full content?

Same score for different length matches

2017-06-30 Thread Thomas Michael Engelke
 Hey,

we have multiple documents that are matches for the query in question
("name:hubwagen"). Thing is, some of the documents only contain the
query, while others match 100% in the "name" field:


 
 Hochhubwagen
 5.9861565
 
 Hubwagen
 5.9861565


The debug looks like this (for the first and 5th match):


 
 namhubwagnamehubwag
 
 
 name:Hubwagen
 name:Hubwagen
 name:hubwag
 name:hubwag
 
 
5.9861565 = (MATCH) weight(name:hubwag in 8093) [DefaultSimilarity],
result of:
 5.9861565 = fieldWeight in 8093, product of:
 1.0 = tf(freq=1.0), with freq of:
 1.0 = termFreq=1.0
 5.9861565 = idf(docFreq=109, maxDocs=16101)
 1.0 = fieldNorm(doc=8093)

 
5.9861565 = (MATCH) weight(name:hubwag in 9537) [DefaultSimilarity],
result of:
 5.9861565 = fieldWeight in 9537, product of:
 1.0 = tf(freq=1.0), with freq of:
 1.0 = termFreq=1.0
 5.9861565 = idf(docFreq=109, maxDocs=16101)
 1.0 = fieldNorm(doc=9537)


Now, I am decently certain that at one point in time it worked in a way
that a higher match length would rank higher. As far as I can read in
the SolrRelevancyFAQ, the correct term is "lengthNorm". However, I am
missing a preference for the full match.

Usually, the debug helps me identify mistakes, but in this case, the
debug only tells me that the scores are perfectly equal, down to the
lowest level. 

Data import handler and no status in web-ui

2017-06-06 Thread Thomas Porschberg
Hi,

I use DIH in solr-cloud mode (implicit route) in solr6.5.1.
When I start the import it works fine and I see the progress in the logfile.
However, when I click the "Refresh Status" button in the web-ui while the
import is running, I only see "No information available (idle)".
So I have to look in the logfile to observe when the import has finished.

In the old Solr, non-cloud and non-partitioned, there was an hourglass while the
import was running.

Any idea?

Best regards
Thomas


Replace a solr node which is using a block storage

2017-06-01 Thread Minu Theresa Thomas
Hi,

I am new to Solr. I have a use case to add a new node when an existing node
goes down. The new node with a new IP should contain all the replicas that
the previous node had. So I am using a network storage (cinder storage) in
which the data directory (where the solr.xml and the core directories
resides) is getting created when a node starts up. The new node with a new
IP after the replacement will contain the same set of directories which the
old node had. I have noticed the new node is added to the cluster without
the need for an ADDREPLICA.

Is this an expected behavior in Solr? Does ZK still hold the references to
the old node? What's the recommended solution if I want to re-use the data
directory associated with the old node while spinning up a new node? The
goal is to avoid data loss and to reduce the time taken to recover a node.

Thanks in advance!

-Minu


Re: setup solrcloud from scratch via web-ui

2017-05-18 Thread Thomas Porschberg

> Shawn Heisey <apa...@elyograg.org> wrote on 17 May 2017 at 15:10:
> 
> 
> On 5/17/2017 6:18 AM, Thomas Porschberg wrote:
> > Thank you. I am now a step further.
> > I could import data into the new collection with the DIH. However I 
> > observed the following exception 
> > in solr.log:
> >
> > request: 
> > http://127.0.1.1:8983/solr/hugo_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F127.0.1.1%3A8983%2Fsolr%2Fhugo_shard2_replica1%2F&wt=javabin&version=2
> > Remote error message: This IndexSchema is not mutable.
> 
> This probably means that the configuration has an update processor that
> adds unknown fields, but is using the classic schema instead of the
> managed schema.  If you want unknown fields to automatically be guessed
> and added, then you need the managed schema.  If not, then remove the
> custom update processor chain.  If this doesn't sound like what's wrong,
> then we will need the entire error message including the full Java
> stacktrace.  That may be in the other instance's solr.log file.

Ok, commenting out the "update processor chain" was a solution. I use classic 
schema.

> 
> > I imagine to split my data per day of the year. My idea was to create 365 
> > shards of type compositeKey.
> 
> You cannot control shard routing explicitly with the compositeId
> router.  That router uses a hash of the uniqueKey field to decide which
> shard gets the document.  As its name implies, the hash can be composite
> -- parts of the hash can be decided by multiple parts of the value in
> the field, but it's still hashed.
> 
> You must use the implicit router (which means all routing is manual) if
> you want to explicitly name the shard that receives the data.

I was now able to create 365 shards with the 'implicit' router.
In the collection-API call I also specified 
router.field=part_crit 
which is the day of the year 1..365
I added this field in my SQL-statement and in schema.xml.

Next step I thought would be to trigger the dataimport.

However I get:

2017-05-18 05:41:37.417 ERROR (Thread-14) [c:hansi s:308 r:core_node76 
x:hansi_308_replica1] o.a.s.h.d.DataImporter Full Import 
failed:java.lang.RuntimeException: org.apache.solr.common.SolrException: No 
registered leader was found after waiting for 4000ms , collection: hansi slice: 
230
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:270)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:475)
at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:458)
at java.lang.Thread.run(Thread.java:745)

when I start the import.

What could be the reason?

Thank you
Thomas


Re: setup solrcloud from scratch via web-ui

2017-05-17 Thread Thomas Porschberg
> Tom Evans <tevans...@googlemail.com> wrote on 17 May 2017 at 11:48:
> 
> 
> On Wed, May 17, 2017 at 6:28 AM, Thomas Porschberg
> <tho...@randspringer.de> wrote:
> > Hi,
> >
> > I did not manipulate the data dir. What I did was:
> >
> > 1. Downloaded solr-6.5.1.zip
> > 2. ensured no solr process is running
> > 3. unzipped solr-6.5.1.zip to ~/solr_new2/solr-6.5.1
> > 3. started an external zookeeper
> > 4. copied a conf directory from a working non-cloudsolr (6.5.1) to
> >~/solr_new2/solr-6.5.1 so that I have ~/solr_new2/solr-6.5.1/conf
> >   (see http://randspringer.de/solrcloud_test/my.zip for content)
> 
> ..in which you've manipulated the dataDir! :)
> 
> The problem (I think) is that you have set a fixed data dir, and when
> Solr attempts to create a second core (for whatever reason, in your
> case it looks like you are adding a shard), Solr puts it exactly where
> you have told it to, in the same directory as the previous one. It
> finds the lock and blows up, because each core needs to be in a
> separate directory, but you've instructed Solr to put them in the same
> one.
> 
> Start with a the solrconfig from basic_configs configset that ships
> with Solr and add the special things that your installation needs. I
> am not massively surprised that your non cloud config does not work in
> cloud mode, when we moved to SolrCloud, we rewrote from scratch
> solrconfig.xml and schema.xml, starting from basic_configs and adding
> anything particular that we needed from our old config, checking every
> difference that we have from stock config and noting/discerning why,
> and ensuring that our field types are using the same names for the
> same types as basic_config wherever possible.
> 
> I only say all that because to fix this issue is a single thing, but
> you should spend the time comparing configs because this will not be
> the only issue. Anyway, to fix this problem, in your solrconfig.xml
> you have:
> 
>   <dataDir>data</dataDir>
> 
> It should be
> 
>   <dataDir>${solr.data.dir:}</dataDir>
> 
> Which is still in your config, you've just got it commented out :)

Thank you. I am now a step further. 
I could import data into the new collection with the DIH. However I observed 
the following exception 
in solr.log:

request: 
http://127.0.1.1:8983/solr/hugo_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F127.0.1.1%3A8983%2Fsolr%2Fhugo_shard2_replica1%2F&wt=javabin&version=2
Remote error message: This IndexSchema is not mutable.
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:345)

I also noticed that only one shard is filled.
The wiki describes how to populate data with the REST API. However, I use the 
data import handler.
I plan to split my data per day of the year. My idea was to create 365 
shards of type compositeKey. In my SQL I have a date field, and it is no problem 
to overwrite data after one year.
However, I'm looking for a good example of how to achieve this. Maybe I need 
365 dataimport.xml files in this case, one under each shard... with some 
modulo expression for the specific day.
Currently the dataimport.xml is in the conf directory.
So I'm looking for a good example of how to use the DIH with SolrCloud.
Should it work to create an implicit router instead of the compositeKey router (with 
365 shards) and simply specify router.field= ?
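
Just to make the idea concrete, a sketch of what I have in mind (table and column
names are made up, the collection name is a placeholder, and DAYOFYEAR() assumes
MySQL -- use the equivalent for your database): keep a single dataimport.xml in the
configset and read the day from a request parameter in the SQL, e.g.

  query="SELECT id, name, DAYOFYEAR(event_date) AS day_of_year FROM my_table WHERE DAYOFYEAR(event_date) = ${dataimporter.request.day}"

and then trigger the import once per day against the core of the matching shard:

  http://localhost:8983/solr/<collection>_230_replica1/dataimport?command=full-import&day=230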

Thomas


Re: setup solrcloud from scratch vie web-ui

2017-05-16 Thread Thomas Porschberg
Hi,

I did not manipulate the data dir. What I did was:

1. Downloaded solr-6.5.1.zip
2. ensured no solr process is running
3. unzipped solr-6.5.1.zip to ~/solr_new2/solr-6.5.1
4. started an external zookeeper 
5. copied a conf directory from a working non-cloud solr (6.5.1) to 
   ~/solr_new2/solr-6.5.1 so that I have ~/solr_new2/solr-6.5.1/conf
  (see http://randspringer.de/solrcloud_test/my.zip for content)
6. posted the conf to zookeeper with:
   bin/solr zk upconfig -n heise -d ./conf -z localhost:2181
7. started solr in cloud mode with
   bin/solr -c -z localhost:2181
8. tried to create a collection with
   bin/solr create -c heise -shards 2
   -->failed with:
  
Connecting to ZooKeeper at localhost:2181 ...
INFO  - 2017-05-17 07:06:38.249; 
org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at 
localhost:2181 ready
Re-using existing configuration directory heise

Creating new collection 'heise' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=heise&numShards=2&replicationFactor=1&maxShardsPerNode=2&collection.configName=heise


ERROR: Failed to create collection 'heise' due to: 
{127.0.1.1:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
 from server at http://127.0.1.1:8983/solr: Error CREATEing SolrCore 
'heise_shard2_replica1': Unable to create core [heise_shard2_replica1] Caused 
by: Lock held by this virtual machine: 
/home/pberg/solr_new2/solr-6.5.1/server/data/index/write.lock}

9. Tried with 1 shard, worked -->
pberg@porschberg:~/solr_new2/solr-6.5.1$ bin/solr create -c heise -shards 1

Connecting to ZooKeeper at localhost:2181 ...
INFO  - 2017-05-17 07:21:01.632; 
org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at 
localhost:2181 ready
Re-using existing configuration directory heise

Creating new collection 'heise' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=heise&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=heise

{
  "responseHeader":{
"status":0,
"QTime":2577},
  "success":{"127.0.1.1:8983_solr":{
  "responseHeader":{
"status":0,
"QTime":1441},
  "core":"heise_shard1_replica1"}}}


What did I do wrong? I want to use multiple shards on ONE node.

Best regards 
Thomas



> Shawn Heisey <apa...@elyograg.org> hat am 16. Mai 2017 um 16:30 geschrieben:
> 
> 
> On 5/12/2017 8:49 AM, Thomas Porschberg wrote:
> > ERROR: Failed to create collection 'cat' due to: 
> > {127.0.1.1:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
> >  from server at http://127.0.1.1:8983/solr: Error CREATEing SolrCore 
> > 'cat_shard1_replica1': Unable to create core [cat_shard1_replica1] Caused 
> > by: Lock held by this virtual machine: 
> > /home/pberg/solr_new2/solr-6.5.1/server/data/bestand/index/write.lock}
> 
> The same Solr instance is already holding the lock on the index at
> /home/pberg/solr_new2/solr-6.5.1/server/data/bestand/index.  This means
> that Solr already has a core using that index directory.
> 
> If the write.lock were present but wasn't being held by the same
> instance, then the message would have said it was held by another program.
> 
> This sounds like you are manually manipulating settings like dataDir. 
> When you start the server from an extracted download (not as a service)
> and haven't messed with any configurations, the index directory for a
> single-shard single-replica "cat" collection should be something like
> the following, and should not be overridden unless you understand
> *EXACTLY* how SolrCloud functions and have a REALLY good reason for
> changing it:
> 
> /home/pberg/solr_new2/solr-6.5.1/server/solr/cat_shard1_replica1/data/index
> 
> On the "Sorry, no dataimport-handler defined!" problem, this is
> happening because the solrconfig.xml file being used by the collection
> does not have any configuration for the dataimport handler.  It's not
> enough to add a DIH config file, solrconfig.xml must have a dataimport
> handler defined that references the DIH config file.
> 
> Thanks,
> Shawn
>


Re: SolrCloud ... Unable to create core ... Caused by: Lock held by this virtual machine:...

2017-05-15 Thread Thomas Porschberg
Hi,

I get no error message and the shard is created when I use 
numShards=1
in the url.

http://localhost:8983/solr/admin/collections?action=CREATE&name=karpfen&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=karpfen
 --> success

http://localhost:8983/solr/admin/collections?action=CREATE&name=karpfen&numShards=2&replicationFactor=1&maxShardsPerNode=2&collection.configName=karpfen
--> error

Thomas


> Susheel Kumar <susheel2...@gmail.com> hat am 15. Mai 2017 um 14:36 
> geschrieben:
> 
> 
> what happens if you create just one shard.  Just use this command directly
> on browser or thru curl.  Empty the contents from
>  /home/pberg/solr_new2/solr-6.5.1/server/data before running
> 
> http://localhost:8983/solr/admin/collections?action=CREATE&name=karpfen&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=karpfen
> 
> On Mon, May 15, 2017 at 2:14 AM, Thomas Porschberg <tho...@randspringer.de>
> wrote:
> 
> > Hi,
> >
> > I have problems to setup solrcloud on one node with 2 shards. What I did:
> >
> > 1. Started a external zookeeper
> > 2. Ensured that no solr process is running with 'bin/solr status'
> > 3. Posted a working conf directory from a non-cloud solr to zookeeper
> >with
> >'bin/solr zk upconfig -n karpfen -d 
> > /home/pberg/solr_new/solr-6.5.1/server/solr/tommy/conf
> > -z localhost:2181'
> >--> no errors
> > 4. Started solr in cloud mode with
> >   'bin/solr -c -z localhost:2181'
> > 5. Tried to create a new collection with 2 shards with
> >'bin/solr create -c karpfen -shards 2'
> >
> > The output is:
> >
> > Connecting to ZooKeeper at localhost:2181 ...
> > INFO  - 2017-05-12 18:52:22.807; 
> > org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider;
> > Cluster at localhost:2181 ready
> > Re-using existing configuration directory karpfen
> >
> > Creating new collection 'karpfen' using command:
> > http://localhost:8983/solr/admin/collections?action=CREATE&name=karpfen&numShards=2&replicationFactor=1&maxShardsPerNode=2&collection.configName=karpfen
> >
> >
> > ERROR: Failed to create collection 'karpfen' due to: {127.0.1.1:8983
> > _solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
> > from server at http://127.0.1.1:8983/solr: Error CREATEing SolrCore
> > 'karpfen_shard2_replica1': Unable to create core [karpfen_shard2_replica1]
> > Caused by: Lock held by this virtual machine: /home/pberg/solr_new2/solr-6.
> > 5.1/server/data/ohrdruf_bestand/index/write.lock}
> >
> >
> > The conf directory I copied contains the following files:
> > currency.xml     elevate.xml  protwords.txt   stopwords.txt
> > dataimport-cobt2.properties  lang schema.xml  synonyms.txt
> > dataimport.xml   params.json  solrconfig.xml
> >
> > "lang" is a directory.
> >
> > Are my steps wrong? Did I miss something important?
> >
> > Any help is really welcome.
> >
> > Thomas
> >


SolrCloud ... Unable to create core ... Caused by: Lock held by this virtual machine:...

2017-05-15 Thread Thomas Porschberg
Hi,

I have problems to setup solrcloud on one node with 2 shards. What I did:

1. Started a external zookeeper
2. Ensured that no solr process is running with 'bin/solr status'
3. Posted a working conf directory from a non-cloud solr to zookeeper
   with
   'bin/solr zk upconfig -n karpfen -d 
/home/pberg/solr_new/solr-6.5.1/server/solr/tommy/conf -z localhost:2181'
   --> no errors
4. Started solr in cloud mode with
  'bin/solr -c -z localhost:2181'
5. Tried to create a new collection with 2 shards with
   'bin/solr create -c karpfen -shards 2'

The output is:

Connecting to ZooKeeper at localhost:2181 ...
INFO  - 2017-05-12 18:52:22.807; 
org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at 
localhost:2181 ready
Re-using existing configuration directory karpfen

Creating new collection 'karpfen' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=karpfen&numShards=2&replicationFactor=1&maxShardsPerNode=2&collection.configName=karpfen


ERROR: Failed to create collection 'karpfen' due to: 
{127.0.1.1:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
 from server at http://127.0.1.1:8983/solr: Error CREATEing SolrCore 
'karpfen_shard2_replica1': Unable to create core [karpfen_shard2_replica1] 
Caused by: Lock held by this virtual machine: 
/home/pberg/solr_new2/solr-6.5.1/server/data/ohrdruf_bestand/index/write.lock}

   
The conf directory I copied contains the following files:
currency.xml elevate.xml  protwords.txt   stopwords.txt
dataimport-cobt2.properties  lang schema.xml  synonyms.txt
dataimport.xml   params.json  solrconfig.xml

"lang" is a directory.

Are my steps wrong? Did I miss something important? 

Any help is really welcome.

Thomas


Re: setup solrcloud from scratch vie web-ui

2017-05-12 Thread Thomas Porschberg
Hi,

I think I made one mistake: in step 3 I started solr without the zookeeper option.
I did:
 bin/solr start -c
but I think it should be:
bin/solr start -c -z localhost:2181

The problem now is that when I repeat step 4 (creating a collection) I get the 
following error:

// I uploaded my cat-config again to zookeeper with
// bin/solr zk upconfig -n cat -d $HOME/solr-6.5.1/server/solr/tommy/conf -z localhost:2181


bin/solr create -c cat -shards 2

Connecting to ZooKeeper at localhost:2181 ...
INFO  - 2017-05-12 16:38:06.593; 
org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at 
localhost:2181 ready
Re-using existing configuration directory cat

Creating new collection 'cat' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=cat&numShards=2&replicationFactor=1&maxShardsPerNode=2&collection.configName=cat


ERROR: Failed to create collection 'cat' due to: 
{127.0.1.1:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
 from server at http://127.0.1.1:8983/solr: Error CREATEing SolrCore 
'cat_shard1_replica1': Unable to create core [cat_shard1_replica1] Caused by: 
Lock held by this virtual machine: 
/home/pberg/solr_new2/solr-6.5.1/server/data/bestand/index/write.lock}

This "data/bestand" is configured in solrconfig.xml (from tommy standalone) with
data/bestand

I tried to create the directory 
/home/pberg/solr_new2/solr-6.5.1/server/data/bestand/index/ manually, but 
nothing changed.

What is the reason for this CREATE-error?

Thomas




> ANNAMANENI RAVEENDRA <a.raveendra...@gmail.com> hat am 12. Mai 2017 um 15:54 
> geschrieben:
> 
> 
> Hi ,
> 
> If there is a request handler configured in solrconfig.xml and you update the
> conf in zookeeper, it should show up.
> 
> If you already did that, try reloading the configuration.
> 
> Thanks
> Ravi
> 
> 
> On Fri, 12 May 2017 at 9:46 AM, Thomas Porschberg <tho...@randspringer.de>
> wrote:
> 
> > > > This is another problem I see: With my non-cloud core I have a
> > conf-directory where I have dataimport.xml, schema.xml and solrconfig.xml.
> > > > I think these 3 files are enough to import my data from my relational
> > database.
> > > > Under example/cloud I could not find one of them. How to setup DIH for
> > the solrcould?
> > >
> > > The entire configuration (what would normally be in the conf directory)
> > > is in zookeeper when you're in cloud mode, not in the core directories.
> > > You must upload a directory containing the same files that would
> > > normally be in a conf directory as a named configset to zookeeper before
> > > you try to create your collection.  This is something that the "bin/solr
> > > create" command does for you in cloud mode, typically using one of the
> > > configsets included on the disk as a source.
> > >
> > >
> > https://cwiki.apache.org/confluence/display/solr/Using+ZooKeeper+to+Manage+Configuration+Files
> > >
> > Ok, thank you. I did the following steps.
> >
> > 1. Started an external zookeeper
> > 2. Copied a conf-directory to zookeeper:
> > bin/solr zk upconfig -n books -d $HOME/solr-6.5.1/server/solr/tommy/conf
> > -z localhost:2181
> > // This is a conf-directory from a standalone solr when dataimport was
> > working!
> > --> Connecting to ZooKeeper at localhost:2181 ...
> > Uploading <> for config books to ZooKeeper at localhost:2181
> > // I think no errors, but how can I check it in zookeeper? I found no
> > files solrconfig.xml ...
> > in the zookeeper directories (installation dir and data dir)
> > 3. Started solr:
> > bin/solr start -c
> > 4. Created a books collection with 2 shards
> > bin/solr create -c books -shards 2
> >
> > Result: I see in the web-ui my books collection with the 2 shards. No
> > errors so far.
> > However, the Dataimport-entry says:
> > "Sorry, no dataimport-handler defined!"
> >
> > What could be the reason?
> >
> > Thomas
> >


Re: setup solrcloud from scratch vie web-ui

2017-05-12 Thread Thomas Porschberg
> > This is another problem I see: With my non-cloud core I have a 
> > conf-directory where I have dataimport.xml, schema.xml and solrconfig.xml. 
> > I think these 3 files are enough to import my data from my relational 
> > database.
> > Under example/cloud I could not find one of them. How to setup DIH for the 
> > solrcould?
> 
> The entire configuration (what would normally be in the conf directory)
> is in zookeeper when you're in cloud mode, not in the core directories. 
> You must upload a directory containing the same files that would
> normally be in a conf directory as a named configset to zookeeper before
> you try to create your collection.  This is something that the "bin/solr
> create" command does for you in cloud mode, typically using one of the
> configsets included on the disk as a source.
> 
> https://cwiki.apache.org/confluence/display/solr/Using+ZooKeeper+to+Manage+Configuration+Files
> 
Ok, thank you. I did the following steps.

1. Started an external zookeeper
2. Copied a conf-directory to zookeeper: 
bin/solr zk upconfig -n books -d $HOME/solr-6.5.1/server/solr/tommy/conf -z 
localhost:2181
// This is a conf-directory from a standalone solr when dataimport was working!
--> Connecting to ZooKeeper at localhost:2181 ...
Uploading <> for config books to ZooKeeper at localhost:2181
// I think no errors, but how can I check it in zookeeper? I found no files 
solrconfig.xml ...
in the zookeeper directories (installation dir and data dir)
3. Started solr:
bin/solr start -c
4. Created a books collection with 2 shards
bin/solr create -c books -shards 2

Result: I see in the web-ui my books collection with the 2 shards. No errors so 
far.
However, the Dataimport-entry says:
"Sorry, no dataimport-handler defined!"

What could be the reason?

Thomas


setup solrcloud from scratch vie web-ui

2017-05-12 Thread Thomas Porschberg
Hi,

I want to set up a SolrCloud. I want to test sharding with one node, no 
replication.
I have some experience with non-cloud Solr and I have also run the cloud 
examples.
I also have to use the DIH for importing. I think I can live with the internal 
zookeeper.

I did my first steps with solr-6.5.1.

My first question is: Is it possible to set up a new SolrCloud with the web-ui 
only?

When I start solr with: 'bin/solr start -c'

I get a menu on the left side where I can create new collections and cores.
I think when I have only one node with no replication a collection maps to one 
core, right?

Should I create the core first, or the collection? 
What should I fill in as instanceDir? 

For example: when I create a 'books/data' directory at the command line 
under '$HOME/solr-6.5.1/server/solr'
and then fill in 'books' as instanceDir and 'data' as the data directory, 
I get a 
'SolrCore Initialization Failures'

books: 
org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
 Could not find configName for collection books found:null


Is something like a step-by-step manual available? 
The next step would be to set up DIH again. 

This is another problem I see: with my non-cloud core I have a conf directory 
containing dataimport.xml, schema.xml and solrconfig.xml. 
I think these 3 files are enough to import my data from my relational database.
Under example/cloud I could not find any of them. How do I set up DIH for 
SolrCloud?

Best regards
Thomas


CDCR: Help With Tlog Growth Issues

2016-11-10 Thread Thomas Tickle
I am having an issue with cdcr that I could use some assistance in resolving.

I followed the instructions found here: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462

The CDCR is set up with a single source to a single target.  Both the source and 
target cluster are identically set up as 3 machines, each running an external 
zookeeper and a solr instance.  I've enabled the data replication and 
successfully seen the documents replicated from the source to the target with 
no errors in the log files.

However, when examining the /cdcr?action=QUEUES command, I noticed that the 
tlogTotalSize and tlogTotalCount were alarmingly high.  Checking the data 
directory for each shard, I was able to confirm that there were several thousand 
log files of 3-4 MB each.  It added up to almost 35 GB of tlogs.  
Obviously, this amount of tlogs causes a serious issue when trying to restart a 
Solr server after activities such as patching.

Is it normal for old tlogs to never get removed in a CDCR setup?
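
One thing I still need to rule out (an assumption on my part, not yet verified in
our setup): as far as I understand, CDCR keeps buffering updates as long as the
update log buffer is enabled, which is the default, and buffered tlogs are never
cleaned up. A sketch of checking and disabling it on the source collection (host
and collection name are placeholders):

curl "http://source-host:8983/solr/<collection>/cdcr?action=STATUS"
curl "http://source-host:8983/solr/<collection>/cdcr?action=DISABLEBUFFER"

With the buffer disabled, old tlogs should be purged once the target has
acknowledged the updates.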


Thomas Tickle





filter groups

2016-07-04 Thread Thomas Scheffler

Hi,

I have metadata and files indexed in Solr. All have a different id of 
course, but they share the same value for "returnId" if they belong to the same 
metadata document that describes a bunch of files (1:n).


When I search, I usually use grouping instead of join queries to 
keep the information about where the hit occurred.


Now it gets tricky. I want to filter out groups depending on 
a field that is only available on metadata documents: visibility.


I want to search in solr like: "Find all documents containing 'foo' 
grouped by returnId, where the metadata visibility is 'public'"


So it should find any 'foo' files but only display the result if the 
corresponding metadata documents field visibility='public'.


Faceting also uses just the information inside groups. Can I give SOLR 
some information for 'fq' and 'facet.*' to work with my setup?


I am still using SOLR 4.10.5
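
One idea I had (a sketch only, assuming everything lives in a single core/shard,
since the join query parser does not work across shards): add a self-join filter
that keeps only documents whose returnId belongs to a metadata document with
visibility 'public', e.g.

fq={!join from=returnId to=returnId}visibility:public

Grouping and faceting should then only see documents from groups whose metadata
is public. But maybe there is a better way?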

kind regards

Thomas


Getting org.apache.lucene.document.Field instead of String in SolrDocument#get

2016-01-27 Thread Thomas Mortagne
Hi guys,

I have some code using SolrInstance#queryAndStreamResponse and since I
moved to Solr 5.3.1 (from 4.10.4) my StreamingResponseCallback is
called with a SolrDocument filled with Field instead of the String it
used to receive when calling #get('myfield').

Is this expected? Should I change all my code dealing with
SolrDocument to be careful about that? From what I could see those
Fields are put in SolrDocument by DocsStreamer, which seems to be new
in 5.3, but I did not dig much more.

It looks a bit weird to me given the javadoc of #getFieldValue, which
is implemented exactly like #get. Also it's not consistent with
SolrInstance#query behavior, which returns me SolrDocuments containing
values and not Fields.

Sorry if I missed it in the release notes.
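
For now I am working around it with something like this sketch (my own code; it
assumes the raw value may come back as a Lucene IndexableField and unwraps it):

import org.apache.lucene.index.IndexableField;
import org.apache.solr.client.solrj.StreamingResponseCallback;
import org.apache.solr.common.SolrDocument;

StreamingResponseCallback callback = new StreamingResponseCallback() {
  @Override
  public void streamSolrDocument(SolrDocument doc) {
    Object raw = doc.getFieldValue("myfield");
    // unwrap the stored value if the streamer handed us the Lucene field object
    String value = (raw instanceof IndexableField)
        ? ((IndexableField) raw).stringValue()
        : String.valueOf(raw);
    // ... use value ...
  }

  @Override
  public void streamDocListInfo(long numFound, long start, Float maxScore) {
    // nothing to do here
  }
};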

Thanks for your time !
-- 
Thomas


Suggester needed for returning suggestions when term is not start of field value

2015-08-07 Thread Thomas Michael Engelke
 Hey,

I'm playing around with the suggester component, and it works perfectly
as described: Suggestions for 'logitech mouse' include 'logitech mouse
g500' and 'logitech mouse gaming'.

However, when the words in the record supplying the suggester do not
follow each other as in the search terms, nothing is returned.
Suggestions for 'logitech mouse' do not include 'logitech g500 mouse'.

Is there a suggester implementation that can suggest records that way?
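
In case it helps to be concrete, this is roughly the configuration I imagine would
be needed (a sketch; the field and analyzer names here are placeholders, the
lookupImpl is the actual point):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">infixSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>

As far as I understand, AnalyzingInfixLookupFactory matches the query terms anywhere
inside the suggestion, so 'logitech mouse' could also bring back 'logitech g500 mouse'.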

Best wishes. 

Re: Dollar signs in field names

2015-07-28 Thread Thomas Seidl
Thanks for your answer!

As mentioned, I'm aware of the problems with other characters like
colons and dashes. I've just never run into any issues with dollar
signs. And previously, before there was an official definition, I heard
from several people that valid Java identifiers was a good rule of
thumb – which would include dollar signs.

I'd just hoped that when there would be a definition (and it's of course
very good and important that there now is one) it would more or less
mirror that rule of thumb and also allow for dollar signs.

Now it's a pretty tough call whether to use them or not.

Cheers,
Thomas

On 2015-07-27 21:31, Erick Erickson wrote:
 The problem has been that field naming conventions weren't
 _ever_ defined strictly. It's not that anyone is taking away
 the ability to use other characters,  rather it's codifying what's always
 been true; Solr isn't guaranteed to play nice with naming
 conventions other than those specified on the page you
 referenced, alphanumerics and underscores and _not_ starting
 with numerics.
 
 The danger is that parsing the incoming URL can run into
 issues. Take for instance a colon. How would the parsing
 process distinguish that from a field:value separator? Or a
 hyphen when is that NOT and when is that part of a field
 name? Periods are also interesting. You can specify some
 params (e.g. facet params) with periods (f.field.prop=). No
 guarantee has ever been made that a field _name_ with a
 period won't confuse things. It happens to work, but that's
 not by design, just like dollar signs.
 
 So you can use dollar signs, but there won't be any attempts
 to support it if some component somewhere doesn't do the
 right thing with it. And no guarantee that there aren't current
 corner cases where that causes problems. And if it does cause
 problems, support won't be added.
 
 Best,
 Erick
 
 On Mon, Jul 27, 2015 at 10:42 AM, Thomas Seidl re...@gmx.net wrote:
 Hi all,

 I've used dollar signs in field names for several years now, as an easy
 way to escape bad characters (like colons) coming in from the original
 source of the data, and I've never had any problems. Since I don't know
 of any Solr request parameters that use a dollar sign as a special
 character, I also wouldn't know where one might occur.

 But while I remember that the supported format for field names was
 previously completely undocumented (and it was basically almost
 anything is supported, but some things might not work with some
 characters), I now read that for about a year there has been a strict
 definition/recommendation in the Solr wiki [1] which doesn't allow for
 dollar signs.

 [1] https://cwiki.apache.org/confluence/display/solr/Defining+Fields

 So, my question is: Is this just for an easier definition, or is there a
 real danger of problems when using dollar signs in field names? Or,
 differently: How bad of an idea is it?
 Also, where was this definition discussed, why was this decision
 reached? Is there really an argument against dollar signs? I have to say
 it is really very handy to have a character available for field names
 that is usually not allowed in programming language's identifiers (as a
 cheap escape character).

 Thanks in advance,
 Thomas
 


Dollar signs in field names

2015-07-27 Thread Thomas Seidl
Hi all,

I've used dollar signs in field names for several years now, as an easy
way to escape bad characters (like colons) coming in from the original
source of the data, and I've never had any problems. Since I don't know
of any Solr request parameters that use a dollar sign as a special
character, I also wouldn't know where one might occur.

But while I remember that the supported format for field names was
previously completely undocumented (and it was basically almost
anything is supported, but some things might not work with some
characters), I now read that for about a year there has been a strict
definition/recommendation in the Solr wiki [1] which doesn't allow for
dollar signs.

[1] https://cwiki.apache.org/confluence/display/solr/Defining+Fields

So, my question is: Is this just for an easier definition, or is there a
real danger of problems when using dollar signs in field names? Or,
differently: How bad of an idea is it?
Also, where was this definition discussed, why was this decision
reached? Is there really an argument against dollar signs? I have to say
it is really very handy to have a character available for field names
that is usually not allowed in programming language's identifiers (as a
cheap escape character).

Thanks in advance,
Thomas


Using edismax in a filter query

2015-07-10 Thread Thomas Seidl
Hi all,

I was wondering if there's any way to use the Extended DisMax query
parser in an fq filter query?
The problem is that I have a facet.query with which I want to check
whether a certain set of keywords would have any results. But since the
normal query goes across multiple fields, I end up with something like this:

  facet.query=(field1:search OR field2:search OR field3:search OR
field4:search) AND (field1:keys OR field2:keys OR field3:keys OR
field4:keys)

(Just with a lot more fields.) On the one hand this is rather ugly to
see in the logs, but mostly I'm concerned that this would be harder to
parse for Solr than using its own edismax parser to do the job.

So, is there a way to do that? Or are there any other alternatives to
achieve this (except sending a second query, of course)?
Since the fields used can change from request to request, it's not
possible to dump all their contents into a single field for that purpose.

Thanks in advance,
Thomas


Performance of q.alt vs. fq

2015-07-10 Thread Thomas Seidl
Hi all,

I am working a lot with Drupal and Apache Solr. There, we implemented a
performance improvement that would, for filter-only queries (i.e., no
q parameter, just fqs) instead move the filters to the q.alt
parameter (idea based on this blog post [1]).

[1]
https://web.archive.org/web/20120817044656/http://www.derivante.com/2009/04/27/100x-increase-in-solr-performance-and-throughput

Before, we had q.alt=*:* to return all results and then filter via
fq. So, e.g., this query:
  q.alt=*:*&fq=field1:foo&fq=field2:bar
becomes this:
  q.alt=(field1:foo) AND (field2:bar)

However, now I've read some complaints that the former is actually
faster, and some other people also pointed out (in separate discussions)
that fq is much faster than q.

So, can anyone shed some lights on this, what the internal mechanics are
that make the one or the other faster? Are there other suggestions for
how to make a filters-only search as fast as possible?

Also, can it be that it recently changed that the q.alt parameter now
influences relevance (in Solr 5.x)? I could have sworn that wasn't the
case previously.

Thanks in advance,
Thomas


Re: Using edismax in a filter query

2015-07-10 Thread Thomas Seidl
Hi Ahmet,

Brilliant, thanks a lot!
I thought it might be possible with local parameters, but couldn't find
any information anywhere on how (especially setting the multi-valued
qf parameter).

Thanks again,
Thomas

On 2015-07-10 14:09, Ahmet Arslan wrote:
 Hi Tomasi
 
 Yes it is possible, please see local params : 
 https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries
 
 fq={!edismax qf='field1 field2 field'}search key
 Ahmet
 
 
 On Friday, July 10, 2015 2:20 PM, Thomas Seidl re...@gmx.net wrote:
 
 
 
 Hi all,
 
 I was wondering if there's any way to use the Extended DisMax query
 parser in an fq filter query?
 The problem is that I have a facet.query with which I want to check
 whether a certain set of keywords would have any results. But since the
 normal query goes across multiple fields, I end up with something like this:
 
   facet.query=(field1:search OR field2:search OR field3:search OR
 field4:search) AND (field1:keys OR field2:keys OR field3:keys OR
 field4:keys)
 
 (Just with a lot more fields.) On the one hand this is rather ugly to
 see in the logs, but mostly I'm concerned that this would be harder to
 parse for Solr than using its own edismax parser to do the job.
 
 So, is there a way to do that? Or are there any other alternatives to
 achieve this (except sending a second query, of course)?
 Since the fields used can change from request to request, it's not
 possible to dump all their contents into a single field for that purpose.
 
 Thanks in advance,
 Thomas
 


Re: Questions regarding autosuggest (Solr 5.2.1)

2015-06-30 Thread Thomas Michael Engelke
 God damn. Thank you.

*ashamed*

Am 30.06.2015 00:21 schrieb Erick Erickson: 

 Try not putting it in double quotes?
 
 Best,
 Erick
 
 On Mon, Jun 29, 2015 at 12:22 PM, Thomas Michael Engelke
 thomas.enge...@posteo.de wrote:
 
 A friend and I are trying to develop some software using Solr in the
 background, and with that comes alot of changes. We're used to older
 versions (4.3 and below). We especially have problems with the autosuggest
 feature. This is the field definition (schema.xml) for our autosuggest field:

 <field name="autosuggest" type="autosuggest" indexed="true" stored="true" required="false" multiValued="true" />
 ...
 <copyField source="name" dest="autosuggest" />
 ...
 <fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="0"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
   <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
   <filter class="solr.GermanNormalizationFilterFactory"/>
   <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="0"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
   <filter class="solr.GermanNormalizationFilterFactory"/>
   <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
 </fieldType>

 Afterwards, we defined an autosuggest component to use this field, like this
 (solrconfig.xml):

 <searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
   <str name="name">mySuggester</str>
   <str name="lookupImpl">FuzzyLookupFactory</str>
   <str name="storeDir">suggester_fuzzy_dir</str>
   <str name="dictionaryImpl">DocumentDictionaryFactory</str>
   <str name="field">suggest</str>
   <str name="suggestAnalyzerFieldType">"autosuggest"</str>
   <str name="buildOnStartup">false</str>
   <str name="buildOnCommit">false</str>
  </lst>
 </searchComponent>

 And add a requesthandler to test out the functionality:

 <requestHandler name="/suggesthandler" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
   <str name="suggest">true</str>
   <str name="suggest.count">10</str>
   <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
   <str>suggest</str>
  </arr>
 </requestHandler>

 However, trying to start the core that has this configuration, a long exception
 occurs, telling us this:

 Error in configuration: "autosuggest" is not defined in the schema

 Now, that seems to be wrong. Any idea how to fix that?
 

Questions regarding autosuggest (Solr 5.2.1)

2015-06-29 Thread Thomas Michael Engelke
 

 A friend and I are trying to develop some software using Solr in the
background, and with that comes alot of changes. We're used to older
versions (4.3 and below). We especially have problems with the
autosuggest feature.

This is the field definition (schema.xml) for our autosuggest field:

<field name="autosuggest" type="autosuggest" indexed="true"
 stored="true" required="false" multiValued="true" />
...
<copyField source="name" dest="autosuggest" />
...
<fieldType name="autosuggest" class="solr.TextField"
 positionIncrementGap="100">
 <analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
   splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
   catenateWords="1" catenateNumbers="0" catenateAll="0"
   preserveOriginal="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
   ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
   dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
   maxSubwordSize="30" onlyLongestMatch="false"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2"
   protected="protwords.txt"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
   maxGramSize="30"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
   splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
   catenateWords="1" catenateNumbers="0" catenateAll="0"
   preserveOriginal="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
   ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2"
   protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
</fieldType>

Afterwards, we defined an autosuggest component to use this field, like
this (solrconfig.xml):

<searchComponent name="suggest" class="solr.SuggestComponent">
 <lst name="suggester">
  <str name="name">mySuggester</str>
  <str name="lookupImpl">FuzzyLookupFactory</str>
  <str name="storeDir">suggester_fuzzy_dir</str>
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">suggest</str>
  <str name="suggestAnalyzerFieldType">"autosuggest"</str>
  <str name="buildOnStartup">false</str>
  <str name="buildOnCommit">false</str>
 </lst>
</searchComponent>

And add a requesthandler to test out the functionality:

<requestHandler name="/suggesthandler" class="solr.SearchHandler"
 startup="lazy">
 <lst name="defaults">
  <str name="suggest">true</str>
  <str name="suggest.count">10</str>
  <str name="suggest.dictionary">mySuggester</str>
 </lst>
 <arr name="components">
  <str>suggest</str>
 </arr>
</requestHandler>

However, trying to start the core that has this configuration, a long
exception occurs, telling us this:

Error in configuration: "autosuggest" is not defined in the schema

Now, that seems to be wrong. Any idea how to fix that? 

Problem with german hyphenated words not being found

2015-06-11 Thread Thomas Michael Engelke
 Hey,

in german, you can string most nouns together by using hyphens, like
this:

Industrie = industry
Anhänger = trailer

Industrie-Anhänger = trailer for industrial use

Here [1], you can see me querying Industrieanhänger from the name
field (name:Industrieanhänger), to make sure the index actually contains
the word. Our data is structured so that products are listed without the
hyphen.

Now, customers can come around and use the hyphenated version as a
search term (i.e. industrie-anhänger), and of course we want them to
find what they are looking for. I've set it up so that the
WordDelimiterFilterFactory uses catenateWords=1, so that these words
are catenated. An analysis of Industrieanhänger as index and
industrie-anhänger as query can be seen here [2].

You can see that both word parts are found. However, querying for
industrie-anhänger does not yield results, only when the hyphen is
removed, as you can see here [3]. I'm not sure how to proceed from here,
as the results of the analysis have so far always lined up with what I
could see when querying. Here's the schema definition for text, the
field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
 autoGeneratePhraseQueries="true">
 <analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
   splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
   catenateWords="1" catenateNumbers="0" catenateAll="0"
   preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
   dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
   maxSubwordSize="30" onlyLongestMatch="false"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
   ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2"
   protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
   splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
   catenateWords="1" catenateNumbers="0" catenateAll="0"
   preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
   dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
   maxSubwordSize="30" onlyLongestMatch="false"/> -->
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
   ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2"
   protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding
the hyphen, but replacing it with %2D didn't change the outcome (and was
probably wrong anyway).

Any help is greatly appreciated. 

Links:
--
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t


Re: Problem with german hyphenated words not being found

2015-06-11 Thread Thomas Michael Engelke
 Thank you for your input. Here's how the query looks with
debugQuery=true:

rawquerystring: name:industrie-anhänger,
 querystring: name:industrie-anhänger,
 parsedquery: MultiPhraseQuery(name:(industrie-anhang industri)
(anhang industrieanhang)),
 parsedquery_toString: name:(industrie-anhang industri) (anhang
industrieanhang),

 It looks like there are some rules applied, expressed by the braces.
What's the correct interpretation of that? The default operator is OR,
yet this looks like the terms inside the braces group using AND.

Am 11.06.2015 12:40 schrieb Upayavira: 

 The next thing to do is add debugQuery=true to your URL (or enable it in
 the query pane of the admin UI). Then look for the parsed query info.
 
 On the standard text_en field which includes an English stop word
 filter, I ran a query on Jack and Jill's House which showed
 this output:
 
 rawquerystring: text_en:(Jack and Jill's House), querystring:
 text_en:(Jack and Jill's House), parsedquery: text_en:jack
 text_en:jill text_en:hous, parsedquery_toString: text_en:jack
 text_en:jill text_en:hous,
 
 You can see that the parsed query is formed *after* analysis, so you can
 see exactly what is being queried for.
 
 Also, as a corollary to this, you can use the schema browser (or
 faceting for that matter) to view what terms are being indexed, to see
 if they should match.
 
 HTH
 
 Upayavira
 
 Am 11.06.2015 12:00 schrieb Upayavira:
 Have you used the analysis tab in the admin UI? You can type in

sentences for both index and query time and see how they would be
analysed by various fields/field types.

Once you have got index time and query time to result in the same tokens
at the end of the analysis chain, you should start seeing matches in
your queries.

Upayavira

On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote:

 Hey, in german, you can string most nouns together by using hyphens, like
 this:

 Industrie = industry
 Anhänger = trailer

 Industrie-Anhänger = trailer for industrial use

 Here [1], you can see me querying Industrieanhänger from the name field
 (name:Industrieanhänger), to make sure the index actually contains the word.
 Our data is structured so that products are listed without the hyphen.

 Now, customers can come around and use the hyphenated version as a search
 term (i.e. industrie-anhänger), and of course we want them to find what they
 are looking for. I've set it up so that the WordDelimiterFilterFactory uses
 catenateWords=1, so that these words are catenated. An analysis of
 Industrieanhänger as index and industrie-anhänger as query can be seen
 here [2].

 You can see that both word parts are found. However, querying for
 industrie-anhänger does not yield results, only when the hyphen is removed,
 as you can see here [3]. I'm not sure how to proceed from here, as the
 results of the analysis have so far always lined up with what I could see
 when querying. Here's the schema definition for text, the field type for
 the name field:

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
   <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
   <filter class="solr.GermanNormalizationFilterFactory"/>
   <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/> -->
   <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
   <filter class="solr.GermanNormalizationFilterFactory"/>
   <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
 </fieldType>

 I've also thought it might be a problem with URL encoding not encoding the
 hyphen, but replacing it with %2D didn't change the outcome (and was probably
 wrong anyway).

 Any help is greatly appreciated.

 Links:
 --
 [1] http://imgur.com/2oEC5vz
 [2] http://i.imgur.com

Solr: Elevate with complex query specifying field names

2015-05-31 Thread Thomas Michael Engelke
 

I have Solr as the backend to an ECommerce solution where the fields can
be configured to be searchable, which generates a schema.xml and loads
it into Solr. 

Now we also allow to configure Solr search weight per field to affect
queries, so my queries usually look something like this: 

spellcheck=truefl=entity_id,scorehl.snippets=1start=0q=ean:test+name:test^10.00+persartnr:test^5.00+persartnr_direct:test+short_description:testspellcheck.q=testspellcheck.build=true=truehl.simple.pre=span+class%3Dhighlighthl.simple.post=/spanjson.nl=maphl.fl=name,short_descriptionwt=jsonspellcheck.collate=truehl=truerows=1000

Now, I want to add query elevation to my mix. I got it to work pretty
flawlessly, however, I'm not sure how to get it to work with my queries
as they specifically state field names and especially boosts on a
regular basis. 

This works and gets elevated when queried as q=test: 

<elevate>
 <query text="test">
  <doc id="14153" />
 </query>
</elevate>

However, when queried as q=name:test^10.00, this elevation does not
work/doesn't elevate. 

Is there a way around that? Can I specify the naked query somehow for
the elevation component? 
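
One idea I am considering, assuming Solr 4.7 or newer (I have not verified this in
our setup yet): instead of relying on the query text matching elevate.xml, pass the
documents to elevate explicitly at request time with the elevateIds parameter of the
QueryElevationComponent, which should work no matter how q is written, e.g.

q=name:test^10.00&elevateIds=14153&enableElevation=true&forceElevation=true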
 

Reading an index while it is being updated?

2015-05-13 Thread Guy Thomas
Up to now we've been using Lucene without Solr.

The Lucene index is being updated and when the update is finished we notify a 
Hessian proxy service running on the web server that wants to read the index. 
When this proxy service is notified, the server knows it can read the updated 
index.

Do we have to use a similar set-up when using Solr, that is:

1. Create/update the index

2. Notify the Solr client
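
My understanding so far (please correct me if this is wrong): with Solr, step 2
should not be needed. Once an update is followed by a commit (a hard commit with
openSearcher=true, or a soft commit / autoSoftCommit), Solr opens a new searcher
and every subsequent query sees the new documents, so the reading side does not
have to be notified. A minimal sketch, the core name being a placeholder:

curl "http://localhost:8983/solr/<core>/update?commit=true"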




  Guy Thomas
  Analist-Programmeur

  Provincie Vlaams-Brabant
  Dienst Projecten en Ontwikkelingen
  Provincieplein 1 - 3010 Leuven
  Tel: 016-26 79 45
  www.vlaamsbrabant.be







Integration Tests with SOLR 5

2015-02-24 Thread Thomas Scheffler

Hi,

I noticed that SOLR not only does not deliver a WAR file anymore, but 
also advises against providing a custom deployable WAR file, as future 
versions may depend on custom Jetty features.


Until 4.10 we were able to provide a WAR file with all the plug-ins we 
need, for easier installs. The same WAR file was used together with a 
web application WAR to run integration tests and to check that all 
application details still work. We used the cargo-maven2-plugin and 
different servlet containers for testing. I think this is quite a common 
thing to do with continuous integration.


Now I wonder if anyone has a similar setup with integration tests 
running against SOLR 5.


- No artifacts can be used, so no local repository cache is present
- How to deploy your schema.xml, stopwords, solr plug-ins etc. for 
testing in an isolated environment

- What does a maven boilerplate code look like?

Any ideas would be appreciated.

Kind regards,

Thomas


Confirm Solr index corruption

2015-02-17 Thread Thomas Mathew
Hi All,

I use Solr 4.4.0 in a master-slave configuration. Last week, the master
server ran out of disk (logs got too big too quickly due to a bug in our
system). Because of this, we weren't able to add new docs to an index. The
first thing I did was to delete a few old log files to free up disk space
(later I moved the other logs to free up disk). The index is working fine
even after this fiasco.

The next day, a colleague of mine pointed out that we may be missing a few
documents in the index. I suspect the above scenario may have broken the
index. I ran CheckIndex against this index. It didn't mention any
corruption though.

Right now, the index has about 25k docs. I haven't optimized this index in
a while, and there are about 4000 deleted-docs. How can I confirm if we
lost anything? If we've lost docs, is there a way to recover it?
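
A quick, read-only check I am considering (host and core name are placeholders) is
the Luke request handler, which reports numDocs, maxDoc and deletedDocs for the
index:

curl "http://localhost:8983/solr/<core>/admin/luke?numTerms=0&wt=json"

Comparing numDocs between master and slave, or against the count in our source
system, should at least tell whether documents are missing; as far as I understand,
CheckIndex only detects physical corruption, not logically missing documents.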

Thanks in advance!!

Regards
Thomas


Re: leader split-brain at least once a day - need help

2015-01-13 Thread Thomas Lamy

Hi Mark,

we're currently at 4.10.2, update to 4.10.3 ist scheduled for tomorrow.

T

Am 12.01.15 um 17:30 schrieb Mark Miller:

bq. ClusterState says we are the leader, but locally we don't think so

Generally this is due to some bug. One bug that can lead to it was recently
fixed in 4.10.3 I think. What version are you on?

- Mark

On Mon Jan 12 2015 at 7:35:47 AM Thomas Lamy t.l...@cytainment.de wrote:


Hi,

I found no big/unusual GC pauses in the Log (at least manually; I found
no free solution to analyze them that worked out of the box on a
headless debian wheezy box). Eventually i tried with -Xmx8G (was 64G
before) on one of the nodes, after checking allocation after 1 hour run
time was at about 2-3GB. That didn't move the time frame where a restart
was needed, so I don't think Solr's JVM GC is the problem.
We're trying to get all of our node's logs (zookeeper and solr) into
Splunk now, just to get a better sorted view of what's going on in the
cloud once a problem occurs. We're also enabling GC logging for
zookeeper; maybe we were missing problems there while focussing on solr
logs.

Thomas


Am 08.01.15 um 16:33 schrieb Yonik Seeley:

It's worth noting that those messages alone don't necessarily signify
a problem with the system (and it wouldn't be called split brain).
The async nature of updates (and thread scheduling) along with
stop-the-world GC pauses that can change leadership, cause these
little windows of inconsistencies that we detect and log.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy t.l...@cytainment.de

wrote:

Hi there,

we are running a 3 server cloud serving a dozen
single-shard/replicate-everywhere collections. The 2 biggest

collections are

~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5,

Tomcat

7.0.56, Oracle Java 1.7.0_72-b14

10 of the 12 collections (the small ones) get filled by DIH full-import

once

a day starting at 1am. The second biggest collection is updated using

DIH

delta-import every 10 minutes, the biggest one gets bulk json updates

with

commits once in 5 minutes.

On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request

says it

is coming from leader, but we are the leader
or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor;

ClusterState

says we are the leader, but locally we don't think so

One of these pop up once a day at around 8am, making either some cores

going

to recovery failed state, or all cores of at least one cloud node into
state gone.
This started out of the blue about 2 weeks ago, without changes to

neither

software, data, or client behaviour.

Most of the time, we get things going again by restarting solr on the
current leader node, forcing a new election - can this be triggered

while

keeping solr (and the caches) up?
But sometimes this doesn't help, we had an incident last weekend where

our

admins didn't restart in time, creating millions of entries in
/solr/overseer/queue, making zk close the connection, and leader

re-elect

fails. I had to flush zk, and re-upload collection config to get solr up
again (just like in https://gist.github.com/

isoboroff/424fcdf63fa760c1d1a7).

We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections,

1500

requests/s) up and running, which does not have these problems since
upgrading to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG  Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



--
Thomas Lamy
Cytainment AG  Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476





--
Thomas Lamy
Cytainment AG  Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



Re: leader split-brain at least once a day - need help

2015-01-12 Thread Thomas Lamy

Hi,

I found no big/unusual GC pauses in the Log (at least manually; I found 
no free solution to analyze them that worked out of the box on a 
headless debian wheezy box). Eventually I tried with -Xmx8G (was 64G 
before) on one of the nodes, after checking allocation after 1 hour run 
time was at about 2-3GB. That didn't move the time frame where a restart 
was needed, so I don't think Solr's JVM GC is the problem.
We're trying to get all of our node's logs (zookeeper and solr) into 
Splunk now, just to get a better sorted view of what's going on in the 
cloud once a problem occurs. We're also enabling GC logging for 
zookeeper; maybe we were missing problems there while focussing on solr 
logs.
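
For reference, the GC logging flags I plan to add (Oracle Java 7 syntax; the log
path is just an example), e.g. via JVMFLAGS in ZooKeeper's conf/java.env and the
Tomcat JVM options for Solr:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/zookeeper/gc.log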


Thomas


Am 08.01.15 um 16:33 schrieb Yonik Seeley:

It's worth noting that those messages alone don't necessarily signify
a problem with the system (and it wouldn't be called split brain).
The async nature of updates (and thread scheduling) along with
stop-the-world GC pauses that can change leadership, cause these
little windows of inconsistencies that we detect and log.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy t.l...@cytainment.de wrote:

Hi there,

we are running a 3 server cloud serving a dozen
single-shard/replicate-everywhere collections. The 2 biggest collections are
~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat
7.0.56, Oracle Java 1.7.0_72-b14

10 of the 12 collections (the small ones) get filled by DIH full-import once
a day starting at 1am. The second biggest collection is updated using DIH
delta-import every 10 minutes, the biggest one gets bulk json updates with
commits once in 5 minutes.

On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it
is coming from leader, but we are the leader
or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
says we are the leader, but locally we don't think so

One of these pop up once a day at around 8am, making either some cores going
to recovery failed state, or all cores of at least one cloud node into
state gone.
This started out of the blue about 2 weeks ago, without changes to neither
software, data, or client behaviour.

Most of the time, we get things going again by restarting solr on the
current leader node, forcing a new election - can this be triggered while
keeping solr (and the caches) up?
But sometimes this doesn't help, we had an incident last weekend where our
admins didn't restart in time, creating millions of entries in
/solr/overseer/queue, making zk close the connection, and leader re-elect
fails. I had to flush zk, and re-upload collection config to get solr up
again (just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).

We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500
requests/s) up and running, which does not have these problems since
upgrading to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476




--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



Re: leader split-brain at least once a day - need help

2015-01-08 Thread Thomas Lamy

Hi Alan,
thanks for the pointer, I'll look at our gc logs

On 07.01.2015 at 15:46, Alan Woodward wrote:

I had a similar issue, which was caused by 
https://issues.apache.org/jira/browse/SOLR-6763.  Are you getting long GC 
pauses or similar before the leader mismatches occur?

Alan Woodward
www.flax.co.uk


On 7 Jan 2015, at 10:01, Thomas Lamy wrote:


Hi there,

we are running a 3 server cloud serving a dozen 
single-shard/replicate-everywhere collections. The 2 biggest collections are 
~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat 
7.0.56, Oracle Java 1.7.0_72-b14

10 of the 12 collections (the small ones) get filled by DIH full-import once a 
day starting at 1am. The second biggest collection is updated using DIH 
delta-import every 10 minutes, the biggest one gets bulk json updates with 
commits once in 5 minutes.

On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it is 
coming from leader, but we are the leader
or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says 
we are the leader, but locally we don't think so

One of these pop up once a day at around 8am, making either some cores going to recovery 
failed state, or all cores of at least one cloud node into state gone.
This started out of the blue about 2 weeks ago, without changes to 
software, data, or client behaviour.

Most of the time, we get things going again by restarting solr on the current 
leader node, forcing a new election - can this be triggered while keeping solr 
(and the caches) up?
But sometimes this doesn't help, we had an incident last weekend where our 
admins didn't restart in time, creating millions of entries in 
/solr/overseer/queue, making zk close the connection, and leader re-elect 
fails. I had to flush zk, and re-upload collection config to get solr up again 
(just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).

We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500 
requests/s) up and running, which does not have these problems since upgrading 
to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476






--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



leader split-brain at least once a day - need help

2015-01-07 Thread Thomas Lamy

Hi there,

we are running a 3 server cloud serving a dozen 
single-shard/replicate-everywhere collections. The 2 biggest collections 
are ~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, 
Tomcat 7.0.56, Oracle Java 1.7.0_72-b14


10 of the 12 collections (the small ones) get filled by DIH full-import 
once a day starting at 1am. The second biggest collection is updated 
using DIH delta-import every 10 minutes, the biggest one gets bulk json 
updates with commits once in 5 minutes.


On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request 
says it is coming from leader, but we are the leader

or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor; 
ClusterState says we are the leader, but locally we don't think so


One of these pop up once a day at around 8am, making either some cores 
going to recovery failed state, or all cores of at least one cloud 
node into state gone.
This started out of the blue about 2 weeks ago, without changes to 
software, data, or client behaviour.


Most of the time, we get things going again by restarting solr on the 
current leader node, forcing a new election - can this be triggered 
while keeping solr (and the caches) up?
But sometimes this doesn't help, we had an incident last weekend where 
our admins didn't restart in time, creating millions of entries in 
/solr/overseer/queue, making zk close the connection, and leader 
re-elect fails. I had to flush zk, and re-upload collection config to 
get solr up again (just like in 
https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
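
For reference, such a flush and re-upload can be done with the stock tooling. A
rough sketch only: the ZooKeeper host, the /solr chroot and all paths are
placeholders, zkcli.sh is the cloud-scripts helper shipped with Solr 4.x, and a
collection reload is typically needed afterwards.

$ zkCli.sh -server zk1:2181/solr
[zk: zk1:2181/solr(CONNECTED) 0] rmr /overseer/queue      # drop the piled-up overseer work items
[zk: zk1:2181/solr(CONNECTED) 1] quit

$ /opt/solr/example/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181/solr \
      -cmd upconfig -confdir /opt/solr/configs/mycollection/conf -confname mycollection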


We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 
1500 requests/s) up and running, which does not have these problems 
since upgrading to 4.10.2.



Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



TrieLongField not store large longs correctly

2014-11-26 Thread Thomas L. Redman
I believe I have encountered a bug in SOLR. I have a data type defined as 
follows:

<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>

And I have a field defined like so:

<field name="aid" type="long" indexed="true" stored="true" multiValued="false" required="true" omitNorms="true" />

I have not been able to reproduce this problem for smaller numbers, but for 
some of the very large numbers, the value that gets stored for this “aid” field 
is not the same as the number that gets indexed. For example, 20140716126615474 
is stored as 20140716126615470, or in any even, that is the way it is getting 
reported back. When I issue a query, “aid: 20140716126615474”, the value 
reported back for aid is 20140716126615470!

Any suggestions?

Re: TrieLongField not store large longs correctly

2014-11-26 Thread Thomas L. Redman
I was using the SOLR administrative interface to issue my queries. When I 
bypass the administrative interface and go directly to SOLR, the JSON return 
indicates the AID is as it should be. The issue is in the presentation layer of 
the Solr Admin UI. Which is good news.

Thanks all, my bad. Should have checked presentation layer first.
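
A quick way to see that only the display layer is involved (a sketch; core name,
port and parameters are the stock example values):

curl "http://localhost:8983/solr/collection1/select?q=aid:20140716126615474&wt=json&indent=true"
# the raw JSON on the wire carries "aid":20140716126615474 unchanged; the admin UI's
# JavaScript rounds it only on display, because the value is larger than 2^53, the
# largest integer a JavaScript double can represent exactly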

 On Nov 26, 2014, at 8:47 PM, Yonik Seeley yo...@heliosearch.com wrote:
 
 Yeah, XML was fine, JSON outside admin was fine... it's definitely
 just the client (admin).
 Oh, you meant the JSON formatting code in the client - yeah.
 Hopefully there is a way to fix it w/o sacrificing our nice syntax
 highlighting.
 
 -Yonik
 http://heliosearch.org - native code faceting, facet functions,
 sub-facets, off-heap data
 
 On Wed, Nov 26, 2014 at 9:41 PM, Alexandre Rafalovitch
 arafa...@gmail.com wrote:
 Sounds like a JSON formatting code then? What happens when the return
 format is XML?
 
 Also, what happens if the request is made with browser debug panel
 open and we can compare what is on the wire with what is in the
 browser?
 
 Regards,
   Alex.
 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
 
 
 On 26 November 2014 at 20:02, Yonik Seeley yo...@heliosearch.com wrote:
 On Wed, Nov 26, 2014 at 7:57 PM, Brendan Humphreys bren...@canva.com 
 wrote:
 I'd wager this is a loss of precision caused by Javascript rounding in the
 admin client. More details here:
 
 http://stackoverflow.com/questions/1379934/large-numbers-erroneously-rounded-in-javascript
 
 Ah, indeed - I was testing directly through the address bar, and not
 via the admin interface.
 I just tried the admin interface at
 http://localhost:8983/solr/#/collection1/query
 and I do see the rounding now.
 
 
 -Yonik
 http://heliosearch.org - native code faceting, facet functions,
 sub-facets, off-heap data



Indexing problems with BBoxField

2014-11-23 Thread Thomas Seidl
Hi all,

I just downloaded Solr 4.10.2 and wanted to try out the new BBoxField
type, but couldn't get it to work. The error (with status 400) I get is:

ERROR: [doc=foo] Error adding field
'bboxs_field_location_area'='ENVELOPE(25.89, 41.13, 47.07, 35.31)'
msg=java.lang.IllegalStateException: instead call createFields() because
isPolyField() is true

Which, of course, is rather unhelpful for a user.
The relevant portions of my schema.xml look like this (largely copied
from [1]:

<fieldType name="bbox" class="solr.BBoxField" geo="true" units="degrees"
    numberType="_bbox_coord" />
<fieldType name="_bbox_coord" class="solr.TrieDoubleField"
    precisionStep="8" stored="false" />
<dynamicField name="bboxs_*" type="bbox" indexed="true" stored="false"
    multiValued="false"/>

[1] https://cwiki.apache.org/confluence/display/solr/Spatial+Search

And the request I send is this:

<add>
  <doc>
    <field name="id">foo</field>
    <field name="bboxs_field_location_area">ENVELOPE(25.89, 41.13, 47.07, 35.31)</field>
  </doc>
</add>

Does anyone have any idea what could be going wrong here?

Thanks a lot in advance,
Thomas


Re: Problems after upgrade 4.10.1 - 4.10.2

2014-11-13 Thread Thomas Lamy

Hi,

a big thank you to Jeon Woosung - we just upgraded our cloud to 4.10.2.
One correction: we had to use 
/collections/{collection}/leader_initiated_recovery/shard1/node5, where 
node5 had to be replaced with the name under which the down node showed up in the 
solr cloud dashboard. Also, no tomcat restart was necessary - it was even 
counterproductive, since state changes may overwrite the just-fixed entry.



Best regards
Thomas


On 13.11.2014 at 05:47, Jeon Woosung wrote:

you can migrate zookeeper data manually.

1. connect zookeeper.
 - zkCli.sh -server host:port
2. check old data
 - get /collections/<your collection name>/leader_initiated_recovery/<your shard name>


[zk: localhost:3181(CONNECTED) 25] get
/collections/collection1/leader_initiated_recovery/shard1
*down*
cZxid = 0xe4
ctime = Thu Nov 13 13:38:53 KST 2014
mZxid = 0xe4
mtime = Thu Nov 13 13:38:53 KST 2014
pZxid = 0xe4
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 4
numChildren = 0


I guess that there is only a single word there, which is down.

3. delete the data.
 - delete /collections/<your collection name>/leader_initiated_recovery/<your shard name>

4. create new data.
 - create /collections/<your collection name>/leader_initiated_recovery/<your shard name> {state:down}

5. restart the server.
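
Put together, steps 1-5 might look roughly like the session below. A sketch only:
host/port, collection and shard names are placeholders, and the exact quoting of
the JSON payload can vary between zkCli versions.

$ zkCli.sh -server localhost:2181
[zk: localhost:2181(CONNECTED) 0] get /collections/collection1/leader_initiated_recovery/shard1
down
[zk: localhost:2181(CONNECTED) 1] delete /collections/collection1/leader_initiated_recovery/shard1
[zk: localhost:2181(CONNECTED) 2] create /collections/collection1/leader_initiated_recovery/shard1 {"state":"down"}
[zk: localhost:2181(CONNECTED) 3] quit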



On Thu, Nov 13, 2014 at 7:42 AM, Anshum Gupta ans...@anshumgupta.net
wrote:


Considering the impact, I think we should put this out as an announcement
on the 'news' section of the website warning people about this.

On Wed, Nov 12, 2014 at 12:33 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:


I opened https://issues.apache.org/jira/browse/SOLR-6732

On Wed, Nov 12, 2014 at 12:29 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:


Hi Thomas,

You're right, there's a back-compat break here. I'll open an issue.

On Wed, Nov 12, 2014 at 9:37 AM, Thomas Lamy t.l...@cytainment.de

wrote:

On 12.11.2014 at 15:29, Thomas Lamy wrote:


Hi there!

As we got bitten by https://issues.apache.org/jira/browse/SOLR-6530

on

a regular basis, we started upgrading our 7 node cloud from 4.10.1 to
4.10.2.
The first node upgrade worked like a charm.
After upgrading the second node, two cores no longer come up and we

get

the following error:

ERROR - 2014-11-12 15:17:34.226;

org.apache.solr.cloud.RecoveryStrategy;

Recovery failed - trying again... (16) core=cams_shard1_replica4
ERROR - 2014-11-12 15:17:34.230;

org.apache.solr.common.SolrException;

Error while trying to recover. core=onlinelist_shard1_
replica7rg.noggit.JSONParser$ParseException: JSON Parse Error:
char=d,position=0 BEFORE='d' AFTER='own'
 at org.noggit.JSONParser.err(JSONParser.java:223)
 at org.noggit.JSONParser.next(JSONParser.java:622)
 at org.noggit.JSONParser.nextEvent(JSONParser.java:663)
 at org.noggit.ObjectBuilder.init(ObjectBuilder.java:44)
 at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
 at org.apache.solr.common.cloud.ZkStateReader.fromJSON(
ZkStateReader.java:129)
 at

org.apache.solr.cloud.ZkController.getLeaderInitiatedRecoveryStat

eObject(ZkController.java:1925)
 at

org.apache.solr.cloud.ZkController.getLeaderInitiatedRecoveryStat

e(ZkController.java:1890)
 at org.apache.solr.cloud.ZkController.publish(
ZkController.java:1071)
 at org.apache.solr.cloud.ZkController.publish(
ZkController.java:1041)
 at org.apache.solr.cloud.ZkController.publish(
ZkController.java:1037)
 at org.apache.solr.cloud.RecoveryStrategy.doRecovery(
RecoveryStrategy.java:355)
 at org.apache.solr.cloud.RecoveryStrategy.run(
RecoveryStrategy.java:235)

Any hint on how to solve this? Google didn't reveal anything

useful...


Kind regards
Thomas

  Just switched to INFO loglevel:

INFO  - 2014-11-12 15:30:31.563;

org.apache.solr.cloud.RecoveryStrategy;

Publishing state of core onlinelist_shard1_replica7 as recovering,

leader

is http://solr-bc1-blade2:8080/solr/onlinelist_shard1_replica2/ and I

am

http://solr-bc1-blade3:8080/solr/onlinelist_shard1_replica7/
INFO  - 2014-11-12 15:30:31.563;

org.apache.solr.cloud.RecoveryStrategy;

Publishing state of core cams_shard1_replica4 as recovering, leader is
http://solr-bc1-blade2:8080/solr/cams_shard1_replica2/ and I am
http://solr-bc1-blade3:8080/solr/cams_shard1_replica4/
INFO  - 2014-11-12 15:30:31.563; org.apache.solr.cloud.ZkController;
publishing core=onlinelist_shard1_replica7 state=recovering
collection=onlinelist
INFO  - 2014-11-12 15:30:31.563; org.apache.solr.cloud.ZkController;
publishing core=cams_shard1_replica4 state=recovering collection=cams
ERROR - 2014-11-12 15:30:31.564; org.apache.solr.common.SolrException;
Error while trying to recover. core=cams_shard1_replica4rg.
noggit.JSONParser$ParseException: JSON Parse Error: char=d,position=0
BEFORE='d' AFTER='own'
ERROR

Problems after upgrade 4.10.1 - 4.10.2

2014-11-12 Thread Thomas Lamy

Hi there!

As we got bitten by https://issues.apache.org/jira/browse/SOLR-6530 on a 
regular basis, we started upgrading our 7 node cloud from 4.10.1 to 4.10.2.

The first node upgrade worked like a charm.
After upgrading the second node, two cores no longer come up and we get 
the following error:


ERROR - 2014-11-12 15:17:34.226; org.apache.solr.cloud.RecoveryStrategy; 
Recovery failed - trying again... (16) core=cams_shard1_replica4
ERROR - 2014-11-12 15:17:34.230; org.apache.solr.common.SolrException; 
Error while trying to recover. 
core=onlinelist_shard1_replica7rg.noggit.JSONParser$ParseException: JSON 
Parse Error: char=d,position=0 BEFORE='d' AFTER='own'

at org.noggit.JSONParser.err(JSONParser.java:223)
at org.noggit.JSONParser.next(JSONParser.java:622)
at org.noggit.JSONParser.nextEvent(JSONParser.java:663)
at org.noggit.ObjectBuilder.init(ObjectBuilder.java:44)
at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
at 
org.apache.solr.common.cloud.ZkStateReader.fromJSON(ZkStateReader.java:129)
at 
org.apache.solr.cloud.ZkController.getLeaderInitiatedRecoveryStateObject(ZkController.java:1925)
at 
org.apache.solr.cloud.ZkController.getLeaderInitiatedRecoveryState(ZkController.java:1890)

at org.apache.solr.cloud.ZkController.publish(ZkController.java:1071)
at org.apache.solr.cloud.ZkController.publish(ZkController.java:1041)
at org.apache.solr.cloud.ZkController.publish(ZkController.java:1037)
at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:355)
at 
org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)


Any hint on how to solve this? Google didn't reveal anything useful...


Kind regards
Thomas

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



Re: Problems after upgrade 4.10.1 - 4.10.2

2014-11-12 Thread Thomas Lamy

On 12.11.2014 at 15:29, Thomas Lamy wrote:

Hi there!

As we got bitten by https://issues.apache.org/jira/browse/SOLR-6530 on 
a regular basis, we started upgrading our 7 node cloud from 4.10.1 to 
4.10.2.

The first node upgrade worked like a charm.
After upgrading the second node, two cores no longer come up and we 
get the following error:


ERROR - 2014-11-12 15:17:34.226; 
org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying 
again... (16) core=cams_shard1_replica4
ERROR - 2014-11-12 15:17:34.230; org.apache.solr.common.SolrException; 
Error while trying to recover. 
core=onlinelist_shard1_replica7rg.noggit.JSONParser$ParseException: 
JSON Parse Error: char=d,position=0 BEFORE='d' AFTER='own'

at org.noggit.JSONParser.err(JSONParser.java:223)
at org.noggit.JSONParser.next(JSONParser.java:622)
at org.noggit.JSONParser.nextEvent(JSONParser.java:663)
at org.noggit.ObjectBuilder.init(ObjectBuilder.java:44)
at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
at 
org.apache.solr.common.cloud.ZkStateReader.fromJSON(ZkStateReader.java:129)
at 
org.apache.solr.cloud.ZkController.getLeaderInitiatedRecoveryStateObject(ZkController.java:1925)
at 
org.apache.solr.cloud.ZkController.getLeaderInitiatedRecoveryState(ZkController.java:1890)

at org.apache.solr.cloud.ZkController.publish(ZkController.java:1071)
at org.apache.solr.cloud.ZkController.publish(ZkController.java:1041)
at org.apache.solr.cloud.ZkController.publish(ZkController.java:1037)
at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:355)
at 
org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)


Any hint on how to solve this? Google didn't reveal anything useful...


Kind regards
Thomas


Just switched to INFO loglevel:

INFO  - 2014-11-12 15:30:31.563; org.apache.solr.cloud.RecoveryStrategy; 
Publishing state of core onlinelist_shard1_replica7 as recovering, 
leader is http://solr-bc1-blade2:8080/solr/onlinelist_shard1_replica2/ 
and I am http://solr-bc1-blade3:8080/solr/onlinelist_shard1_replica7/
INFO  - 2014-11-12 15:30:31.563; org.apache.solr.cloud.RecoveryStrategy; 
Publishing state of core cams_shard1_replica4 as recovering, leader is 
http://solr-bc1-blade2:8080/solr/cams_shard1_replica2/ and I am 
http://solr-bc1-blade3:8080/solr/cams_shard1_replica4/
INFO  - 2014-11-12 15:30:31.563; org.apache.solr.cloud.ZkController; 
publishing core=onlinelist_shard1_replica7 state=recovering 
collection=onlinelist
INFO  - 2014-11-12 15:30:31.563; org.apache.solr.cloud.ZkController; 
publishing core=cams_shard1_replica4 state=recovering collection=cams
ERROR - 2014-11-12 15:30:31.564; org.apache.solr.common.SolrException; 
Error while trying to recover. 
core=cams_shard1_replica4rg.noggit.JSONParser$ParseException: JSON Parse 
Error: char=d,position=0 BEFORE='d' AFTER='own'
ERROR - 2014-11-12 15:30:31.564; org.apache.solr.common.SolrException; 
Error while trying to recover. 
core=onlinelist_shard1_replica7rg.noggit.JSONParser$ParseException: JSON 
Parse Error: char=d,position=0 BEFORE='d' AFTER='own'
ERROR - 2014-11-12 15:30:31.564; org.apache.solr.cloud.RecoveryStrategy; 
Recovery failed - trying again... (5) core=cams_shard1_replica4
ERROR - 2014-11-12 15:30:31.564; org.apache.solr.cloud.RecoveryStrategy; 
Recovery failed - trying again... (5) core=onlinelist_shard1_replica7
INFO  - 2014-11-12 15:30:31.564; org.apache.solr.cloud.RecoveryStrategy; 
Wait 60.0 seconds before trying to recover again (6)
INFO  - 2014-11-12 15:30:31.564; org.apache.solr.cloud.RecoveryStrategy; 
Wait 60.0 seconds before trying to recover again (6)


The leader for both collections (solr-bc1-blade2) is still on 4.10.1.
As no special instructions were given in the release notes and it's a 
minor upgrade, we thought there should be no BC issues and planned to 
upgrade one node after the other.


Did that provide more insight?

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



Re: Suggester not suggesting anything using DictionaryCompoundWordTokenFilterFactory

2014-11-11 Thread Thomas Michael Engelke
 I think I found the problem. The definition of the suggester component
has a field option which references the field that the suggester uses
to generate suggestions. Changing this to the field using the
DictionaryCompoundWordTokenFilterFactory also suggests word parts.

On 11.11.2014 08:52, Thomas Michael Engelke wrote: 

 I'm toying around with the suggester component, like described here: 
 http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx
  [1]
 
 So I made 4 fields:
 
 <field name="text_suggest" type="text_suggest" indexed="true" stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest" />
 <field name="text_suggest_edge" type="text_suggest_edge" indexed="true" stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest_edge" />
 <field name="text_suggest_ngram" type="text_suggest_ngram" indexed="true" stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest_ngram" />
 <field name="text_suggest_dictionary_ngram" type="text_suggest_dictionary_ngram" indexed="true" stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest_dictionary_ngram" />
 
 with the corresponding definitions:
 
 <fieldType name="text_suggest" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory" />
     <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
 </fieldType>
 <fieldType name="text_suggest_edge" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory" />
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
   </analyzer>
 </fieldType>
 <fieldType name="text_suggest_ngram" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
 <fieldType name="text_suggest_dictionary_ngram" class="solr.TextField">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt"
             minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
 </fieldType>
 
 I'm calling the suggester component this way:
 
 http://address:8983/solr/core/suggest?qf=text_suggest^6.0%20test_suggest_edge^3.0%20text_suggest_ngram^1.0%20text_suggest_dictionary_ngram^0.2&q=wa
 
 This seems to work fine:
 
 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">0</int>
   </lst>
   <lst name="spellcheck">
     <lst name="suggestions">
       <lst name="wa">
         <int name="numFound">5</int>
         <int name="startOffset">0</int>
         <int name="endOffset">2</int>
         <arr name="suggestion">
           <str>wandelement aus gitter</str>
           <str>wandelement aus stahlblech</str>
           <str>wandelement</str>
           <str>wandhalter für prospekte</str>
           <str>wandascher, h 300 × b 230 × t 60 mm</str>
         </arr>
       </lst>
       <str name="collation">(wandelement aus gitter)</str>
     </lst>
   </lst>
 </response>
 
 However, I added the fourth field so I could get low-boosted suggestions 
 using the aforementioned DictionaryCompoundWordTokenFilterFactory. A sample 
 analysis for the field(type) text_suggest_dictionary_ngram for the word 
 Geländewagen:
 
 g
 ge
 gel
 gelä
 gelän
 geländ
 gelände
 geländew
 geländewa
 geländewag
 geländewage
 geländewagen
 g
 ge
 gel
 gelä
 gelän
 geländ
 gelände
 w
 wa
 wag
 wage
 wagen
 
 As we can see, the DictionaryCompoundWordTokenFilterFactory extracts the word 
 wagen and EdgeNGrams it. However, I cannot get results from these NGrams. 
 Trying wag as the search term for the suggester, there are no results.
 
 However, doing an analysis of Geländewagen (as field value index) and wag 
 (as field value query), analysis shows a match.
 
 I had the thought that it might be because the underlying component of the 
 suggester is a spellchecker, and a spellchecker wouldn't correct wag to 
 wagen because there was an NGram that spelled wag, and so the word was 
 spelled correctly already. So I tried without the EdgeNGrams, but the result 
 stays the same.
 

Links:
--
[1]
http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx

How to suggest from multiple fields?

2014-11-11 Thread Thomas Michael Engelke
Like in this article 
(http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx), 
I am using multiple fields to generate different options for an 
autosuggest functionality:


- First, the whole field (top priority)
- Then, the whole field as EdgeNGrams from the left side (normal 
priority)

- Lastly, single words or word parts (compound words) as EdgeNGrams

However, I was not very successful in supplying a single requestHandler 
(/suggest) with data from multiple suggesters. I have also not been 
able to find any sample of how this might be done correctly.


Is there a sample that I can read, or a documentation of how this might 
be done? The referenced article was doing it, yet only marginally 
described the technical implementation.
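
For what it's worth, the article's weighting scheme can also be reproduced without
a dedicated suggester at all, by sending the typed prefix through edismax with
per-field boosts against the suggest fields. A sketch only; the core name, field
names and boosts below are the ones used in the related suggester thread, and
fl=name assumes the product name field:

curl "http://localhost:8983/solr/core/select" \
     --data-urlencode "q=wa" \
     --data-urlencode "defType=edismax" \
     --data-urlencode "qf=text_suggest^6.0 text_suggest_edge^3.0 text_suggest_ngram^1.0 text_suggest_dictionary_ngram^0.2" \
     --data-urlencode "fl=name" \
     --data-urlencode "rows=10" \
     --data-urlencode "wt=json"

That does not answer how to wire several suggester dictionaries into one /suggest
handler, but it gives the same ranked, multi-field behaviour the article describes.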


Re: Best practice: Autosuggest/autocomplete vs. real search

2014-11-10 Thread Thomas Michael Engelke
 The dedicated autosuggest field is not used by a suggester component,
instead we just directly query it (/select). I'm trying to read my way
into how the suggesters work, and toying around with some configurations
(For instance from here:
http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx).

Compared to how you can analyze search results through the Solr backend,
the analysis of suggester results seems to be sorely lacking.
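
(One concrete way to check which prefixes exist as real index terms is the
TermsComponent Mike mentions below; a sketch, assuming the stock /terms handler
from the example solrconfig is enabled and the field is called name:)

curl "http://localhost:8983/solr/core/terms?terms.fl=name&terms.prefix=mot&terms.limit=10&wt=json"
# only terms that actually occur in the indexed name field come back, so
# suggestions built from that list are guaranteed to match a search on that field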

On 10.11.2014 14:37, Michael Sokolov wrote: 

 The goal is to ensure that suggestions from autocomplete are actually terms 
 in the main index, so that the suggestions will actually result in matches. 
 You've considered expanding the main index by adding the suggestion n-grams 
 to it, but it would probably be better to alter your suggester so that it 
 produces only tokens that are in the main index. I think this is basically 
 how all the Suggester implementations are designed to work already; are you 
 using one of those, or are you using the TermsComponent, or something else?
 
 -Mike
 
 On 11/10/14 2:54 AM, Thomas Michael Engelke wrote:
 
 We're using Solr as a backend for an ECommerce site/system. The Solr index 
 stores products with selected attributes, as well as a dedicated field for 
 autocomplete suggestions (done via AJAX request when typing in the search 
 box without pressing return). The autosuggest field is supplied by copyField 
 directives from certain select product attribute fields (description and/or 
 name mostly). It uses EdgeNGramFilterFactory to complete words not yet typed 
 completely, and it works quite well. However, we come across an issue with a 
 disconnect between the autosuggest results and the results of a normal search, 
 that is, a query over the full fields of the product. Let's say there are 
 products that are called motor.

 - When autosuggesting, typing mot suggests all products with motor, because 
   the EdgeNGram created m, mo, mot, moto and motor, respectively, and it matches.
 - When searching for mot, however (i.e. pressing enter when seeing the 
   autosuggestions), it doesn't find any products. The autosuggest field is not 
   part of the real search, and no product attribute contains mot as a word.

 One obvious solution would be to incorporate the autosuggest field into the 
 real search; however, this adds many tokens to the index that aren't really 
 part of the products indexed and makes for strange search results, for example 
 when an NGram is also a word but the record itself contains the search term 
 only as part of a word. Are there clever solutions to this problem?
 

Suggester not suggesting anything using DictionaryCompoundWordTokenFilterFactory

2014-11-10 Thread Thomas Michael Engelke
I'm toying around with the suggester component, like described here: 
http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx


So I made 4 fields:

 <field name="text_suggest" type="text_suggest" indexed="true" stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest" />
 <field name="text_suggest_edge" type="text_suggest_edge" indexed="true" stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest_edge" />
 <field name="text_suggest_ngram" type="text_suggest_ngram" indexed="true" stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest_ngram" />
 <field name="text_suggest_dictionary_ngram" type="text_suggest_dictionary_ngram" indexed="true" stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest_dictionary_ngram" />

with the corresponding definitions:

 <fieldType name="text_suggest" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory" />
     <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
 </fieldType>
 <fieldType name="text_suggest_edge" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory" />
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
   </analyzer>
 </fieldType>
 <fieldType name="text_suggest_ngram" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
 <fieldType name="text_suggest_dictionary_ngram" class="solr.TextField">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt"
             minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
 </fieldType>

I'm calling the suggester component this way:

http://address:8983/solr/core/suggest?qf=text_suggest^6.0%20test_suggest_edge^3.0%20text_suggest_ngram^1.0%20text_suggest_dictionary_ngram^0.2&q=wa

This seems to work fine:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="wa">
        <int name="numFound">5</int>
        <int name="startOffset">0</int>
        <int name="endOffset">2</int>
        <arr name="suggestion">
          <str>wandelement aus gitter</str>
          <str>wandelement aus stahlblech</str>
          <str>wandelement</str>
          <str>wandhalter für prospekte</str>
          <str>wandascher, h 300 × b 230 × t 60 mm</str>
        </arr>
      </lst>
      <str name="collation">(wandelement aus gitter)</str>
    </lst>
  </lst>
</response>

However, I added the fourth field so I could get low-boosted suggestions 
using the aforementioned DictionaryCompoundWordTokenFilterFactory. A 
sample analysis for the field(type) text_suggest_dictionary_ngram for 
the word Geländewagen:


g
ge
gel
gelä
gelän
geländ
gelände
geländew
geländewa
geländewag
geländewage
geländewagen
g
ge
gel
gelä
gelän
geländ
gelände
w
wa
wag
wage
wagen

As we can see, the DictionaryCompoundWordTokenFilterFactory extracts the 
word wagen and EdgeNGrams it. However, I cannot get results from these 
NGrams. Trying wag as the search term for the suggester, there are no 
results.


However, doing an analysis of Geländewagen (as field value index) and 
wag (as field value query), analysis shows a match.


I had the thought that it might be because the underlying component of 
the suggester is a spellchecker, and a spellchecker wouldn't correct 
wag to wagen because there was an NGram that spelled wag, and so 
the word was spelled correctly already. So I tried without the 
EdgeNGrams, but the result stays the same.

