Solr Expand throws NPE along with elevate component

2018-02-20 Thread Aman Deep singh
Hi,
I’m facing an issue with the expand component when it is used alongside the elevate
component.
For some requests (not all of them) the expand component throws an NPE; the stack
trace is below. Any idea why the sort inside ArrayTimSorter hits a null value, and any
way to avoid that?

Solr Log stack trace-
2018-02-21 05:09:06.404 ERROR (qtp444920847-16) [c:test s:shard1 r:core_node1 
x:test_shard1_replica1] o.a.s.s.HttpSolrCall null:java.io.IOException: 
java.lang.NullPointerException
at 
org.apache.solr.handler.component.ExpandComponent.process(ExpandComponent.java:339)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:304)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
java.util.Comparators$NaturalOrderComparator.compare(Comparators.java:52)
at 
java.util.Comparators$NaturalOrderComparator.compare(Comparators.java:47)
at org.apache.lucene.util.ArrayTimSorter.compare(ArrayTimSorter.java:48)
at org.apache.lucene.util.Sorter.comparePivot(Sorter.java:50)
at org.apache.lucene.util.Sorter.binarySort(Sorter.java:197)
at org.apache.lucene.util.TimSorter.nextRun(TimSorter.java:120)
at org.apache.lucene.util.TimSorter.sort(TimSorter.java:201)
at org.apache.lucene.util.ArrayUtil.timSort(ArrayUtil.java:426)
at org.apache.lucene.util.ArrayUtil.timSort(ArrayUtil.java:445)
at org.apache.lucene.util.ArrayUtil.timSort(ArrayUtil.java:453)
at 
org.apache.lucene.search.TermInSetQuery.<init>(TermInSetQuery.java:87)
at 
org.apache.lucene.search.TermInSetQuery.<init>(TermInSetQuery.java:109)
at 
org.apache.solr.handler.component.ExpandComponent.getGroupQuery(ExpandComponent.java:718)
at 
org.apache.solr.handler.component.ExpandComponent.process(ExpandComponent.java:337)
... 34 more

Response stack trace-
java.io.IOException: java.lang.NullPointerException\n\tat 
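
Reproducing just the sorting part outside Solr, a null element in the array being
sorted produces the same exception from the natural-order comparator (a minimal
sketch, not Solr code):

```
import java.util.Arrays;
import java.util.Comparator;

public class NullSortDemo {
    public static void main(String[] args) {
        // One null element among the values to be sorted, e.g. a missing term.
        String[] terms = {"shoes", null, "shirts"};

        // Comparator.naturalOrder() calls compareTo() on the elements, so a null
        // element makes the sort throw java.lang.NullPointerException from
        // Comparators$NaturalOrderComparator.compare, as in the trace above.
        Arrays.sort(terms, Comparator.naturalOrder());
    }
}
```

So it looks like one of the group terms handed to TermInSetQuery by
ExpandComponent.getGroupQuery is null, though I don't know why that happens.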

Re: Streaming Expressions using Solrj.io

2018-02-20 Thread Shawn Heisey

On 2/20/2018 7:54 PM, Ryan Yacyshyn wrote:

I'd like to get a stream of search results using the solrj.io package but
running into a small issue.





Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.http.impl.client.HttpClientBuilder.evictIdleConnections(JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/impl/client/HttpClientBuilder;


There is a problem accessing the HttpClient library. Either the 
httpclient jar is missing from your project, or it's the wrong version.  
You can use pretty much any 4.5.x version for recent SolrJ versions.  
3.x versions won't work at all, and older 4.x versions won't work either.  The 
5.0 beta releases also won't work. You can find information about 4.0 
and later versions of HttpClient here:


http://hc.apache.org/

If you use a dependency manager like gradle, maven, or ivy for your 
project, just be sure it's set to pull in all transitive dependencies 
for solrj, and you should be fine.  If you manage dependencies manually, 
you will find all of the extra jars required by the solrj client in the 
download, in the dist/solrj-lib directory.  Note that you can very 
likely upgrade individual dependencies to newer versions than Solr 
includes with no issues.
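
If you're not sure which httpclient jar is actually being picked up at runtime, 
something like this (a rough diagnostic sketch, not required code) will tell you 
where the class was loaded from and what version the library reports:

```
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.VersionInfo;

public class HttpClientVersionCheck {
    public static void main(String[] args) {
        // Which jar the HttpClientBuilder class was loaded from.
        System.out.println(HttpClientBuilder.class
                .getProtectionDomain().getCodeSource().getLocation());

        // The release version reported by the httpclient library itself.
        VersionInfo vi = VersionInfo.loadVersionInfo(
                "org.apache.http.client", HttpClientBuilder.class.getClassLoader());
        System.out.println(vi != null ? vi.getRelease() : "no version info found");
    }
}
```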


Thanks,
Shawn



Streaming Expressions using Solrj.io

2018-02-20 Thread Ryan Yacyshyn
Hello all,

I'd like to get a stream of search results using the solrj.io package, but I'm
running into a small issue. It seems to have something to do with
HttpClientUtil. I'm testing on SolrCloud 7.1.0, using the
sample_techproducts_configs configs, and I have indexed the manufacturers.xml file.

I'm following the test code in the method `testCloudSolrStreamWithZkHost`
found in StreamExpressionTest.java:

```
package ca.ryac.testing;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.io.SolrClientCache;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.client.solrj.io.stream.StreamContext;
import org.apache.solr.client.solrj.io.stream.TupleStream;
import org.apache.solr.client.solrj.io.stream.expr.StreamExpression;
import org.apache.solr.client.solrj.io.stream.expr.StreamExpressionParser;
import org.apache.solr.client.solrj.io.stream.expr.StreamFactory;

public class SolrStreamingClient {

  String zkHost = "localhost:9983";
  String COLLECTIONORALIAS = "gettingstarted";

  public SolrStreamingClient() throws Exception {
    init();
  }

  public static void main(String[] args) throws Exception {
    new SolrStreamingClient();
  }

  private void init() throws Exception {

    System.out.println(zkHost);

    StreamFactory factory = new StreamFactory();

    StreamExpression expression;
    CloudSolrStream stream;
    List<Tuple> tuples;
    StreamContext streamContext = new StreamContext();
    SolrClientCache solrClientCache = new SolrClientCache();
    streamContext.setSolrClientCache(solrClientCache);

    // basic test..
    String expr = "search(" + COLLECTIONORALIAS + ", zkHost=\"" + zkHost
        + "\", q=*:*, fl=\"id,compName_s\", sort=\"compName_s asc\")";

    System.out.println(expr);
    expression = StreamExpressionParser.parse(expr);

    stream = new CloudSolrStream(expression, factory);
    stream.setStreamContext(streamContext);
    tuples = getTuples(stream);

    System.out.println(tuples.size());
  }

  protected List<Tuple> getTuples(TupleStream tupleStream) throws IOException {
    List<Tuple> tuples = new ArrayList<>();

    try {
      System.out.println("open stream..");
      tupleStream.open();
      // read until the EOF marker tuple is returned
      for (Tuple t = tupleStream.read(); !t.EOF; t = tupleStream.read()) {
        tuples.add(t);
      }
    } finally {
      tupleStream.close();
    }
    return tuples;
  }
}
```

And this is the output I get:

---
localhost:9983
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
search(gettingstarted, zkHost="localhost:9983", q=*:*, fl="id,compName_s",
sort="compName_s asc")
open stream..
Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.http.impl.client.HttpClientBuilder.evictIdleConnections(JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/impl/client/HttpClientBuilder;
  at
org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:279)
  at
org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:298)
  at
org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:236)
  at
org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:223)
  at
org.apache.solr.client.solrj.impl.CloudSolrClient.<init>(CloudSolrClient.java:276)
  at
org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.build(CloudSolrClient.java:1525)
  at
org.apache.solr.client.solrj.io.SolrClientCache.getCloudSolrClient(SolrClientCache.java:62)
  at
org.apache.solr.client.solrj.io.stream.TupleStream.getShards(TupleStream.java:138)
  at
org.apache.solr.client.solrj.io.stream.CloudSolrStream.constructStreams(CloudSolrStream.java:368)
  at
org.apache.solr.client.solrj.io.stream.CloudSolrStream.open(CloudSolrStream.java:274)
  at
ca.ryac.testing.SolrStreamingClient.getTuples(SolrStreamingClient.java:61)
  at ca.ryac.testing.SolrStreamingClient.init(SolrStreamingClient.java:51)
  at ca.ryac.testing.SolrStreamingClient.<init>(SolrStreamingClient.java:22)
  at ca.ryac.testing.SolrStreamingClient.main(SolrStreamingClient.java:26)
---

It's not finding or connecting to my SolrCloud instance; I can put
*anything* in zkHost and get the same results. I'm not really sure why it can't
find or connect to it. Any thoughts or ideas?

Thank you,
Ryan


Re: Filesystems supported by Solr

2018-02-20 Thread Shawn Heisey
On 2/20/2018 3:22 PM, Ritesh Chaman wrote:
> May I know what all filesystems are supported by Solr. For eg ADLS,WASB, S3
> etc. Thanks.

Solr supports whatever your operating system supports.  It will expect
file locking to be fully functional, so things like NFS don't
always work.  Local filesystems are very much preferred, and will
generally have the best performance.

As far as I am aware, the only filesystem that Solr has explicit support
for (outside of what the OS itself provides) is HDFS.

https://lucene.apache.org/solr/guide/7_2/running-solr-on-hdfs.html

There may be plugins available to store indexes in other stores like S3,
but if those exist, I am not immediately aware of them.  They would be
third-party plugins, not supported by the Solr project.

Thanks,
Shawn



Re: Filesystems supported by Solr

2018-02-20 Thread Rick Leir
Ritesh
The filesystems you mention are used by Spark so it can stream huge quantities 
of data (corrections please).

By comparison, Solr uses a more 'reasonable' sized filesystem, but needs enough 
memory that all the index data can be resident. The regular Linux ext3 or ext4 
is fine.

If you are integrating Solr with Spark, then the filesystems you mention would 
be for Spark, not Solr. 
Cheers -- Rick


On February 20, 2018 5:22:33 PM EST, Ritesh Chaman  
wrote:
>Hi team
>
>May I know what all filesystems are supported by Solr. For eg
>ADLS,WASB, S3
>etc. Thanks.
>
>Ritesh

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Filesystems supported by Solr

2018-02-20 Thread Ritesh Chaman
Hi team

May I know what all filesystems are supported by Solr. For eg ADLS,WASB, S3
etc. Thanks.

Ritesh


Re: storing large text fields in a database? (instead of inside index)

2018-02-20 Thread Roman Chyla
Say there is a high load and I'd like to bring up a new machine and let it
replicate the index. If 100 GB or more can be shaved off, it will have a
significant impact on how quickly the new searcher is ready and added to
the cluster. The impact on search speed is likely minimal.

We are investigating the idea of two clusters, but I have to say it seems more
complex to me than storing/loading a field from an external source.
Having said that, I wonder why this was not done before (maybe it was) and
what the cons are (besides the obvious ones: maintenance, and the database
being a potential point of failure; in that case I'd miss highlights, which I
can live with...)

On Tue, Feb 20, 2018 at 10:36 AM, David Hastings <
hastings.recurs...@gmail.com> wrote:

> Really depends on what you consider too large, and why the size is a big
> issue, since most replication will go at about 100 MB/second give or take,
> and replicating a 300GB index is only an hour or two.  What i do for this
> purpose is store my text in a separate index altogether, and call on that
> core for highlighting.  So for my use case, the primary index with no
> stored text is around 300GB and replicates as needed, and the full text
> indexes with stored text totals around 500GB and are replicating non stop.
> All searching goes against the primary index, and for highlighting i call
> on the full text indexes that have a stupid simple schema.  This has worked
> for me pretty well at least.
>
> On Tue, Feb 20, 2018 at 10:27 AM, Roman Chyla 
> wrote:
>
> > Hello,
> >
> > We have a use case of a very large index (slave-master; for unrelated
> > reasons the search cannot work in the cloud mode) - one of the fields is
> a
> > very large text, stored mostly for highlighting. To cut down the index
> size
> > (for purposes of replication/scaling) I thought I could try to save it
> in a
> > database - and not in the index.
> >
> > Lucene has codecs - one of the methods is for 'stored field', so that
> seems
> > likes a natural path for me.
> >
> > However, I'd expect somebody else before had a similar problem. I googled
> > and couldn't find any solutions. Using the codecs seems really good thing
> > for this particular problem, am I missing something? Is there a better
> way
> > to cut down on index size? (besides solr cloud/sharding, compression)
> >
> > Thank you,
> >
> >Roman
> >
>


Re: What is “high cardinality” in facet streams?

2018-02-20 Thread Joel Bernstein
The rollup streaming expression rolls up aggregations on a stream that has
been sorted by the group by fields. This is basically a MapReduce reduce
operation and can work with extremely high cardinality (basically
unlimited). The rollup function is designed to rollup data produced by the
/export handler which can also sort data sets with very high cardinality.
The docs should describe the correct usage of the rollup expression with
the /export handler.

Joel Bernstein
http://joelsolr.blogspot.com/
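
As a rough sketch (the collection and field names here are made up), a rollup over
an /export-backed search looks like this:

```
rollup(
  search(myCollection,
         q="*:*",
         fl="author_s,count_i",
         sort="author_s asc",
         qt="/export"),
  over="author_s",
  count(*),
  sum(count_i))
```

The inner search must be sorted on the same field(s) the rollup groups over, which
is what makes the MapReduce-style reduce possible.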

On Tue, Feb 20, 2018 at 11:10 AM, Shawn Heisey  wrote:

> On 2/20/2018 4:44 AM, Alfonso Muñoz-Pomer Fuentes wrote:
>
>> We have a query that we can resolve using either facet or search with
>> rollup. In the Stream Source Reference section of Solr’s Reference Guide (
>> https://lucene.apache.org/solr/guide/7_1/stream-source-refe
>> rence.html#facet) it says “To support high cardinality aggregations see
>> the rollup function”. I was wondering what it’s considered “high
>> cardinality”. If it serves, our query returns up to 60k results. I haven’t
>> got to do any benchmarking to see if there’s any difference, though,
>> because facet so far performs very well, but I don’t know if I’m near the
>> “tipping point”. Any feedback would be appreciated.
>>
>
> There's no hard and fast rule for this.  The tipping point is going to be
> different for every use case.  With a little bit of information about your
> setup, experienced users can make an educated guess about whether or not
> performance will be good, but cannot say with absolute certainty what
> you're going to run into.
>
> Let's start with some definitions, which you may or may not already know:
>
> https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
> https://en.wikipedia.org/wiki/Cardinality
>
> You haven't said how many unique values are in your field.  The only
> information I have from you is 60K results from your queries, which may or
> may not have any bearing on the total number of documents in your index, or
> the total number of unique values in the field you're using for faceting.
> So the next paragraph may or may not apply to your index.
>
> In general, 60,000 unique values in a field would be considered very low
> cardinality, because computers can typically operate on 60,000 values
> *very* quickly, unless the size of each value is enormous.  But if the
> index has 60,000 total documents, then *in relation to other data*, the
> cardinality is very high, even though most people would say the opposite.
> Sixty thousand documents or unique values is almost always a very small
> index, not prone to performance issues.
>
> The warnings about cardinality in the Solr documentation mostly refer to
> *absolute* cardinality -- how many unique values there are in a field,
> regardless of the actual number of documents.  If there are millions or
> billions of unique values, then operations like facets, grouping, sorting,
> etc are probably going to be slow.  If there are a lot less, such as
> thousands or only a handful, then those operations are likely to be very
> fast, because the computer will have less information it must process.
>
> Thanks,
> Shawn
>
>


solr.DictionaryCompoundWordTokenFilterFactory filter and double quotes

2018-02-20 Thread Natarajan, Rajeswari
 Hi,

We have the field type below defined in our schema.xml to support German compound
word search. This works fine. But even when double quotes are present in the search
term, it still gets split. Is there a way not to split the term when double quotes are
present in the query, with this field type?

[fieldType definition stripped by the mailing-list archive]

Thanks in Advance,
Rajeswari  



Re: What is “high cardinality” in facet streams?

2018-02-20 Thread Shawn Heisey

On 2/20/2018 4:44 AM, Alfonso Muñoz-Pomer Fuentes wrote:

We have a query that we can resolve using either facet or search with rollup. 
In the Stream Source Reference section of Solr’s Reference Guide 
(https://lucene.apache.org/solr/guide/7_1/stream-source-reference.html#facet) 
it says “To support high cardinality aggregations see the rollup function”. I 
was wondering what it’s considered “high cardinality”. If it serves, our query 
returns up to 60k results. I haven’t got to do any benchmarking to see if 
there’s any difference, though, because facet so far performs very well, but I 
don’t know if I’m near the “tipping point”. Any feedback would be appreciated.


There's no hard and fast rule for this.  The tipping point is going to 
be different for every use case.  With a little bit of information about 
your setup, experienced users can make an educated guess about whether 
or not performance will be good, but cannot say with absolute certainty 
what you're going to run into.


Let's start with some definitions, which you may or may not already know:

https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
https://en.wikipedia.org/wiki/Cardinality

You haven't said how many unique values are in your field.  The only 
information I have from you is 60K results from your queries, which may 
or may not have any bearing on the total number of documents in your 
index, or the total number of unique values in the field you're using 
for faceting.  So the next paragraph may or may not apply to your index.


In general, 60,000 unique values in a field would be considered very low 
cardinality, because computers can typically operate on 60,000 values 
*very* quickly, unless the size of each value is enormous.  But if the 
index has 60,000 total documents, then *in relation to other data*, the 
cardinality is very high, even though most people would say the 
opposite.  Sixty thousand documents or unique values is almost always a 
very small index, not prone to performance issues.


The warnings about cardinality in the Solr documentation mostly refer to 
*absolute* cardinality -- how many unique values there are in a field, 
regardless of the actual number of documents.  If there are millions or 
billions of unique values, then operations like facets, grouping, 
sorting, etc are probably going to be slow.  If there are a lot less, 
such as thousands or only a handful, then those operations are likely to 
be very fast, because the computer will have less information it must 
process.


Thanks,
Shawn



Re: Auto-Suggestions are not propagating to Solr Cluster Nodes

2018-02-20 Thread Kalahasthi Satyanarayana
FYI

Thanks
Kalahasthi Satyanarayana
Mobile : 08884581161

From: Kalahasthi Satyanarayana
Sent: Tuesday, February 20, 2018 11:57 AM
To: 'solr-user@lucene.apache.org
Cc: Deepak Udapudi; Venkata MR; v...@delta.org; Nareshkumar P; Nareshkumar P; 
Soma Das; Soma Das
Subject: Auto-Suggestions are not propagating to Solr Cluster Nodes


Hi All,

Problem: not able to build suggest data on all Solr cluster nodes.

We configured three Solr nodes using an external ZooKeeper ensemble, and configured
the requestHandler for auto-suggestion as below:



[requestHandler/searchComponent XML stripped by the mailing-list archive; the
surviving values, in order, were:]
true
5
Name
suggest
Name
name
name
AnalyzingInfixLookupFactory
name_suggester_infix_dir
DocumentDictionaryFactory
key
lowercase
name_suggestor_dictionary
string



When we manually issue a request with suggest.build=true to one of the nodes to
build the suggest data, the suggest data is built on that particular node only; the
other nodes of the cluster do not build it.
Is there a configuration mismatch?

Thanks
Kalahasthi Satyanarayana
Mobile : 08884581161



Re: storing large text fields in a database? (instead of inside index)

2018-02-20 Thread David Hastings
Really depends on what you consider too large, and why the size is a big
issue, since most replication will go at about 100 MB/second give or take,
and replicating a 300GB index is only an hour or two.  What I do for this
purpose is store my text in a separate index altogether, and call on that
core for highlighting.  So for my use case, the primary index with no
stored text is around 300GB and replicates as needed, and the full-text
indexes with stored text total around 500GB and are replicating non-stop.
All searching goes against the primary index, and for highlighting I call
on the full-text indexes that have a stupid simple schema.  This has worked
for me pretty well at least.
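
To make that concrete, here's a rough SolrJ sketch of the pattern; the core names,
the id field, the query fields, and the stored-text field ("fulltext") are
placeholders, not my actual setup:

```
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TwoCoreHighlightExample {
    public static void main(String[] args) throws Exception {
        // Primary core: no stored text, used for all searching.
        HttpSolrClient primary =
                new HttpSolrClient.Builder("http://localhost:8983/solr/primary").build();
        // Full-text core: stupid simple schema with stored text, used only for highlighting.
        HttpSolrClient fulltext =
                new HttpSolrClient.Builder("http://localhost:8983/solr/fulltext").build();

        // 1) Search the primary index and collect the matching ids.
        SolrQuery search = new SolrQuery("title:solr");
        search.setFields("id");
        search.setRows(10);
        QueryResponse found = primary.query(search);

        // 2) Ask the full-text core for highlights on just those ids
        //    (guard for the empty-result case omitted for brevity).
        StringBuilder idFilter = new StringBuilder("id:(");
        for (SolrDocument doc : found.getResults()) {
            idFilter.append('"').append(doc.getFieldValue("id")).append("\" ");
        }
        idFilter.append(')');

        SolrQuery hl = new SolrQuery("fulltext:solr");
        hl.addFilterQuery(idFilter.toString());
        hl.setHighlight(true);
        hl.set("hl.fl", "fulltext");
        QueryResponse highlights = fulltext.query(hl);
        System.out.println(highlights.getHighlighting());

        primary.close();
        fulltext.close();
    }
}
```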

On Tue, Feb 20, 2018 at 10:27 AM, Roman Chyla  wrote:

> Hello,
>
> We have a use case of a very large index (slave-master; for unrelated
> reasons the search cannot work in the cloud mode) - one of the fields is a
> very large text, stored mostly for highlighting. To cut down the index size
> (for purposes of replication/scaling) I thought I could try to save it in a
> database - and not in the index.
>
> Lucene has codecs - one of the methods is for 'stored field', so that seems
> likes a natural path for me.
>
> However, I'd expect somebody else before had a similar problem. I googled
> and couldn't find any solutions. Using the codecs seems really good thing
> for this particular problem, am I missing something? Is there a better way
> to cut down on index size? (besides solr cloud/sharding, compression)
>
> Thank you,
>
>Roman
>


storing large text fields in a database? (instead of inside index)

2018-02-20 Thread Roman Chyla
Hello,

We have a use case of a very large index (slave-master; for unrelated
reasons the search cannot work in the cloud mode) - one of the fields is a
very large text, stored mostly for highlighting. To cut down the index size
(for purposes of replication/scaling) I thought I could try to save it in a
database - and not in the index.

Lucene has codecs - one of the methods is for 'stored fields', so that seems
like a natural path for me.

However, I'd expect somebody else to have had a similar problem before. I googled
and couldn't find any solutions. Using the codecs seems like a really good fit
for this particular problem; am I missing something? Is there a better way
to cut down on index size (besides SolrCloud/sharding and compression)?

Thank you,

   Roman


Save the date: ApacheCon North America, September 24-27 in Montréal

2018-02-20 Thread Rich Bowen

Dear Apache Enthusiast,

(You’re receiving this message because you’re subscribed to a user@ or 
dev@ list of one or more Apache Software Foundation projects.)


We’re pleased to announce the upcoming ApacheCon [1] in Montréal, 
September 24-27. This event is all about you — the Apache project community.


We’ll have four tracks of technical content this time, as well as lots 
of opportunities to connect with your project community, hack on the 
code, and learn about other related (and unrelated!) projects across the 
foundation.


The Call For Papers (CFP) [2] and registration are now open. Register 
early to take advantage of the early bird prices and secure your place 
at the event hotel.


Important dates
March 30: CFP closes
April 20: CFP notifications sent
	August 24: Hotel room block closes (please do not wait until the last 
minute)


Follow @ApacheCon on Twitter to be the first to hear announcements about 
keynotes, the schedule, evening events, and everything you can expect to 
see at the event.


See you in Montréal!

Sincerely, Rich Bowen, V.P. Events,
on behalf of the entire ApacheCon team

[1] http://www.apachecon.com/acna18
[2] https://cfp.apachecon.com/conference.html?apachecon-north-america-2018


What is “high cardinality” in facet streams?

2018-02-20 Thread Alfonso Muñoz-Pomer Fuentes
Hi,

We have a query that we can resolve using either facet or search with rollup. 
In the Stream Source Reference section of Solr’s Reference Guide 
(https://lucene.apache.org/solr/guide/7_1/stream-source-reference.html#facet) 
it says “To support high cardinality aggregations see the rollup function”. I 
was wondering what is considered “high cardinality”. If it helps, our query 
returns up to 60k results. I haven’t got to do any benchmarking to see if 
there’s any difference, though, because facet so far performs very well, but I 
don’t know if I’m near the “tipping point”. Any feedback would be appreciated.

Many thanks in advance.

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer



Sitecore Analytics Index

2018-02-20 Thread rojerick luna
Hi,

For those who have a Sitecore website app with multiple sites (but only one Sitecore
code base): have you separated the index for each site? How were you able to manage
it? Also, do you have archiving, since the analytics data keeps growing?

Thanks

Best Regards,
Jeck


Re: Need help with match contains query in SOLR

2018-02-20 Thread Alessandro Benedetti
It was not clear at the beginning, but if I understood correctly you could:

*Index Time Analysis*
Use whatever charFilter you need, the keyword tokenizer [1], and then the token
filters you like (such as the lowercase filter, synonyms, etc.).

*Query Time Analysis*
Use a tokenizer you like (one that actually tokenizes, so not the keyword
tokenizer), the Shingle token filter [2], and whatever additional filters you need.
This should do the trick.
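
In Solr this would go into the field type definition in the schema, but just to
illustrate the two chains, here is a rough Lucene CustomAnalyzer sketch (the
shingle parameters and sample text are only examples, not a recommendation):

```
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisChainsSketch {
    public static void main(String[] args) throws Exception {
        // Index-time chain: keyword tokenizer keeps the whole value as one token.
        Analyzer indexTime = CustomAnalyzer.builder()
                .withTokenizer("keyword")
                .addTokenFilter("lowercase")
                .build();

        // Query-time chain: a real tokenizer plus shingles, so multi-word queries
        // produce "joined" tokens that can match the single index-time token.
        Analyzer queryTime = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("shingle",
                        "minShingleSize", "2",
                        "maxShingleSize", "3",
                        "outputUnigrams", "true")
                .build();

        print(indexTime, "red leather shoes");
        print(queryTime, "red leather shoes");
    }

    private static void print(Analyzer analyzer, String text) throws Exception {
        // Dump the tokens each chain produces for the same input.
        try (TokenStream ts = analyzer.tokenStream("f", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}
```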

Cheers

[1]
https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-KeywordTokenizer
[2]
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: One of three cores is missing userData and lastModified fields from /admin/cores

2018-02-20 Thread www.vivek.sharma
Were you able to get a solution to this issue?


Aaron Daubman wrote
> On a Solr server running 4.10.2 with three cores, two return the expected
> info from /solr/admin/cores?wt=json but the third is missing userData and
> lastModified.
> 
> The first (artists) and third (tracks) cores from the linked screenshot are
> the ones I care about. *Unfortunately, the third (tracks) is the one missing
> lastModified.*
> 
> As far as I can see, that comes from:
> https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_2/solr/core/src/java/org/apache/solr/handler/admin/LukeRequestHandler.java#L568
> 
> I can't trace back to see what would possible cause getUserData() to
> return
> an empty Object, but that appears to be what is happening?





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Facet performance problem

2018-02-20 Thread Shawn Heisey

On 2/20/2018 1:18 AM, LOPEZ-CORTES Mariano-ext wrote:

We return a facet list of values in "motifPresence" field (person status).
Status:
[ ] status1
[x] status2
[x] status3

The user then selects 1 or multiple status (It's this step that we called "facet 
filtering").

Query is then re-executed with fq=motifPresence:(status2 OR status3)

We use fq in order to not alter the score in main query.

We've read that docValues=true is recommended for facet fields.

Do we also need indexed=true?


Facets, grouping, and sorting are more efficient with docValues, but 
searches aren't helped by docValues.  Without indexed="true", searches 
on the field will be VERY slow.  A filter query is still a search.  The 
"filter" in filter query just refers to the fact that it's separate from 
the main query, and that it does not affect relevancy scoring.


Thanks,
Shawn



Re: ZK session times out intermittently

2018-02-20 Thread Shawn Heisey

On 2/19/2018 3:33 PM, Roy Lim wrote:

6 x Solr (3 primary shard, 3 secondary)
3 x ZK

The client is indexing over 16 million documents using 8 threads.  Auto-soft
commit is 3 minutes, auto-commit is 10 minutes.


I would probably reduce the autoCommit time to 1 minute, as long as 
openSearcher is set to false, which is the recommended setting.  This is 
not necessary, but it would probably reduce the size of your transaction 
logs, which will make Solr restarts faster.



The following timeout is observed in our client log, intermittently:


There is no information here.  I checked Nabble as well, because 
sometimes when they replicate to the mailing list, there is information 
on their forum that does not show up on the mailing list.  In this case, 
Nabble didn't have any information either. If you can't get the data to 
stay in the message, you may need to use a paste website and provide a URL.



Thinking that this is a case where ZK could no longer establish connection
to Solr node it is communicating with, I went to the primary nodes and
correlated the timestamps.  They all are very similar to below:


Again, there is nothing here for us to examine.

BTW, ZK does not connect to Solr.  Solr connects to ZK.  It's possible 
that you're already aware of this, but because of the way you phrased 
your comment, I cannot tell for sure.



Note the time gap of over 1 minute, which I can only surmise that ZK is
waiting this whole time for Solr to return, only to timeout.  Is that
reasonable?  Thing is I have no idea what is happening in during that time
and why Solr is taking so long.  Note the second statement signaling the
start of the soft commit, so I don't think this is a case of a long commit.

Finally, checking the GC logs, there are no long pauses either!

Hoping an expert can shed some light here.


Because we can't actually see the information you've referenced, which I 
assume are excerpts from logfiles, it's difficult to make any kind of 
recommendation, or even make a guess.


We'll need to see your solr logfile, and maybe your ZK logfile. 
Hopefully there are ERROR logs that we can attempt to decipher, but 
you'll want the logging to be at the default level of INFO, so we can 
see the errors in context.  If Solr and ZK are on separate servers, 
you'll want to make sure that there is good time synchronization, so 
that timestamps in different logs are in sync with each other.


How have you determined that the GC log does not have long pauses?  Can 
you share a GC log that includes the timeframe where the problem happened?


Thanks,
Shawn



RE: Facet performance problem

2018-02-20 Thread LOPEZ-CORTES Mariano-ext
Our query looks like this:

...&facet=true&facet.field=motifPresence

We return a facet list of values in "motifPresence" field (person status).
Status:
[ ] status1
[x] status2
[x] status3

The user then selects 1 or multiple status (It's this step that we called 
"facet filtering").

Query is then re-executed with fq=motifPresence:(status2 OR status3)

We use fq in order to not alter the score in main query.

We've read that docValues=true is recommended for facet fields.

Do we also need indexed=true?
Is there any other problem in our solution?

-Message d'origine-
De : Erick Erickson [mailto:erickerick...@gmail.com] 
Envoyé : lundi 19 février 2018 18:18
À : solr-user
Objet : Re: Facet performance problem

I'm confused here. What do you mean by "facet filtering"? Your examples have no 
facets at all, just a _filter query_.

I'll assume you want to use filter query (fq), and faceting has nothing to do 
with it. This is one of the tricky bits of docValues.
While it's _possible_ to search on a field that's defined as above, it's very 
inefficient since there's no "inverted index" for the field; you specified 
'indexed="false"'. So the docValues are searched, and it's essentially a table 
scan.

If you mean to search against this field, set indexed="true". You'll have to 
completely reindex your corpus of course.

If you intend to facet, group or sort on this field, you should _also_ have 
docValues="true".

Best,
Erick

On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext 
 wrote:
> Hi
>
> We have following environement :
>
> 3 nodes cluster
> 1 shard
> Replication factor = 2
> 8GB per node
>
> 29 millions of documents
>
> We're faceting over the field "motifPresence", defined as follows:
>
> <field name="motifPresence" ... indexed="false" stored="true" required="false"/>
>
> Once the user selects motifPresence filter we executes search again with:
>
> fq: (value1 OR value2 OR value3 OR ...)
>
> The problem is: during facet filtering the query is too slow, and its response 
> time is greater than the main search (without facet filtering).
>
> Thanks in advance!