Re: How can I set the defaultOperator to be AND?

2016-04-26 Thread Bastien Latard - MDPI AG

Thank you Erick.
You're quite right that it can be expected behavior to get more docs 
with more words... why not...


However, when I set the default OP to "AND" in solrconfig.xml, then a 
simple query "q=a OR b" doesn't work as expected... as described in the 
previous email:
-> a search 'title:"test" OR author:"me"' returns documents 
matching 'title:"test" AND author:"me"'


Kind regards,
Bastien

On 27/04/2016 05:30, Erick Erickson wrote:
Defaulting to "OR" has been the behavior since forever, so changing 
the behavior now is just not going to happen. Making it fit a new 
version of "correct" will change the behavior for every application 
out there that has not specified the default behavior.


There's no a-priori reason to expect "more words to equal fewer docs", 
I can just as easily argue that "more words should return more docs". 
Which you expect depends on your mental model.


And providing the default op in your solrconfig.xml request handlers 
allows you to implement whatever model your application chooses...


Best,
Erick

On Mon, Apr 25, 2016 at 11:32 PM, Bastien Latard - MDPI AG 
> wrote:


Thank you Shawn, Jan and Georg for your answers.

Yes, it seems that if I simply remove the defaultOperator it works
well for "composed queries" like '(a:x AND b:y) OR c:z'.
But I think that the default Operator should/could be the AND.

Because when I add an extra search word, I expect that the results
get more accurate...
(It seems to be what google is also doing now)

Otherwise, if you make a search and apply another filter (e.g.:
sort by publication date, facets, ...), the user can get the least
relevant item (only 1 word in 4 matches) in first position only
because of its date...

What do you think?


Kind regards,
Bastien


On 25/04/2016 14:53, Shawn Heisey wrote:

On 4/25/2016 6:39 AM, Bastien Latard - MDPI AG wrote:

Remember:
If I add the following line to the schema.xml, even if I do a search
'title:"test" OR author:"me"', it returns documents matching
'title:"test" AND author:"me"':
<solrQueryParser defaultOperator="AND"/>

The settings in the schema for default field and default operator were
deprecated a long time ago.  I actually have no idea whether they are
even supported in newer Solr versions.

The q.op parameter controls the default operator, and the df parameter
controls the default field.  These can be set in the request handler
definition in solrconfig.xml -- usually in "defaults" but there might be
reason to put them in "invariants" instead.
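
For illustration, that typically looks something like this in solrconfig.xml (the
handler name and default field here are only examples):

   <requestHandler name="/select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="df">text</str>
       <str name="q.op">AND</str>
     </lst>
   </requestHandler>

(with edismax, the mm parameter would go in the same "defaults" block instead of
q.op, as discussed below)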

If you're using edismax, you'd be better off using the mm parameter
rather than the q.op parameter.  The behavior you have described above
sounds like a change in behavior (some call it a bug) introduced in the
5.5 version:

https://issues.apache.org/jira/browse/SOLR-8812

If you are using edismax, I suspect that if you set mm=100% instead of
q.op=AND (or the schema default operator) that the problem might go away
... but I am not sure.  Someone who is more familiar with SOLR-8812
probably should comment.

Thanks,
Shawn




Kind regards,
Bastien Latard
Web engineer
-- 
MDPI AG

Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail: lat...@mdpi.com
http://www.mdpi.com/




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/



Build Java Package for required schema and solrconfig files field and configuration.

2016-04-26 Thread Nitin Solanki
Hello Everyone,
 I have created an autosuggest using the Solr suggester.
I have added a field and field type in schema.xml and made some changes to the
/suggest request handler in solrconfig.xml.
Now, I need to build a Java package using that configuration, which I need
to plug into my current Java project. I don't want to use curl; I need my
configuration as a jar or Java package. How can I do this? I don't have much
experience with jar packaging. Any help please...

Thanks,
Nitin


Re: concat 2 fields

2016-04-26 Thread vrajesh
Hi Jack,
As per your explanation I made the following changes:

 
id 
title 
 
 
title 
title 
 
 
title 
_ 

 

 

i.e. trying to copy the value of Id into the Title field and then appending the actual
Title field to make an Id_Title combination.

But it is still not working. Please help me if it can be done this way.
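
For reference, one common way this kind of Id_Title value gets built is with an update
request processor chain rather than copyField alone. A rough sketch only (the chain name
and the destination field "id_title" are invented here, not taken from the earlier reply):

   <updateRequestProcessorChain name="concat-id-title">
     <processor class="solr.CloneFieldUpdateProcessorFactory">
       <str name="source">id</str>
       <str name="dest">id_title</str>
     </processor>
     <processor class="solr.CloneFieldUpdateProcessorFactory">
       <str name="source">title</str>
       <str name="dest">id_title</str>
     </processor>
     <processor class="solr.ConcatFieldUpdateProcessorFactory">
       <str name="fieldName">id_title</str>
       <str name="delimiter">_</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory"/>
   </updateRequestProcessorChain>

The chain still has to be referenced from the /update handler (update.chain), and the
id_title field has to exist in the schema.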



--
View this message in context: 
http://lucene.472066.n3.nabble.com/concat-2-fields-tp4271760p4273072.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Degraded performance between Solr 4 and Solr 5

2016-04-26 Thread Erick Erickson
Well, the first question is always "how are you measuring this"?
Measuring a few queries is almost completely uninformative,
especially if the two systems have differing warmups. The only
meaningful measurements are when throwing away the first bunch
of queries then measuring a meaningful sample.

The setup you describe will be very sensitive to disk access
with the autowarm of 1 second, so if there's much at all in
the way of differences in I/O that would be a red flag.

From here on down doesn't really respond to the question, but
I thought I'd mention it.

And you don't have to worry about disabling your filterCache since
any filter query of the form fq=field:[mention NOW in here without rounding] will
never be re-used. So you might as well use {!cache=false}. Here's the
background:

https://lucidworks.com/blog/2012/02/23/date-math-now-and-filter-queries/
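
To make that concrete (field1 is the field from the post below; the rest is illustrative):

   fq={!cache=false}field1:[* TO NOW-14DAYS]    (NOW changes every millisecond, so the
                                                 cache entry could never be re-used anyway)
   fq=field1:[* TO NOW/DAY-14DAYS]              (rounded to the day, so the same fq string
                                                 repeats all day and the filterCache can help)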

And your soft commit is probably throwing out all the filter caches anyway.

I doubt you're doing any autowarming at all given the autocommit interval
of 1 second and continuously updating documents and your reported
query times. So you can pretty much forget what I said about throwing
out your first N queries since you're (probably) not getting any benefit
out of caches anyway.

On Tue, Apr 26, 2016 at 10:34 AM, Jaroslaw Rozanski
 wrote:
> Hi all,
>
> I am migrating a large Solr Cloud cluster from Solr 4.10 to Solr 5.5.0
> and I observed big difference in query execution time.
>
> First a setup summary:
> - multiple collections - 6
> - each has multiple shards - 6
> - same/similar hardware
> - indexing tens of messages per second
> - autoSoftCommit with 1s; hard commit few tens of seconds
> - Java 8
>
> The query has following form: field1:[* TO NOW-14DAYS] OR (-field1:[* TO
> *] AND field2:[* TO NOW-14DAYS])
>
> The fields field1 & field2 are of date type:
>  positionIncrementGap="0"/>
>
> As query (q={!cache=false}...)
> Solr 4.10 -> 5s
> Solr 5.5.0 -> 12s
>
> As filter query (q={!cache=false}*:*&fq=...)
> Solr 4.10 -> 9s
> Solr 5.5.0 -> 11s
>
> The query itself is bad and its optimization aside, I am wondering if
> there is anything in Lucene/Solr that would have such an impact on query
> execution time between versions.
>
> Originally I thought it might be related to
> https://issues.apache.org/jira/browse/SOLR-8251 and testing on a small
> scale proved that there is a difference in performance. However, the upgraded
> version is already 5.5.0.
>
>
>
> Thanks,
> Jarek
>


Re: How can I set the defaultOperator to be AND?

2016-04-26 Thread Erick Erickson
Defaulting to "OR" has been the behavior since forever, so changing the
behavior now is just not going to happen. Making it fit a new version of
"correct" will change the behavior for every application out there that has
not specified the default behavior.

There's no a-priori reason to expect "more words to equal fewer docs", I
can just as easily argue that "more words should return more docs". Which
you expect depends on your mental model.

And providing the default op in your solrconfig.xml request handlers allows
you to implement whatever model your application chooses...

Best,
Erick

On Mon, Apr 25, 2016 at 11:32 PM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:

> Thank you Shawn, Jan and Georg for your answers.
>
> Yes, it seems that if I simply remove the defaultOperator it works well
> for "composed queries" like '(a:x AND b:y) OR c:z'.
> But I think that the default Operator should/could be the AND.
>
> Because when I add an extra search word, I expect that the results get
> more accurate...
> (It seems to be what google is also doing now)
>
>
> Otherwise, if you make a search and apply another filter (e.g.: sort by
> publication date, facets, ...), the user can get the least relevant item (only
> 1 word in 4 matches) in first position only because of its date...
>
> What do you think?
>
>
> Kind regards,
> Bastien
>
>
> On 25/04/2016 14:53, Shawn Heisey wrote:
>
> On 4/25/2016 6:39 AM, Bastien Latard - MDPI AG wrote:
>
> Remember:
> If I add the following line to the schema.xml, even if I do a search
> 'title:"test" OR author:"me"', it returns documents matching
> 'title:"test" AND author:"me"':
> <solrQueryParser defaultOperator="AND"/>
>
> The settings in the schema for default field and default operator were
> deprecated a long time ago.  I actually have no idea whether they are
> even supported in newer Solr versions.
>
> The q.op parameter controls the default operator, and the df parameter
> controls the default field.  These can be set in the request handler
> definition in solrconfig.xml -- usually in "defaults" but there might be
> reason to put them in "invariants" instead.
>
> If you're using edismax, you'd be better off using the mm parameter
> rather than the q.op parameter.  The behavior you have described above
> sounds like a change in behavior (some call it a bug) introduced in the
> 5.5 version:
> https://issues.apache.org/jira/browse/SOLR-8812
>
> If you are using edismax, I suspect that if you set mm=100% instead of
> q.op=AND (or the schema default operator) that the problem might go away
> ... but I am not sure.  Someone who is more familiar with SOLR-8812
> probably should comment.
>
> Thanks,
> Shawn
>
>
>
>
> Kind regards,
> Bastien Latard
> Web engineer
> --
> MDPI AG
> Postfach, CH-4005 Basel, Switzerland
> Office: Klybeckstrasse 64, CH-4057
> Tel. +41 61 683 77 35
> Fax: +41 61 302 89 18
> E-mail: latard@mdpi.com  http://www.mdpi.com/
>
>


Re: Replicas for same shard not in sync

2016-04-26 Thread Erick Erickson
You left out step 5... leader responds with fail for the update to the
client. At this point, the client is in charge of retrying the docs.
Retrying will update all the docs that were successfully indexed in
the failed packet, but that's not unusual.

There's no real rollback semantics that I know of. This is analogous
to not hitting minRF, see:
https://support.lucidworks.com/hc/en-us/articles/212834227-How-does-indexing-work-in-SolrCloud.
In particular the bit about "it is the client's responsibility to
re-send it"...

There's some retry logic in the code that distributes the updates from
the leader as well.

Best,
Erick

On Tue, Apr 26, 2016 at 12:51 PM, Jeff Wartes  wrote:
>
> At the risk of thread hijacking, this is an area I'm not sure I fully 
> understand, so I want to make sure.
>
> I understand the case where a node is marked “down” in the clusterstate, but 
> what if it’s down for less than the ZK heartbeat? That’s not unreasonable, 
> I’ve seen some recommendations for really high ZK timeouts. Let’s assume 
> there’s some big GC pause, or some other ephemeral service interruption that 
> recovers very quickly.
>
> So,
> 1. leader gets an update request
> 2. leader makes update requests to all live nodes
> 3. leader gets success responses from all but one replica
> 4. leader gets failure response from one replica
>
> At this point we have different replicas with different data sets. Does 
> anything signal that the failure-response node has now diverged? Does the 
> leader attempt to roll back the other replicas? I’ve seen references to 
> leader-initiated-recovery, is this that?
>
> And regardless, is the update request considered a success (and reported as 
> such to the client) by the leader?
>
>
>
> On 4/25/16, 12:14 PM, "Erick Erickson"  wrote:
>
>>Ted:
>>Yes, deleting and re-adding the replica will be fine.
>>
>>Having commits happen from the client when you _also_ have
>>autocommits that frequently (10 seconds and 1 second are pretty
>>aggressive BTW) is usually not recommended or necessary.
>>
>>David:
>>
>>bq: if one or more replicas are down, updates presented to the leader
>>still succeed, right?  If so, tedsolr is correct that the Solr client
>>app needs to re-issue update
>>
>>Absolutely not the case. When the replicas are down, they're marked as
>>down by Zookeeper. When then come back up they find the leader through
>>Zookeeper magic and ask, essentially "Did I miss any updates"? If the
>>replica did miss any updates it gets them from the leader either
>>through the leader replaying the updates from its transaction log to
>>the replica or by replicating the entire index from the leader. Which
>>path is followed is a function of how far behind the replica is.
>>
>>In this latter case, any updates that come in to the leader while the
>>replication is happening are buffered and replayed on top of the index
>>when the full replication finishes.
>>
>>The net-net here is that you should not have to track whether updates
>>got to all the replicas or not. One of the major advantages of
>>SolrCloud is to remove that worry from the indexing client...
>>
>>Best,
>>Erick
>>
>>On Mon, Apr 25, 2016 at 11:39 AM, David Smith
>> wrote:
>>> Erick,
>>>
>>> So that my understanding is correct, let me ask, if one or more replicas 
>>> are down, updates presented to the leader still succeed, right?  If so, 
>>> tedsolr is correct that the Solr client app needs to re-issue updates, if 
>>> it wants stronger guarantees on replica consistency than what Solr provides.
>>>
>>> The “Write Fault Tolerance” section of the Solr Wiki makes what I believe 
>>> is the same point:
>>>
>>> "On the client side, if the achieved replication factor is less than the 
>>> acceptable level, then the client application can take additional measures 
>>> to handle the degraded state. For instance, a client application may want 
>>> to keep a log of which update requests were sent while the state of the 
>>> collection was degraded and then resend the updates once the problem has 
>>> been resolved."
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>>>
>>>
>>> Kind Regards,
>>>
>>> David
>>>
>>>
>>>
>>>
>>> On 4/25/16, 11:57 AM, "Erick Erickson"  wrote:
>>>
bq: I also read that it's up to the
client to keep track of updates in case commits don't happen on all the
replicas.

This is not true. Or if it is it's a bug.

The update cycle is this:
1> updates get to the leader
2> updates are sent to all followers and indexed on the leader as well
3> each replica writes the updates to the local transaction log
4> all the replicas ack back to the leader
5> the leader responds to the client.

At this point, all the replicas for the shard have the docs locally
and can take over as leader.

You may be confusing 

Re: 'batching when indexing is good' -> some questions

2016-04-26 Thread Erick Erickson
These are orthogonal. Confusing I know...

In my blog, "batch size" refers to the number of documents
sent to _Solr_ when you're indexing, in this case from a
SolrJ program but the results generally hold for HTTP requests.

The "batch size" you're seeing in DIH is the batch size for
getting records from the SQL database.

Completely different things...

And assuming you're using a Java program, you really have to
look at your individual JDBC APIs. The batchsize of -1 for DIH
is particular to MySQL, other DBs perform fine with a fixed
size batch. I'd recommend the forums for your particular JDBC to
figure out what you want to do with this setting.
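
To tie the two settings together, here is a minimal, illustrative SolrJ/JDBC sketch
(the URL, table and field names are invented, and this is not taken from either blog post):

   import java.sql.*;
   import java.util.ArrayList;
   import java.util.List;
   import org.apache.solr.client.solrj.SolrClient;
   import org.apache.solr.client.solrj.impl.HttpSolrClient;
   import org.apache.solr.common.SolrInputDocument;

   public class BatchIndexer {
     public static void main(String[] args) throws Exception {
       SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection"); // SolrJ 5.x style
       Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
       // MySQL only streams rows (instead of buffering the whole result set) with this
       // exact combination -- this is the Integer.MIN_VALUE that DIH's batchSize="-1" maps to.
       Statement st = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
       st.setFetchSize(Integer.MIN_VALUE);

       List<SolrInputDocument> batch = new ArrayList<>();
       try (ResultSet rs = st.executeQuery("SELECT id, title FROM articles")) {
         while (rs.next()) {
           SolrInputDocument doc = new SolrInputDocument();
           doc.addField("id", rs.getString("id"));
           doc.addField("title", rs.getString("title"));
           batch.add(doc);
           if (batch.size() >= 500) {   // the "packet size" sent to Solr per request
             solr.add(batch);
             batch.clear();
           }
         }
       }
       if (!batch.isEmpty()) solr.add(batch);
       solr.commit();
       solr.close();
       conn.close();
     }
   }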

Best,
Erick

On Tue, Apr 26, 2016 at 5:36 AM, Bastien Latard - MDPI AG
 wrote:
> Hi Eric (Erickson) & others,
>
> I read your post 'batching when indexing is good
> '.
> But I also read this one
> , which recommends
> using batchSize="-1".
>
> So I have now some questions:
> - when you speak about 'Packet Size', are you speaking about batchSize?
> - where can I define the Integer.MIN_VALUE used by the setFetchSize() on the
> JDBC connection? (I use MySQL JDBC)
>
> Kind regards,
> Bastien


Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-26 Thread Erick Erickson
One of the reasons this happens is if you have very
long GC cycles, longer than the Zookeeper "keep alive"
timeout. During a full GC pause, Solr is unresponsive and
if the ZK ping times out, ZK assumes the machine is
gone and you get into this recovery state.

So I'd collect GC logs and see if you have any
stop-the-world GC pauses that take longer than the ZK
timeout.

see Mark Millers primer on GC here:
https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
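
For Java 8, a typical set of flags for capturing that information looks something like
the following (log path is arbitrary):

   -Xloggc:/var/log/solr/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
   -XX:+PrintGCApplicationStoppedTime

then look for "application stopped" times longer than your zkClientTimeout.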

Best,
Erick

On Tue, Apr 26, 2016 at 2:13 PM, Li Ding  wrote:
> Thank you all for your help!
>
> The zookeeper log rolled over, this is from Solr.log:
>
> Looks like the solr and zk connection is gone for some reason
>
> INFO  - 2016-04-21 12:37:57.536;
> org.apache.solr.common.cloud.ConnectionManager; Watcher
> org.apache.solr.common.cloud.ConnectionManager@19789a96
> name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent
> state:Disconnected type:None path:null path:null type:None
>
> INFO  - 2016-04-21 12:37:57.536;
> org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
>
> INFO  - 2016-04-21 12:38:24.248;
> org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired
> - starting a new one...
>
> INFO  - 2016-04-21 12:38:24.262;
> org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
> connect to ZooKeeper
>
> INFO  - 2016-04-21 12:38:24.269;
> org.apache.solr.common.cloud.ConnectionManager; Connected:true
>
>
> Then it publishes all cores on the hosts are down.  I just list three cores
> here:
>
> INFO  - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController;
> publishing core=product1_shard1_replica1 state=down
>
> INFO  - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController;
> publishing core=collection1 state=down
>
> INFO  - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController;
> numShards not found on descriptor - reading it from system property
>
> INFO  - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController;
> publishing core=product2_shard5_replica1 state=down
>
> INFO  - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController;
> publishing core=product2_shard13_replica1 state=down
>
>
> product1 has only one shard one replica and it's able to be active
> successfully:
>
> INFO  - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController;
> Register replica - core:product1_shard1_replica1 address:http://
> {internalIp}:8983/solr collection:product1 shard:shard1
>
> WARN  - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext;
> cancelElection did not find election node to remove
>
> INFO  - 2016-04-21 12:38:26.393;
> org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
> process for shard shard1
>
> INFO  - 2016-04-21 12:38:26.399;
> org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to
> continue.
>
> INFO  - 2016-04-21 12:38:26.399;
> org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader -
> try and sync
>
> INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
> replicas to http://{internalIp}:8983/solr/product1_shard1_replica1/
>
> INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
> Success - now sync replicas to me
>
> INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy;
> http://{internalIp}:8983/solr/product1_shard1_replica1/
> has no replicas
>
> INFO  - 2016-04-21 12:38:26.399;
> org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader:
> http://{internalIp}:8983/solr/product1_shard1_replica1/ shard1
>
> INFO  - 2016-04-21 12:38:26.399; org.apache.solr.common.cloud.SolrZkClient;
> makePath: /collections/product1/leaders/shard1
>
> INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; We are
> http://{internalIp}:8983/solr/product1_shard1_replica1/ and leader is
> http://{internalIp}:8983/solr/product1_shard1_replica1/
>
> INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; No
> LogReplay needed for core=product1_replica1 baseURL=http://
> {internalIp}:8983/solr
>
> INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; I am
> the leader, no recovery necessary
>
> INFO  - 2016-04-21 12:38:26.413; org.apache.solr.cloud.ZkController;
> publishing core=product1_shard1_replica1 state=active
>
>
> product2 has 15 shards one replica but only two shards lived on this
> machine, this is one of the failed shard that I never seen the message of
> the core product2_shard5_replica1 active:
>
> INFO  - 2016-04-21 12:38:26.616; org.apache.solr.cloud.ZkController;
> Register replica - product2_shard5_replica1 address:http://
> {internalIp}:8983/solr collection:product2 shard:shard5
>
> WARN  - 2016-04-21 12:38:26.618; org.apache.solr.cloud.ElectionContext;
> cancelElection did not find election node to remove
>
> INFO  - 2016-04-21 12:38:26.625;
> org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
> 

Re: Tuning solr for large index with rapid writes

2016-04-26 Thread Erick Erickson
If I'm reading this right, you have 420M docs on a single shard? If that's true
you are pushing the envelope of what I've seen work and be performant. Your
OOM errors are the proverbial 'smoking gun' that you're putting too many docs
on too few nodes.

You say that the document count is "growing quite rapidly". My expectation is
that your problems will only get worse as you cram more docs into your shard.

You're correct that adding more memory (and consequently more JVM
memory?) only gets you so far before you start running into GC trouble,
when you hit full GC pauses they'll get longer and longer which is its own
problem. And you don't want to have huge JVM memory at the expense
of op system memory due to the fact that Lucene uses MMapDirectory, see
Uwe's excellent blog:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

I'd _strongly_ recommend you do "the sizing exercise". There are lots of
details here:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

You've already done some of this inadvertently, unfortunately it sounds like
it's in production. If I were going to guess, I'd say the maximum number of
docs on any shard should be less than half what you currently have. So you
need to figure out how many docs you expect to host in this collection
eventually
and have N/200M shards. At least.
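
To put rough numbers on that (purely illustrative): the collection is 420M docs today, so
with a ceiling of about 200M docs per shard, growth to, say, 1.2B docs would mean planning
for at least 1.2B / 200M = 6 shards rather than the single shard you have now.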

There are various strategies when the answer is "I don't know", you
might add new
collections when you max out and then use "collection aliasing" to
query them etc.

Best,
Erick

On Tue, Apr 26, 2016 at 3:49 PM, Stephen Lewis  wrote:
> Hello,
>
> I'm looking for some guidance on the best steps for tuning a solr cloud
> cluster which is heavy on writes. We are currently running a solr cloud
> fleet composed of one core, one shard, and three nodes. The cloud is hosted
> in AWS, and each solr node is on its own linux r3.2xl instance with 8 cpu
> and 61 GiB mem, and a 2TB EBS volume attached. Our index is currently 550
> GiB over 420M documents, and growing quite rapidly. We are currently doing
> a bit more than 1000 document writes/deletes per second.
>
> Recently, we've hit some trouble with our production cloud. We have had the
> process on individual instances die a few times, and we see the following
> error messages being logged (expanded logs at the bottom of the email):
>
> ERROR - 2016-04-26 00:56:43.873; org.apache.solr.common.SolrException;
> null:org.eclipse.jetty.io.EofException
>
> WARN  - 2016-04-26 00:55:29.571; org.eclipse.jetty.servlet.ServletHandler;
> /solr/panopto/select
> java.lang.IllegalStateException: Committed
>
> WARN  - 2016-04-26 00:55:29.571; org.eclipse.jetty.server.Response;
> Committed before 500 {trace=org.eclipse.jetty.io.EofException
>
>
> Another time we saw this happen, we had java OOM errors (expanded logs at
> the bottom):
>
> WARN  - 2016-04-25 22:58:43.943; org.eclipse.jetty.servlet.ServletHandler;
> Error for /solr/panopto/select
> java.lang.OutOfMemoryError: Java heap space
> ERROR - 2016-04-25 22:58:43.945; org.apache.solr.common.SolrException;
> null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
> ...
> Caused by: java.lang.OutOfMemoryError: Java heap space
>
>
> When the cloud goes into recovery during live indexing, it takes about 4-6
> hours for a node to recover, but when we turn off indexing, recovery only
> takes about 90 minutes.
>
> Moreover, we see that deletes are extremely slow. We do batch deletes of
> about 300 documents based on two value filters, and this takes about one
> minute:
>
> Research online suggests that a larger disk cache
>  could be helpful,
> but I also see from an older page
>  on tuning for
> Lucene that turning down the swappiness on our Linux instances may be
> preferred to simply increasing space for the disk cache.
>
> Moreover, to scale in the past, we've simply rolled our cluster while
> increasing the memory on the new machines, but I wonder if we're hitting
> the limit for how much we should scale vertically. My impression is that
> sharding will allow us to warm searchers faster and maintain a more
> effective cache as we scale. Will we really be helped by sharding, or is it
> only a matter of total CPU/Memory in the cluster?
>
> Thanks!
>
> Stephen
>
> (206)753-9320
> stephen-lewis.net
>
> Logs:
>
> ERROR - 2016-04-26 00:56:43.873; org.apache.solr.common.SolrException;
> null:org.eclipse.jetty.io.EofException
> at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:142)
> at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
> at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
> at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
> at 

Re: Child doc facet not getting terms, only counts

2016-04-26 Thread Yangrui Guo
I've finally solved this problem. It appears that I do not need to add the
line domain: blockChildren: content_type:c in the subfacet. Now I've got my
desired results.
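
For reference, the working request should then look roughly like the original one with that
inner domain clause dropped (reconstructed from the earlier post, with the subfacet named
"madein" as it appears in the results; not copied from the actual query that worked):

   json.facet={
     apparels: {
       type: terms,
       field: brand,
       domain: { blockChildren: "content_type:p" },
       facet: {
         values: {
           type: query,
           q: "brand:Chanel",
           facet: {
             madein: { type: terms, field: madein }
           }
         }
       }
     }
   }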

On Tue, Apr 26, 2016 at 3:14 PM, Yangrui Guo  wrote:

> The documents are organized in a key-value like structure
>
> {
>  id: 1
>  product_name: some apparel
>  category: apparel
>  {
>   attribute: brand
>   value: Chanel
>   }
>   {
>attribute: madein
>value: Europe
>}
> }
>
> Because there are indefinite numbers of attributes associated with the
> products, I used this structure to store the document. My intention is to
> show facets of the value when an attribute facet is chosen. For example, if
> you choose "brand" then it'll show "Chanel", "Dior", etc. Is this currently
> possible?
>
> Yangrui Guo
>
>
> On Tuesday, April 26, 2016, Yonik Seeley  wrote:
>
>> How are the documents indexed?  Can you show an example document (with
>> nested documents)?
>> -Yonik
>>
>>
>> On Tue, Apr 26, 2016 at 5:08 PM, Yangrui Guo 
>> wrote:
>> >  When I use subfaceting with Json API, the facet results only gave me
>> > counts, no terms. My query is like this:
>> >
>> > {
>> > apparels : {
>> > type: terms,
>> > field: brand,
>> > facet:{
>> >   values:{
>> >   type: query,
>> >   q:\"brand:Chanel\",
>> >   facet: {
>> > type: terms,
>> > field: madein
>> >   }
>> >   domain: { blockChildren : \"content_type:p\" }
>> >   }
>> > },
>> > domain: { blockChildren : \"content_type:p\" }
>> > }
>> > }
>> > }
>> >
>> > And this is the results that I got:
>> >
>> > facets={
>> > count=57477,
>> > apparels={
>> > buckets=
>> > {
>> > val=Chanel,
>> > count=6,
>> > madein={
>> > count=6
>> >
>> > buckets={}
>> > }
>> > }
>> > }
>> > }
>> >
>> > The second buckets got zero results but the count was correct. What was
>> I
>> > missing? Thanks so much!
>>
>


Tuning solr for large index with rapid writes

2016-04-26 Thread Stephen Lewis
Hello,

I'm looking for some guidance on the best steps for tuning a solr cloud
cluster which is heavy on writes. We are currently running a solr cloud
fleet composed of one core, one shard, and three nodes. The cloud is hosted
in AWS, and each solr node is on its own linux r3.2xl instance with 8 cpu
and 61 GiB mem, and a 2TB EBS volume attached. Our index is currently 550
GiB over 420M documents, and growing quite rapidly. We are currently doing
a bit more than 1000 document writes/deletes per second.

Recently, we've hit some trouble with our production cloud. We have had the
process on individual instances die a few times, and we see the following
error messages being logged (expanded logs at the bottom of the email):

ERROR - 2016-04-26 00:56:43.873; org.apache.solr.common.SolrException;
null:org.eclipse.jetty.io.EofException

WARN  - 2016-04-26 00:55:29.571; org.eclipse.jetty.servlet.ServletHandler;
/solr/panopto/select
java.lang.IllegalStateException: Committed

WARN  - 2016-04-26 00:55:29.571; org.eclipse.jetty.server.Response;
Committed before 500 {trace=org.eclipse.jetty.io.EofException


Another time we saw this happen, we had java OOM errors (expanded logs at
the bottom):

WARN  - 2016-04-25 22:58:43.943; org.eclipse.jetty.servlet.ServletHandler;
Error for /solr/panopto/select
java.lang.OutOfMemoryError: Java heap space
ERROR - 2016-04-25 22:58:43.945; org.apache.solr.common.SolrException;
null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
...
Caused by: java.lang.OutOfMemoryError: Java heap space


When the cloud goes into recovery during live indexing, it takes about 4-6
hours for a node to recover, but when we turn off indexing, recovery only
takes about 90 minutes.

Moreover, we see that deletes are extremely slow. We do batch deletes of
about 300 documents based on two value filters, and this takes about one
minute:

Research online suggests that a larger disk cache
 could be helpful,
but I also see from an older page
 on tuning for
Lucene that turning down the swappiness on our Linux instances may be
preferred to simply increasing space for the disk cache.

Moreover, to scale in the past, we've simply rolled our cluster while
increasing the memory on the new machines, but I wonder if we're hitting
the limit for how much we should scale vertically. My impression is that
sharding will allow us to warm searchers faster and maintain a more
effective cache as we scale. Will we really be helped by sharding, or is it
only a matter of total CPU/Memory in the cluster?

Thanks!

Stephen

(206)753-9320
stephen-lewis.net

Logs:

ERROR - 2016-04-26 00:56:43.873; org.apache.solr.common.SolrException;
null:org.eclipse.jetty.io.EofException
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:142)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
at org.apache.solr.util.FastWriter.flush(FastWriter.java:141)
at org.apache.solr.util.FastWriter.flushBuffer(FastWriter.java:155)
at
org.apache.solr.response.TextResponseWriter.close(TextResponseWriter.java:83)
at
org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:42)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:765)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at

Re: Child doc facet not getting terms, only counts

2016-04-26 Thread Yangrui Guo
The documents are organized in a key-value like structure

{
 id: 1
 product_name: some apparel
 category: apparel
 {
  attribute: brand
  value: Chanel
  }
  {
   attribute: madein
   value: Europe
   }
}

Because there are indefinite numbers of attributes associated with the
products, I used this structure to store the document. My intention is to
show facets of the value when an attribute facet is chosen. For example, if
you choose "brand" then it'll show "Chanel", "Dior", etc. Is this currently
possible?

Yangrui Guo


On Tuesday, April 26, 2016, Yonik Seeley  wrote:

> How are the documents indexed?  Can you show an example document (with
> nested documents)?
> -Yonik
>
>
> On Tue, Apr 26, 2016 at 5:08 PM, Yangrui Guo  > wrote:
> >  When I use subfaceting with Json API, the facet results only gave me
> > counts, no terms. My query is like this:
> >
> > {
> > apparels : {
> > type: terms,
> > field: brand,
> > facet:{
> >   values:{
> >   type: query,
> >   q:\"brand:Chanel\",
> >   facet: {
> > type: terms,
> > field: madein
> >   }
> >   domain: { blockChildren : \"content_type:p\" }
> >   }
> > },
> > domain: { blockChildren : \"content_type:p\" }
> > }
> > }
> > }
> >
> > And this is the results that I got:
> >
> > facets={
> > count=57477,
> > apparels={
> > buckets=
> > {
> > val=Chanel,
> > count=6,
> > madein={
> > count=6
> >
> > buckets={}
> > }
> > }
> > }
> > }
> >
> > The second buckets got zero results but the count was correct. What was I
> > missing? Thanks so much!
>


Re: Child doc facet not getting terms, only counts

2016-04-26 Thread Yonik Seeley
How are the documents indexed?  Can you show an example document (with
nested documents)?
-Yonik


On Tue, Apr 26, 2016 at 5:08 PM, Yangrui Guo  wrote:
>  When I use subfaceting with Json API, the facet results only gave me
> counts, no terms. My query is like this:
>
> {
> apparels : {
> type: terms,
> field: brand,
> facet:{
>   values:{
>   type: query,
>   q:\"brand:Chanel\",
>   facet: {
> type: terms,
> field: madein
>   }
>   domain: { blockChildren : \"content_type:p\" }
>   }
> },
> domain: { blockChildren : \"content_type:p\" }
> }
> }
> }
>
> And this is the results that I got:
>
> facets={
> count=57477,
> apparels={
> buckets=
> {
> val=Chanel,
> count=6,
> madein={
> count=6
>
> buckets={}
> }
> }
> }
> }
>
> The second buckets got zero results but the count was correct. What was I
> missing? Thanks so much!


Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-26 Thread Li Ding
Thank you all for your help!

The zookeeper log rolled over, this is from Solr.log:

Looks like the solr and zk connection is gone for some reason

INFO  - 2016-04-21 12:37:57.536;
org.apache.solr.common.cloud.ConnectionManager; Watcher
org.apache.solr.common.cloud.ConnectionManager@19789a96
name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent
state:Disconnected type:None path:null path:null type:None

INFO  - 2016-04-21 12:37:57.536;
org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected

INFO  - 2016-04-21 12:38:24.248;
org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired
- starting a new one...

INFO  - 2016-04-21 12:38:24.262;
org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
connect to ZooKeeper

INFO  - 2016-04-21 12:38:24.269;
org.apache.solr.common.cloud.ConnectionManager; Connected:true


Then it publishes all cores on the hosts are down.  I just list three cores
here:

INFO  - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController;
publishing core=product1_shard1_replica1 state=down

INFO  - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController;
publishing core=collection1 state=down

INFO  - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController;
numShards not found on descriptor - reading it from system property

INFO  - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController;
publishing core=product2_shard5_replica1 state=down

INFO  - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController;
publishing core=product2_shard13_replica1 state=down


product1 has only one shard one replica and it's able to be active
successfully:

INFO  - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController;
Register replica - core:product1_shard1_replica1 address:http://
{internalIp}:8983/solr collection:product1 shard:shard1

WARN  - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext;
cancelElection did not find election node to remove

INFO  - 2016-04-21 12:38:26.393;
org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
process for shard shard1

INFO  - 2016-04-21 12:38:26.399;
org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to
continue.

INFO  - 2016-04-21 12:38:26.399;
org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader -
try and sync

INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
replicas to http://{internalIp}:8983/solr/product1_shard1_replica1/

INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
Success - now sync replicas to me

INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy;
http://{internalIp}:8983/solr/product1_shard1_replica1/
has no replicas

INFO  - 2016-04-21 12:38:26.399;
org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader:
http://{internalIp}:8983/solr/product1_shard1_replica1/ shard1

INFO  - 2016-04-21 12:38:26.399; org.apache.solr.common.cloud.SolrZkClient;
makePath: /collections/product1/leaders/shard1

INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; We are
http://{internalIp}:8983/solr/product1_shard1_replica1/ and leader is
http://{internalIp}:8983/solr/product1_shard1_replica1/

INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; No
LogReplay needed for core=product1_replica1 baseURL=http://
{internalIp}:8983/solr

INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; I am
the leader, no recovery necessary

INFO  - 2016-04-21 12:38:26.413; org.apache.solr.cloud.ZkController;
publishing core=product1_shard1_replica1 state=active


product2 has 15 shards one replica but only two shards lived on this
machine, this is one of the failed shard that I never seen the message of
the core product2_shard5_replica1 active:

INFO  - 2016-04-21 12:38:26.616; org.apache.solr.cloud.ZkController;
Register replica - product2_shard5_replica1 address:http://
{internalIp}:8983/solr collection:product2 shard:shard5

WARN  - 2016-04-21 12:38:26.618; org.apache.solr.cloud.ElectionContext;
cancelElection did not find election node to remove

INFO  - 2016-04-21 12:38:26.625;
org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
process for shard shard5

INFO  - 2016-04-21 12:38:26.631;
org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to
continue.

INFO  - 2016-04-21 12:38:26.631;
org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader -
try and sync

INFO  - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync
replicas to http://
{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/

INFO  - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync
Success - now sync replicas to me

INFO  - 2016-04-21 12:38:26.632; org.apache.solr.cloud.SyncStrategy;
http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
has no replicas

INFO  - 2016-04-21 12:38:26.632;

Child doc facet not getting terms, only counts

2016-04-26 Thread Yangrui Guo
 When I use subfaceting with Json API, the facet results only gave me
counts, no terms. My query is like this:

{
apparels : {
type: terms,
field: brand,
facet:{
  values:{
  type: query,
  q:\"brand:Chanel\",
  facet: {
type: terms,
field: madein
  }
  domain: { blockChildren : \"content_type:p\" }
  }
},
domain: { blockChildren : \"content_type:p\" }
}
}
}

And this is the results that I got:

facets={
count=57477,
apparels={
buckets=
{
val=Chanel,
count=6,
madein={
count=6

buckets={}
}
}
}
}

The second buckets got zero results but the count was correct. What was I
missing? Thanks so much!


Re: Replicas for same shard not in sync

2016-04-26 Thread Jeff Wartes

At the risk of thread hijacking, this is an area I'm not sure I fully 
understand, so I want to make sure.

I understand the case where a node is marked “down” in the clusterstate, but 
what if it’s down for less than the ZK heartbeat? That’s not unreasonable, I’ve 
seen some recommendations for really high ZK timeouts. Let’s assume there’s 
some big GC pause, or some other ephemeral service interruption that recovers 
very quickly.

So,
1. leader gets an update request
2. leader makes update requests to all live nodes
3. leader gets success responses from all but one replica
4. leader gets failure response from one replica

At this point we have different replicas with different data sets. Does 
anything signal that the failure-response node has now diverged? Does the 
leader attempt to roll back the other replicas? I’ve seen references to 
leader-initiated-recovery, is this that?

And regardless, is the update request considered a success (and reported as 
such to the client) by the leader?



On 4/25/16, 12:14 PM, "Erick Erickson"  wrote:

>Ted:
>Yes, deleting and re-adding the replica will be fine.
>
>Having commits happen from the client when you _also_ have
>autocommits that frequently (10 seconds and 1 second are pretty
>aggressive BTW) is usually not recommended or necessary.
>
>David:
>
>bq: if one or more replicas are down, updates presented to the leader
>still succeed, right?  If so, tedsolr is correct that the Solr client
>app needs to re-issue update
>
>Absolutely not the case. When the replicas are down, they're marked as
>down by Zookeeper. When then come back up they find the leader through
>Zookeeper magic and ask, essentially "Did I miss any updates"? If the
>replica did miss any updates it gets them from the leader either
>through the leader replaying the updates from its transaction log to
>the replica or by replicating the entire index from the leader. Which
>path is followed is a function of how far behind the replica is.
>
>In this latter case, any updates that come in to the leader while the
>replication is happening are buffered and replayed on top of the index
>when the full replication finishes.
>
>The net-net here is that you should not have to track whether updates
>got to all the replicas or not. One of the major advantages of
>SolrCloud is to remove that worry from the indexing client...
>
>Best,
>Erick
>
>On Mon, Apr 25, 2016 at 11:39 AM, David Smith
> wrote:
>> Erick,
>>
>> So that my understanding is correct, let me ask, if one or more replicas are 
>> down, updates presented to the leader still succeed, right?  If so, tedsolr 
>> is correct that the Solr client app needs to re-issue updates, if it wants 
>> stronger guarantees on replica consistency than what Solr provides.
>>
>> The “Write Fault Tolerance” section of the Solr Wiki makes what I believe is 
>> the same point:
>>
>> "On the client side, if the achieved replication factor is less than the 
>> acceptable level, then the client application can take additional measures 
>> to handle the degraded state. For instance, a client application may want to 
>> keep a log of which update requests were sent while the state of the 
>> collection was degraded and then resend the updates once the problem has 
>> been resolved."
>>
>>
>> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>>
>>
>> Kind Regards,
>>
>> David
>>
>>
>>
>>
>> On 4/25/16, 11:57 AM, "Erick Erickson"  wrote:
>>
>>>bq: I also read that it's up to the
>>>client to keep track of updates in case commits don't happen on all the
>>>replicas.
>>>
>>>This is not true. Or if it is it's a bug.
>>>
>>>The update cycle is this:
>>>1> updates get to the leader
>>>2> updates are sent to all followers and indexed on the leader as well
>>>3> each replica writes the updates to the local transaction log
>>>4> all the replicas ack back to the leader
>>>5> the leader responds to the client.
>>>
>>>At this point, all the replicas for the shard have the docs locally
>>>and can take over as leader.
>>>
>>>You may be confusing indexing in batches and having errors with
>>>updates getting to replicas. When you send a batch of docs to Solr,
>>>if one of them fails indexing some of the rest of the docs may not
>>>be indexed. See SOLR-445 for some work on this front.
>>>
>>>That said, bouncing servers willy-nilly during heavy indexing, especially
>>>if the indexer doesn't know enough to retry if an indexing attempt fails may
>>>be the root cause here. Have you verified that your indexing program
>>>retries in the event of failure?
>>>
>>>Best,
>>>Erick
>>>
>>>On Mon, Apr 25, 2016 at 6:13 AM, tedsolr  wrote:
 I've done a bit of reading - found some other posts with similar questions.
 So I gather "Optimizing" a collection is rarely a good idea. It does not
 need to be condensed to a single segment. I also read that it's 

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread Joel Bernstein
My blog is pretty out of date at this point unfortunately. I need to get
some better examples published.

Also there is a huge amount of work that went into the Solr 6 Streaming API and
Streaming Expressions that make them much easier to work with. In Solr 6.1
you'll be able to test Streaming Expressions from the Solr admin console
which should be very helpful.

Since you're planning on using the joins, performance will be very much
driven by the number of shards and replicas pushing the streams and the
number of workers performing the join.

If you're still having problems with Solr 6.0, feel free to post the
Expression you're using and I or other people can help debug the issue.
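
For example, a parallel hash join of the kind being discussed looks roughly like this
(the collections, fields and worker count are all made up for illustration):

   parallel(workerCollection,
     innerJoin(
       search(people, q="*:*", fl="personId,name",    sort="personId asc",
              partitionKeys="personId", qt="/export"),
       search(orders, q="*:*", fl="personId,orderId", sort="personId asc",
              partitionKeys="personId", qt="/export"),
       on="personId"),
     workers="4", sort="personId asc")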


Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Apr 26, 2016 at 2:29 PM, sudsport s  wrote:

> I see that some work was done to remove stream handler form config. so
> enabling stream handler is still security issue?
>
> https://issues.apache.org/jira/browse/SOLR-8262
>
> On Tue, Apr 26, 2016 at 11:14 AM, sudsport s  wrote:
>
>> I am using solr 5.3.1 server & solr5.5 on client ( solrj) . I will try
>> with solrj 6.0
>>
>> On Tue, Apr 26, 2016 at 11:12 AM, Susmit Shukla 
>> wrote:
>>
>>> Which solrj version are you using? could you try with solrj 6.0
>>>
>>> On Tue, Apr 26, 2016 at 10:36 AM, sudsport s 
>>> wrote:
>>>
>>> > @Joel
>>> > >Can you describe how you're planning on using Streaming?
>>> >
>>> > I am mostly using it for distirbuted join case. We were planning to use
>>> > similar logic (hash id and join) in Spark for our usecase. but since
>>> data
>>> > is stored in solr , I will be using solr stream to perform same
>>> operation.
>>> >
>>> > I have similar user cases to build probabilistic data-structures while
>>> > streaming results. I might have to spend some time in exploring query
>>> > optimization (while doing join decide sort order etc)
>>> >
>>> > Please let me know if you have any feedback.
>>> >
>>> > On Tue, Apr 26, 2016 at 10:30 AM, sudsport s 
>>> wrote:
>>> >
>>> > > Thanks @Reth yes that was my one of the concern. I will look at JIRA
>>> you
>>> > > mentioned.
>>> > >
>>> > > Thanks Joel
>>> > > I used some of examples for streaming client from your blog. I got
>>> basic
>>> > > tuple stream working but I get following exception while running
>>> parallel
>>> > > string.
>>> > >
>>> > >
>>> > > java.io.IOException: java.util.concurrent.ExecutionException:
>>> > > org.noggit.JSONParser$ParseException: JSON Parse Error:
>>> char=<,position=0
>>> > > BEFORE='<' AFTER='html>  >> > >
>>> > >
>>> > > looks like Parallel stream is trying to access /stream on shard. can
>>> > > someone tell me how to enable stream handler? I have export handler
>>> > > enabled. I will look at latest solrconfig to see if I can turn that
>>> on.
>>> > >
>>> > >
>>> > >
>>> > > @Joel I am running sizing exercises already , I will run new one with
>>> > > solr5.5+ and docValues on id enabled.
>>> > >
>>> > > BTW Solr streaming has amazing response times thanks for making it so
>>> > > FAST!!!
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Mon, Apr 25, 2016 at 10:54 AM, Joel Bernstein >> >
>>> > > wrote:
>>> > >
>>> > >> Can you describe how you're planning on using Streaming? I can
>>> provide
>>> > >> some
>>> > >> feedback on how it will perform for your use use.
>>> > >>
>>> > >> When scaling out Streaming you'll get large performance boosts when
>>> you
>>> > >> increase the number of shards, replicas and workers. This is
>>> > particularly
>>> > >> true if you're doing parallel relational algebra or map/reduce
>>> > operations.
>>> > >>
>>> > >> As far a DocValues being expensive with unique fields, you'll want
>>> to
>>> > do a
>>> > >> sizing exercise to see how many documents per-shard work best for
>>> your
>>> > use
>>> > >> case. There are different docValues implementations that will allow
>>> you
>>> > to
>>> > >> trade off memory for performance.
>>> > >>
>>> > >> Joel Bernstein
>>> > >> http://joelsolr.blogspot.com/
>>> > >>
>>> > >> On Mon, Apr 25, 2016 at 3:30 AM, Reth RM 
>>> wrote:
>>> > >>
>>> > >> > Hi,
>>> > >> >
>>> > >> > So, is the concern related to same field value being stored twice:
>>> > with
>>> > >> > stored=true and docValues=true? If that is the case, there is a
>>> jira
>>> > >> > relevant to this, fixed[1]. If you upgrade to 5.5/6.0 version, it
>>> is
>>> > >> > possible to read non-stored fields from docValues index., check
>>> out.
>>> > >> >
>>> > >> >
>>> > >> > [1] https://issues.apache.org/jira/browse/SOLR-8220
>>> > >> >
>>> > >> > On Mon, Apr 25, 2016 at 9:44 AM, sudsport s >> >
>>> > >> wrote:
>>> > >> >
>>> > >> > > Thanks Erik for reply,
>>> > >> > >
>>> > >> > > Since I was storing Id (its stored field) and after enabling
>>> > >> docValues my
>>> > >> > > guess is it will be stored in 2 

RE: Overall large size in Solr across collections

2016-04-26 Thread Allison, Timothy B.
> I can tell you that Tika is  quite the resource hog.  It is likely chewing up 
> CPU and memory 
> resources at an incredible rate, slowing down your Solr server.  You 
> would probably see better performance than ERH if you incorporate Tika 
> and SolrJ into a client indexing program that runs on a different machine 
> than Solr.

+1

It'd be interesting to see what happens if you use standalone tika-batch to see 
what the performance is.  

java -jar tika-app.jar -i <input_dir> -o <output_dir>

and if you're feeling adventurous:

java -jar tika-app.jar -i <input_dir> -o <output_dir> -J -t

You can specify the number of threads with -numConsumers 5 (don't use many more 
than # of cpus!)

Content extraction with Tika is usually slower (sometimes far slower) than the 
indexing step.  If you have any crazily slow docs, open an issue on Tika's JIRA.
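
And for the client-side Tika + SolrJ route mentioned above, a minimal sketch (paths,
collection and field names are invented) is simply:

   import java.io.File;
   import org.apache.solr.client.solrj.SolrClient;
   import org.apache.solr.client.solrj.impl.HttpSolrClient;
   import org.apache.solr.common.SolrInputDocument;
   import org.apache.tika.Tika;

   public class ClientSideExtract {
     public static void main(String[] args) throws Exception {
       SolrClient solr = new HttpSolrClient("http://solr-host:8983/solr/docs"); // SolrJ 5.x style
       Tika tika = new Tika();
       File[] files = new File("/data/input").listFiles();
       if (files == null) return;          // directory missing or unreadable
       for (File f : files) {
         SolrInputDocument doc = new SolrInputDocument();
         doc.addField("id", f.getName());
         // Tika runs here, on the client machine, so Solr only receives plain text
         doc.addField("content", tika.parseToString(f));
         solr.add(doc);
       }
       solr.commit();
       solr.close();
     }
   }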

Cheers,
 
  Tim



-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] 
Sent: Thursday, April 21, 2016 12:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Overall large size in Solr across collections

Hi Shawn,

Yes, I'm using the Extracting Request Handler.

The 0.7GB/hr is the indexing rate, measured as the size of the original documents 
that get ingested into Solr. This means that for every hour, only 0.7GB of my 
documents get ingested into Solr. It will require 10 hours just to index 
documents which are 7GB in size.

Regards,
Edwin




Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread sudsport s
I see that some work was done to remove the stream handler from the config, so
is enabling the stream handler still a security issue?

https://issues.apache.org/jira/browse/SOLR-8262
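
For context, the explicit registration that pre-6.x example configs used for this handler
looks roughly like the following (a sketch from memory, not taken from any specific release):

   <requestHandler name="/stream" class="solr.StreamHandler">
     <lst name="invariants">
       <str name="wt">json</str>
       <str name="distrib">false</str>
     </lst>
   </requestHandler>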

On Tue, Apr 26, 2016 at 11:14 AM, sudsport s  wrote:

> I am using solr 5.3.1 server & solr5.5 on client ( solrj) . I will try
> with solrj 6.0
>
> On Tue, Apr 26, 2016 at 11:12 AM, Susmit Shukla 
> wrote:
>
>> Which solrj version are you using? could you try with solrj 6.0
>>
>> On Tue, Apr 26, 2016 at 10:36 AM, sudsport s 
>> wrote:
>>
>> > @Joel
>> > >Can you describe how you're planning on using Streaming?
>> >
>> > I am mostly using it for distirbuted join case. We were planning to use
>> > similar logic (hash id and join) in Spark for our usecase. but since
>> data
>> > is stored in solr , I will be using solr stream to perform same
>> operation.
>> >
>> > I have similar user cases to build probabilistic data-structures while
>> > streaming results. I might have to spend some time in exploring query
>> > optimization (while doing join decide sort order etc)
>> >
>> > Please let me know if you have any feedback.
>> >
>> > On Tue, Apr 26, 2016 at 10:30 AM, sudsport s 
>> wrote:
>> >
>> > > Thanks @Reth yes that was my one of the concern. I will look at JIRA
>> you
>> > > mentioned.
>> > >
>> > > Thanks Joel
>> > > I used some of examples for streaming client from your blog. I got
>> basic
>> > > tuple stream working but I get following exception while running
>> parallel
>> > > string.
>> > >
>> > >
>> > > java.io.IOException: java.util.concurrent.ExecutionException:
>> > > org.noggit.JSONParser$ParseException: JSON Parse Error:
>> char=<,position=0
>> > > BEFORE='<' AFTER='html>  > > >
>> > >
>> > > looks like Parallel stream is trying to access /stream on shard. can
>> > > someone tell me how to enable stream handler? I have export handler
>> > > enabled. I will look at latest solrconfig to see if I can turn that
>> on.
>> > >
>> > >
>> > >
>> > > @Joel I am running sizing exercises already , I will run new one with
>> > > solr5.5+ and docValues on id enabled.
>> > >
>> > > BTW Solr streaming has amazing response times thanks for making it so
>> > > FAST!!!
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Apr 25, 2016 at 10:54 AM, Joel Bernstein 
>> > > wrote:
>> > >
>> > >> Can you describe how you're planning on using Streaming? I can
>> provide
>> > >> some
>> > >> feedback on how it will perform for your use use.
>> > >>
>> > >> When scaling out Streaming you'll get large performance boosts when
>> you
>> > >> increase the number of shards, replicas and workers. This is
>> > particularly
>> > >> true if you're doing parallel relational algebra or map/reduce
>> > operations.
>> > >>
>> > >> As far a DocValues being expensive with unique fields, you'll want to
>> > do a
>> > >> sizing exercise to see how many documents per-shard work best for
>> your
>> > use
>> > >> case. There are different docValues implementations that will allow
>> you
>> > to
>> > >> trade off memory for performance.
>> > >>
>> > >> Joel Bernstein
>> > >> http://joelsolr.blogspot.com/
>> > >>
>> > >> On Mon, Apr 25, 2016 at 3:30 AM, Reth RM 
>> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> > So, is the concern related to same field value being stored twice:
>> > with
>> > >> > stored=true and docValues=true? If that is the case, there is a
>> jira
>> > >> > relevant to this, fixed[1]. If you upgrade to 5.5/6.0 version, it
>> is
>> > >> > possible to read non-stored fields from docValues index., check
>> out.
>> > >> >
>> > >> >
>> > >> > [1] https://issues.apache.org/jira/browse/SOLR-8220
>> > >> >
>> > >> > On Mon, Apr 25, 2016 at 9:44 AM, sudsport s 
>> > >> wrote:
>> > >> >
>> > >> > > Thanks Erik for reply,
>> > >> > >
>> > >> > > Since I was storing Id (its stored field) and after enabling
>> > >> docValues my
>> > >> > > guess is it will be stored in 2 places. also as per my
>> understanding
>> > >> > > docValues are great when you have values which repeat. I am not
>> sure
>> > >> how
>> > >> > > beneficial it would be for uniqueId field.
>> > >> > > I am looking at collection of few hundred billion documents ,
>> that
>> > is
>> > >> > > reason I really want to care about expense from design phase.
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > > On Sun, Apr 24, 2016 at 7:24 PM, Erick Erickson <
>> > >> erickerick...@gmail.com
>> > >> > >
>> > >> > > wrote:
>> > >> > >
>> > >> > > > In a word, "yes".
>> > >> > > >
>> > >> > > > DocValues aren't particularly expensive, or expensive at all.
>> The
>> > >> idea
>> > >> > > > is that when you sort by a field or facet, the field has to be
>> > >> > > > "uninverted" which builds the entire structure in Java's JVM
>> (this
>> > >> is
>> > >> > > > when the field is _not_ DocValues).
>> > >> > > >
>> > >> > > > DocValues 

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread sudsport s
I am using a Solr 5.3.1 server and Solr 5.5 on the client (SolrJ). I will try
with SolrJ 6.0.

On Tue, Apr 26, 2016 at 11:12 AM, Susmit Shukla 
wrote:

> Which solrj version are you using? could you try with solrj 6.0
>
> On Tue, Apr 26, 2016 at 10:36 AM, sudsport s  wrote:
>
> > @Joel
> > >Can you describe how you're planning on using Streaming?
> >
> > I am mostly using it for distirbuted join case. We were planning to use
> > similar logic (hash id and join) in Spark for our usecase. but since data
> > is stored in solr , I will be using solr stream to perform same
> operation.
> >
> > I have similar user cases to build probabilistic data-structures while
> > streaming results. I might have to spend some time in exploring query
> > optimization (while doing join decide sort order etc)
> >
> > Please let me know if you have any feedback.
> >
> > On Tue, Apr 26, 2016 at 10:30 AM, sudsport s 
> wrote:
> >
> > > Thanks @Reth yes that was my one of the concern. I will look at JIRA
> you
> > > mentioned.
> > >
> > > Thanks Joel
> > > I used some of examples for streaming client from your blog. I got
> basic
> > > tuple stream working but I get following exception while running
> parallel
> > > string.
> > >
> > >
> > > java.io.IOException: java.util.concurrent.ExecutionException:
> > > org.noggit.JSONParser$ParseException: JSON Parse Error:
> char=<,position=0
> > > BEFORE='<' AFTER='html>   > >
> > >
> > > looks like Parallel stream is trying to access /stream on shard. can
> > > someone tell me how to enable stream handler? I have export handler
> > > enabled. I will look at latest solrconfig to see if I can turn that on.
> > >
> > >
> > >
> > > @Joel I am running sizing exercises already , I will run new one with
> > > solr5.5+ and docValues on id enabled.
> > >
> > > BTW Solr streaming has amazing response times thanks for making it so
> > > FAST!!!
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Apr 25, 2016 at 10:54 AM, Joel Bernstein 
> > > wrote:
> > >
> > >> Can you describe how you're planning on using Streaming? I can provide
> > >> some
> > >> feedback on how it will perform for your use use.
> > >>
> > >> When scaling out Streaming you'll get large performance boosts when
> you
> > >> increase the number of shards, replicas and workers. This is
> > particularly
> > >> true if you're doing parallel relational algebra or map/reduce
> > operations.
> > >>
> > >> As far a DocValues being expensive with unique fields, you'll want to
> > do a
> > >> sizing exercise to see how many documents per-shard work best for your
> > use
> > >> case. There are different docValues implementations that will allow
> you
> > to
> > >> trade off memory for performance.
> > >>
> > >> Joel Bernstein
> > >> http://joelsolr.blogspot.com/
> > >>
> > >> On Mon, Apr 25, 2016 at 3:30 AM, Reth RM 
> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > So, is the concern related to same field value being stored twice:
> > with
> > >> > stored=true and docValues=true? If that is the case, there is a jira
> > >> > relevant to this, fixed[1]. If you upgrade to 5.5/6.0 version, it is
> > >> > possible to read non-stored fields from docValues index., check out.
> > >> >
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/SOLR-8220
> > >> >
> > >> > On Mon, Apr 25, 2016 at 9:44 AM, sudsport s 
> > >> wrote:
> > >> >
> > >> > > Thanks Erik for reply,
> > >> > >
> > >> > > Since I was storing Id (its stored field) and after enabling
> > >> docValues my
> > >> > > guess is it will be stored in 2 places. also as per my
> understanding
> > >> > > docValues are great when you have values which repeat. I am not
> sure
> > >> how
> > >> > > beneficial it would be for uniqueId field.
> > >> > > I am looking at collection of few hundred billion documents , that
> > is
> > >> > > reason I really want to care about expense from design phase.
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Sun, Apr 24, 2016 at 7:24 PM, Erick Erickson <
> > >> erickerick...@gmail.com
> > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > In a word, "yes".
> > >> > > >
> > >> > > > DocValues aren't particularly expensive, or expensive at all.
> The
> > >> idea
> > >> > > > is that when you sort by a field or facet, the field has to be
> > >> > > > "uninverted" which builds the entire structure in Java's JVM
> (this
> > >> is
> > >> > > > when the field is _not_ DocValues).
> > >> > > >
> > >> > > > DocValues essentially serialize this structure to disk. So your
> > >> > > > on-disk index size is larger, but that size is MMaped rather
> than
> > >> > > > stored on Java's heap.
> > >> > > >
> > >> > > > Really, the question I'd have to ask though is "why do you care
> > >> about
> > >> > > > the expense?". If you have a functional requirement that has to
> be
> > >> > > > served by returning the id via the /export handler, you 

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread Susmit Shukla
Which SolrJ version are you using? Could you try with SolrJ 6.0?

On Tue, Apr 26, 2016 at 10:36 AM, sudsport s  wrote:

> @Joel
> >Can you describe how you're planning on using Streaming?
>
> I am mostly using it for distirbuted join case. We were planning to use
> similar logic (hash id and join) in Spark for our usecase. but since data
> is stored in solr , I will be using solr stream to perform same operation.
>
> I have similar user cases to build probabilistic data-structures while
> streaming results. I might have to spend some time in exploring query
> optimization (while doing join decide sort order etc)
>
> Please let me know if you have any feedback.
>
> On Tue, Apr 26, 2016 at 10:30 AM, sudsport s  wrote:
>
> > Thanks @Reth yes that was my one of the concern. I will look at JIRA you
> > mentioned.
> >
> > Thanks Joel
> > I used some of examples for streaming client from your blog. I got basic
> > tuple stream working but I get following exception while running parallel
> > string.
> >
> >
> > java.io.IOException: java.util.concurrent.ExecutionException:
> > org.noggit.JSONParser$ParseException: JSON Parse Error: char=<,position=0
> > BEFORE='<' AFTER='html>   >
> >
> > looks like Parallel stream is trying to access /stream on shard. can
> > someone tell me how to enable stream handler? I have export handler
> > enabled. I will look at latest solrconfig to see if I can turn that on.
> >
> >
> >
> > @Joel I am running sizing exercises already , I will run new one with
> > solr5.5+ and docValues on id enabled.
> >
> > BTW Solr streaming has amazing response times thanks for making it so
> > FAST!!!
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Apr 25, 2016 at 10:54 AM, Joel Bernstein 
> > wrote:
> >
> >> Can you describe how you're planning on using Streaming? I can provide
> >> some
> >> feedback on how it will perform for your use use.
> >>
> >> When scaling out Streaming you'll get large performance boosts when you
> >> increase the number of shards, replicas and workers. This is
> particularly
> >> true if you're doing parallel relational algebra or map/reduce
> operations.
> >>
> >> As far a DocValues being expensive with unique fields, you'll want to
> do a
> >> sizing exercise to see how many documents per-shard work best for your
> use
> >> case. There are different docValues implementations that will allow you
> to
> >> trade off memory for performance.
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Mon, Apr 25, 2016 at 3:30 AM, Reth RM  wrote:
> >>
> >> > Hi,
> >> >
> >> > So, is the concern related to same field value being stored twice:
> with
> >> > stored=true and docValues=true? If that is the case, there is a jira
> >> > relevant to this, fixed[1]. If you upgrade to 5.5/6.0 version, it is
> >> > possible to read non-stored fields from docValues index., check out.
> >> >
> >> >
> >> > [1] https://issues.apache.org/jira/browse/SOLR-8220
> >> >
> >> > On Mon, Apr 25, 2016 at 9:44 AM, sudsport s 
> >> wrote:
> >> >
> >> > > Thanks Erik for reply,
> >> > >
> >> > > Since I was storing Id (its stored field) and after enabling
> >> docValues my
> >> > > guess is it will be stored in 2 places. also as per my understanding
> >> > > docValues are great when you have values which repeat. I am not sure
> >> how
> >> > > beneficial it would be for uniqueId field.
> >> > > I am looking at collection of few hundred billion documents , that
> is
> >> > > reason I really want to care about expense from design phase.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Sun, Apr 24, 2016 at 7:24 PM, Erick Erickson <
> >> erickerick...@gmail.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > In a word, "yes".
> >> > > >
> >> > > > DocValues aren't particularly expensive, or expensive at all. The
> >> idea
> >> > > > is that when you sort by a field or facet, the field has to be
> >> > > > "uninverted" which builds the entire structure in Java's JVM (this
> >> is
> >> > > > when the field is _not_ DocValues).
> >> > > >
> >> > > > DocValues essentially serialize this structure to disk. So your
> >> > > > on-disk index size is larger, but that size is MMaped rather than
> >> > > > stored on Java's heap.
> >> > > >
> >> > > > Really, the question I'd have to ask though is "why do you care
> >> about
> >> > > > the expense?". If you have a functional requirement that has to be
> >> > > > served by returning the id via the /export handler, you really
> have
> >> no
> >> > > > choice.
> >> > > >
> >> > > > Best,
> >> > > > Erick
> >> > > >
> >> > > >
> >> > > > On Sun, Apr 24, 2016 at 9:55 AM, sudsport s  >
> >> > > wrote:
> >> > > > > I was trying to use Streaming for reading basic tuple stream. I
> am
> >> > > using
> >> > > > > sort by id asc ,
> >> > > > > I am getting following exception
> >> > > > >
> >> > > > > I am using export search handler as per
> >> > > 

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread sudsport s
@Joel
>Can you describe how you're planning on using Streaming?

I am mostly using it for a distributed join case. We were planning to use
similar logic (hash the id and join) in Spark for our use case, but since the
data is stored in Solr, I will be using a Solr stream to perform the same operation.

I also have use cases that build probabilistic data structures while
streaming results. I might have to spend some time exploring query
optimization (e.g. deciding the sort order when doing a join).

Please let me know if you have any feedback.
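
Roughly the shape of the reading side I have in mind, as a sketch only (zkHost,
collection and field names are placeholders, and the Map-based CloudSolrStream
constructor is the one used in the 5.x examples):

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

public class StreamReadSketch {
  public static void main(String[] args) throws Exception {
    String zkHost = "zk1:2181,zk2:2181,zk3:2181";   // placeholder
    Map<String, String> props = new HashMap<>();
    props.put("q", "*:*");
    props.put("fl", "id,join_key");                 // placeholder fields
    props.put("sort", "join_key asc");              // both sides sorted on the join key
    props.put("qt", "/export");                     // stream the full sorted result set

    CloudSolrStream stream = new CloudSolrStream(zkHost, "collection_a", props);
    try {
      stream.open();
      while (true) {
        Tuple tuple = stream.read();
        if (tuple.EOF) {
          break;
        }
        // hash on join_key here and merge with tuples read from the second collection
        System.out.println(tuple.getString("id"));
      }
    } finally {
      stream.close();
    }
  }
}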

On Tue, Apr 26, 2016 at 10:30 AM, sudsport s  wrote:

> Thanks @Reth yes that was my one of the concern. I will look at JIRA you
> mentioned.
>
> Thanks Joel
> I used some of examples for streaming client from your blog. I got basic
> tuple stream working but I get following exception while running parallel
> string.
>
>
> java.io.IOException: java.util.concurrent.ExecutionException:
> org.noggit.JSONParser$ParseException: JSON Parse Error: char=<,position=0
> BEFORE='<' AFTER='html>  
>
> looks like Parallel stream is trying to access /stream on shard. can
> someone tell me how to enable stream handler? I have export handler
> enabled. I will look at latest solrconfig to see if I can turn that on.
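
(For reference, the stock 5.x example solrconfig.xml defines the /stream handler
roughly as below; worth checking against the config shipped with the exact release.)

<requestHandler name="/stream" class="solr.StreamHandler">
  <lst name="invariants">
    <str name="wt">json</str>
    <str name="distrib">false</str>
  </lst>
</requestHandler>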
>
>
>
> @Joel I am running sizing exercises already , I will run new one with
> solr5.5+ and docValues on id enabled.
>
> BTW Solr streaming has amazing response times thanks for making it so
> FAST!!!
>
>
>
>
>
>
>
> On Mon, Apr 25, 2016 at 10:54 AM, Joel Bernstein 
> wrote:
>
>> Can you describe how you're planning on using Streaming? I can provide
>> some
>> feedback on how it will perform for your use use.
>>
>> When scaling out Streaming you'll get large performance boosts when you
>> increase the number of shards, replicas and workers. This is particularly
>> true if you're doing parallel relational algebra or map/reduce operations.
>>
>> As far a DocValues being expensive with unique fields, you'll want to do a
>> sizing exercise to see how many documents per-shard work best for your use
>> case. There are different docValues implementations that will allow you to
>> trade off memory for performance.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Mon, Apr 25, 2016 at 3:30 AM, Reth RM  wrote:
>>
>> > Hi,
>> >
>> > So, is the concern related to same field value being stored twice: with
>> > stored=true and docValues=true? If that is the case, there is a jira
>> > relevant to this, fixed[1]. If you upgrade to 5.5/6.0 version, it is
>> > possible to read non-stored fields from docValues index., check out.
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/SOLR-8220
>> >
>> > On Mon, Apr 25, 2016 at 9:44 AM, sudsport s 
>> wrote:
>> >
>> > > Thanks Erik for reply,
>> > >
>> > > Since I was storing Id (its stored field) and after enabling
>> docValues my
>> > > guess is it will be stored in 2 places. also as per my understanding
>> > > docValues are great when you have values which repeat. I am not sure
>> how
>> > > beneficial it would be for uniqueId field.
>> > > I am looking at collection of few hundred billion documents , that is
>> > > reason I really want to care about expense from design phase.
>> > >
>> > >
>> > >
>> > >
>> > > On Sun, Apr 24, 2016 at 7:24 PM, Erick Erickson <
>> erickerick...@gmail.com
>> > >
>> > > wrote:
>> > >
>> > > > In a word, "yes".
>> > > >
>> > > > DocValues aren't particularly expensive, or expensive at all. The
>> idea
>> > > > is that when you sort by a field or facet, the field has to be
>> > > > "uninverted" which builds the entire structure in Java's JVM (this
>> is
>> > > > when the field is _not_ DocValues).
>> > > >
>> > > > DocValues essentially serialize this structure to disk. So your
>> > > > on-disk index size is larger, but that size is MMaped rather than
>> > > > stored on Java's heap.
>> > > >
>> > > > Really, the question I'd have to ask though is "why do you care
>> about
>> > > > the expense?". If you have a functional requirement that has to be
>> > > > served by returning the id via the /export handler, you really have
>> no
>> > > > choice.
>> > > >
>> > > > Best,
>> > > > Erick
>> > > >
>> > > >
>> > > > On Sun, Apr 24, 2016 at 9:55 AM, sudsport s 
>> > > wrote:
>> > > > > I was trying to use Streaming for reading basic tuple stream. I am
>> > > using
>> > > > > sort by id asc ,
>> > > > > I am getting following exception
>> > > > >
>> > > > > I am using export search handler as per
>> > > > >
>> > https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
>> > > > >
>> > > > > null:java.io.IOException: id must have DocValues to use this
>> feature.
>> > > > > at
>> > > >
>> > >
>> >
>> org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:241)
>> > > > > at
>> > > >
>> > >
>> >
>> 

Degraded performance between Solr 4 and Solr 5

2016-04-26 Thread Jaroslaw Rozanski
Hi all,
 
I am migrating a large Solr Cloud cluster from Solr 4.10 to Solr 5.5.0
and I have observed a big difference in query execution time.
 
First a setup summary:
- multiple collections - 6
- each has multiple shards - 6
- same/similar hardware
- indexing tens of messages per second
- autoSoftCommit at 1s; hard commit every few tens of seconds
- Java 8
 
The query has the following form: field1:[* TO NOW-14DAYS] OR (-field1:[* TO
*] AND field2:[* TO NOW-14DAYS])
 
The fields field1 & field2 are of date type:

 
As query (q={!cache=false}...)
Solr 4.10 -> 5s
Solr 5.5.0 -> 12s
 
As filter query (q={!cache=false}*:*&fq=...)
Solr 4.10 -> 9s
Solr 5.5.0 -> 11s
 
The query itself is bad, and leaving its optimization aside, I am wondering
whether there is anything in Lucene/Solr that would have such an impact on
query execution time between versions.
 
Originally I thought it might be related to
https://issues.apache.org/jira/browse/SOLR-8251, and testing on a small
scale showed that there is a difference in performance. However, the upgraded
version is already 5.5.0.
 
 
 
Thanks,
Jarek
 


Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread sudsport s
Thanks @Reth, yes that was one of my concerns. I will look at the JIRA you
mentioned.

Thanks Joel,
I used some of the streaming client examples from your blog. I got a basic
tuple stream working, but I get the following exception while running a
parallel stream.


java.io.IOException: java.util.concurrent.ExecutionException:
org.noggit.JSONParser$ParseException: JSON Parse Error: char=<,position=0
BEFORE='<' AFTER='html>

On Mon, Apr 25, 2016 at 10:54 AM, Joel Bernstein wrote:

> Can you describe how you're planning on using Streaming? I can provide some
> feedback on how it will perform for your use use.
>
> When scaling out Streaming you'll get large performance boosts when you
> increase the number of shards, replicas and workers. This is particularly
> true if you're doing parallel relational algebra or map/reduce operations.
>
> As far a DocValues being expensive with unique fields, you'll want to do a
> sizing exercise to see how many documents per-shard work best for your use
> case. There are different docValues implementations that will allow you to
> trade off memory for performance.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Apr 25, 2016 at 3:30 AM, Reth RM  wrote:
>
> > Hi,
> >
> > So, is the concern related to same field value being stored twice: with
> > stored=true and docValues=true? If that is the case, there is a jira
> > relevant to this, fixed[1]. If you upgrade to 5.5/6.0 version, it is
> > possible to read non-stored fields from docValues index., check out.
> >
> >
> > [1] https://issues.apache.org/jira/browse/SOLR-8220
> >
> > On Mon, Apr 25, 2016 at 9:44 AM, sudsport s 
> wrote:
> >
> > > Thanks Erik for reply,
> > >
> > > Since I was storing Id (its stored field) and after enabling docValues
> my
> > > guess is it will be stored in 2 places. also as per my understanding
> > > docValues are great when you have values which repeat. I am not sure
> how
> > > beneficial it would be for uniqueId field.
> > > I am looking at collection of few hundred billion documents , that is
> > > reason I really want to care about expense from design phase.
> > >
> > >
> > >
> > >
> > > On Sun, Apr 24, 2016 at 7:24 PM, Erick Erickson <
> erickerick...@gmail.com
> > >
> > > wrote:
> > >
> > > > In a word, "yes".
> > > >
> > > > DocValues aren't particularly expensive, or expensive at all. The
> idea
> > > > is that when you sort by a field or facet, the field has to be
> > > > "uninverted" which builds the entire structure in Java's JVM (this is
> > > > when the field is _not_ DocValues).
> > > >
> > > > DocValues essentially serialize this structure to disk. So your
> > > > on-disk index size is larger, but that size is MMaped rather than
> > > > stored on Java's heap.
> > > >
> > > > Really, the question I'd have to ask though is "why do you care about
> > > > the expense?". If you have a functional requirement that has to be
> > > > served by returning the id via the /export handler, you really have
> no
> > > > choice.
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > >
> > > > On Sun, Apr 24, 2016 at 9:55 AM, sudsport s 
> > > wrote:
> > > > > I was trying to use Streaming for reading basic tuple stream. I am
> > > using
> > > > > sort by id asc ,
> > > > > I am getting following exception
> > > > >
> > > > > I am using export search handler as per
> > > > >
> > https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
> > > > >
> > > > > null:java.io.IOException: id must have DocValues to use this
> feature.
> > > > > at
> > > >
> > >
> >
> org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:241)
> > > > > at
> > > >
> > >
> >
> org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:120)
> > > > > at
> > > >
> > >
> >
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
> > > > > at
> > > >
> > org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
> > > > > at
> > > > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
> > > > > at
> > > >
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
> > > > > at
> > > >
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
> > > > > at
> > > >
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> > > > > at
> > > >
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> > > > > at
> > > >
> > >
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> > > > > at
> > > >
> > >
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> > > > > at
> > > >
> > >
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> > > > > at
> 

MoreLikeThis Component - how to get fields of documents

2016-04-26 Thread Dr. Jan Frederik Maas

Hello,

I want to use the MoreLikeThis component to get similar documents from a
sharded Solr setup. This works quite well, except that the documents in the
moreLikeThis list only contain the id/unique key of the documents.


Is it possible to get the other fields? I can of course do another query 
for the given IDs, but this would be complicated and very slow.


For example:

http://mysolrsystem/?q=id:524507260=true=title=0=true=true=title,id,topic

creates

(...)


646199803
613210832
562239472
819200034
539877271

(...)

I tried to modify the fl parameter, but I can only switch the id field in
the moreLikeThis documents on and off (the latter resulting in empty
document tags). In the main result list, however, the fl fields are shown
as specified.
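
The follow-up lookup mentioned above can at least be kept to a single request,
e.g. something like this, assuming a Solr version that has the {!terms} query
parser (ids taken from the example above):

http://mysolrsystem/?q={!terms f=id}646199803,613210832,562239472&fl=title,id,topic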


I would be very grateful for help.

Best wishes,
Jan Maas


Re: concat 2 fields

2016-04-26 Thread Jack Krupansky
As I myself had commented on that grokbase thread so many months ago, there
are examples of how to do this in my old Solr 4.x Deep Dive book.

If you read the grokbase thread carefully, you will see that you left out
the prefix "Custom" in front of "Concat" - this is not a standard Solr
feature.

Concat simply combines multiple values for a single field into a single
value. It does that for each specified field independently. It will not
concatenate two separate fields.

What you can do is Clone your second field to the name of the first field,
which will result in two values for the first field. Then you can use
Concat to combine the two values.
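
As a sketch of that clone-then-concat approach, using the latitude/longitude
example from earlier in this thread (field and chain names are taken from that
example; check the processor options against your Solr version):

<updateRequestProcessorChain name="concatFields">
  <!-- copy both source values into the destination field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">latitude</str>
    <str name="dest">geo_location</str>
  </processor>
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">longitude</str>
    <str name="dest">geo_location</str>
  </processor>
  <!-- collapse the two cloned values into one "lat,lon" value -->
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">geo_location</str>
    <str name="delimiter">,</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The geo_location field still has to be declared in schema.xml, and the chain
has to be referenced from the update handler (update.chain=concatFields) or it
will never run.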



-- Jack Krupansky

On Thu, Apr 21, 2016 at 5:29 AM, vrajesh  wrote:

> to concatenating two fields to use it as one field from
>
> http://grokbase.com/t/lucene/solr-user/138vr75hvj/concat-2-fields-in-another-field
> ,
> but the solution whichever is given i tried but its not working. please
> help
> me on it.
>  i am trying to concat latitude and longitude fields to make it as single
> unit using following:
>  
>
> 
>
>  
>  i added it to solrconfig.xml.
>
>  some of my doubts are :
>  - should we define destination field (geo_location) in schema.xml?
>
>  - i want to make this combined field  (geo_location) as field facet so i
> have to add   in
>
>  - any specific tag in which i should add above process script to make it
> working.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/concat-2-fields-tp4271760.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: concat 2 fields

2016-04-26 Thread vrajesh
I have tried two methods to define it, as follows:
1)
  
 
id 
id_title 
 
 
title 
id_title 
 
 

 

2)
  

  id
  title
   id_title
  ,

   


but neither of them is working.

I am indexing using the Solr admin documents page:
http://localhost:8983/solr/#//documents

Using the above configuration I should get a response with "id_title" as a new
field in Solr: http://localhost:8983/solr/ssp/select?q=*=json=true

but no such field is found.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/concat-2-fields-tp4271760p4272895.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Amazon CloudSearch

2016-04-26 Thread Sameer Maggon
Hi Sergio,

CloudSearch is a search-as-a-service offering that uses Solr underneath, though
it has a proprietary API for interacting with it, on both the document side and
the query side. It won't give you the ability to 'manage' Solr instances or the
cluster. If you have a use case where you want to keep pumping in data and
forget about it, it is good for that, as it offers auto-scaling. It does not
offer visibility into what's going on underneath, and there is no control over
solrconfig, custom plugins, etc. No spellcheck, etc.

If you are looking for a service that provides you direct access to Solr's
APIs without having to rewrite your application, then CloudSearch is
probably not what you are looking for.

Take a look at Measured Search (www.measuredsearch.com) - it offers
Solr-as-a-Service on top of AWS, Azure and Google Cloud, giving you direct
access to Solr and the ability to manage your instances. The platform
currently comprises three products:

1. SearchStax Cloud Manager - Allows you to deploy, manage and scale Solr.
- Provides High Availability as instances are front-ended with ELB (load
balancers).
- One time and scheduled backups.
- Cloning of deployments;
- Ability to add / remove nodes, real time log access and log archival.
- All deployments run on https, supports auth
- Enterprise version allows you to deploy & manage Solr within your AWS
account as well.
- Zookeeper deployment & setup.
- access to deploy custom JARs, etc.
- Supports Solr 4.8 and above (self serve version supports Solr 5.2.1 and
5.3.1)

2. SearchStax Pulse - Monitoring and Alerting for your Solr Clusters.
- System Level monitoring
- GC monitoring
- Search & Indexing monitoring
- Cache statistics
- Alerting on any of the above metrics at host and collection level.
- PagerDuty integration

3. SearchStax Analytics - User behavior Analytics that allows you to track
application level interactions and metrics to help you optimize your
search.
- Total searches,
- No result searches
- Click through rates
- conversion metrics for e-commerce scenarios
- query level details
- advanced version includes MRR reports, average click positions, etc.

Lastly, it provides 24x7x365 support and auto-scaling for customers that
opt for it.

Thanks,
Sameer.

On Tuesday, April 26, 2016, marotosg  wrote:

> Hi,
>
> I am evaluating the possibility of using Amazon CloudSearch to manage Solr
> insances. Reason is the price and time to manage and deploy. I am not fully
> sure yet how flexible is that service. in case you need to install a
> specific solr version or plug in.
> Do you have any experience with it?
>
> Would you please share any thoughts?
>
> Thanks
> Sergio
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Amazon-CloudSearch-tp4272875.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


-- 
*Sameer Maggon*
Measured Search
c: 310.344.7266
www.measuredsearch.com 


'batching when indexing is good' -> some questions

2016-04-26 Thread Bastien Latard - MDPI AG

Hi Erick (Erickson) & others,

I read your post 'batching when indexing is good'.
But I also read this one, which recommends using batchSize="-1".


So I now have some questions:
- when you speak about 'Packet Size', are you speaking about batchSize?
- where can I define the Integer.MIN_VALUE used by setFetchSize() on the
JDBC connection? (I use the MySQL JDBC driver)
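
As far as I understand it, you don't set Integer.MIN_VALUE anywhere yourself:
DIH's JdbcDataSource translates batchSize="-1" into setFetchSize(Integer.MIN_VALUE),
which is what tells the MySQL driver to stream rows instead of buffering the whole
result set. A sketch of the dataSource definition in data-config.xml (URL and
credentials are placeholders):

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb"
            user="solr_reader"
            password="secret"
            batchSize="-1"/>

And if I read the 'batching' post correctly, its packet/batch size refers to how
many documents are sent per update request, which is a separate knob from the
JDBC fetch size.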


Kind regards,
Bastien


ANN: Solr puzzle: Magic Date

2016-04-26 Thread Alexandre Rafalovitch
I am doing an experiment in teaching about Solr. I've created a Solr
puzzle and want to know whether people would find it useful to do more
of these. My mailing list has seen this already, but I would love the
feedback from a wider Solr audience as well. Privately or on the list.

The - first - puzzle is deceptively simple:

--
Given the following sequence of commands (for Solr 5.5 or 6.0):

1. bin/solr create_core -c puzzle_date
2. bin/post -c puzzle_date -type text/csv -d $'today\n2016-04-08'
3. curl http://localhost:8983/solr/puzzle_date/select?q=Fri

--
Would the result be:

1.Error in the command 1 for not providing a configuration directory
2.Error in the command 2 for missing a uniqueKey field
3.Error in the command 2 due to an incorrect date format
4.No records in the command 3 output
5.One record in the command 3 output
--

You can find the answer and full in-depth explanation at:
http://blog.outerthoughts.com/2016/04/solr-5-puzzle-magic-date-answer/

Again, what I am trying to understand is whether that's somehow useful
to people and worth making time to create and write-up.

Any feedback would be appreciated.

Regards,
Alex.


Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


Amazon CloudSearch

2016-04-26 Thread marotosg
Hi,

I am evaluating the possibility of using Amazon CloudSearch to manage Solr
instances. The reason is the price and the time needed to manage and deploy.
I am not fully sure yet how flexible that service is, in case you need to
install a specific Solr version or plugin.
Do you have any experience with it?

Would you please share any thoughts?

Thanks
Sergio



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Amazon-CloudSearch-tp4272875.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Cloud Indexing Performance degrades suddenly

2016-04-26 Thread Reth RM
What recent changes were made to the database or DIH? A version upgrade?
Addition of new fields? Co-location of the DB?


On Tue, Apr 26, 2016 at 2:47 PM, preeti kumari 
wrote:

> I am using solr 5.2.1 .
>
>
> -- Forwarded message --
> From: preeti kumari 
> Date: Mon, Apr 25, 2016 at 2:29 PM
> Subject: Solr Cloud Indexing Performance degrades suddenly
> To: solr-user@lucene.apache.org
>
>
> Hi,
>
> I have 2 solr cloud setups : Primary and secondary.
> Both are importing data from same DB . I am using DIH to index data.
>
> I was previously getting speed of 700docs/sec .
> Now suddenly primary cluster is giving me a speed of 20docs/sec.
> Same configs in Secondary is still giving 700 docs/sec speed.
> Both cluster servers are having same server specifications.
>
>
> I am looking for pointers where i can look for the reason for this degrade
> in indexing speed.
>
> Please help me out.
>
> Thanks
> Preeti
>


Re: concat 2 fields

2016-04-26 Thread Reth RM
Check whether you have also added the 'concatFields' chain definition in
solrconfig.xml.
How are you indexing, by the way?
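
For reference, wiring a chain named concatFields into the update handler usually
looks roughly like this (a sketch; adjust to the handlers you actually post to):

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">concatFields</str>
  </lst>
</requestHandler>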


On Tue, Apr 26, 2016 at 12:24 PM, vrajesh  wrote:

> Hi,
> i have added it to /update request handler as per following in
> solrconfig.xml:
>  
> 
>  application/json
>   concatFields
>
>   
>   
> 
>  application/csv
>   concatFields
>
>   
>
> but when i query it after indexing new files, i dont see any concatenated
> field.
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/concat-2-fields-tp4271760p4272829.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


HttpSolrClient issue

2016-04-26 Thread srinivasarao vundavalli
I am using the HttpSolrClient class to query and fetch documents from the Solr
index. I am passing my custom HttpClient object to HttpSolrClient:
HttpSolrClient solrClient = new HttpSolrClient(url, httpClient);

This prevents me from setting the maximum number of connections via the
setMaxTotalConnections method, since the HTTP client was created outside and
the operation is not allowed. Can someone let me know how I can set this value
if I create the HTTP client outside the HttpSolrClient class?
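
One approach that I believe works with the 5.x SolrJ is to size the pool when
building the HttpClient yourself, and only then hand it to HttpSolrClient
(limits and URL below are placeholders):

import org.apache.http.client.HttpClient;
import org.apache.solr.client.solrj.impl.HttpClientUtil;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.params.ModifiableSolrParams;

public class SolrClientFactory {
  public static HttpSolrClient create(String url) {
    ModifiableSolrParams params = new ModifiableSolrParams();
    // pool-wide and per-host connection limits, applied when the client is built
    params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 128);
    params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 32);
    HttpClient httpClient = HttpClientUtil.createClient(params);

    // setMaxTotalConnections() is not needed (or allowed) on this instance,
    // because the limits already live on the externally created HttpClient
    return new HttpSolrClient(url, httpClient);
  }
}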

-- 
http://cheyuta-helpinghands.blogspot.com


Fwd: Solr Cloud Indexing Performance degrades suddenly

2016-04-26 Thread preeti kumari
I am using solr 5.2.1 .


-- Forwarded message --
From: preeti kumari 
Date: Mon, Apr 25, 2016 at 2:29 PM
Subject: Solr Cloud Indexing Performance degrades suddenly
To: solr-user@lucene.apache.org


Hi,

I have 2 solr cloud setups : Primary and secondary.
Both are importing data from same DB . I am using DIH to index data.

I was previously getting a speed of 700 docs/sec.
Now, suddenly, the primary cluster is giving me a speed of 20 docs/sec.
The same configs on the secondary are still giving 700 docs/sec.
Both clusters have the same server specifications.


I am looking for pointers on where to look for the reason for this degradation
in indexing speed.

Please help me out.

Thanks
Preeti


Indexing performance on HDFS

2016-04-26 Thread KORTMANN Stefan (MORPHO)
Hi,

Can indexing on HDFS somehow be tuned using pluggable codecs or a customized
PostingsFormat? What settings would you recommend for using Lucene 5.5 on HDFS?
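
In case it helps, if this is running through Solr's HdfsDirectoryFactory rather
than raw Lucene, the knobs I'm aware of are on the directory/block-cache side
rather than the codec side, roughly as below (paths and sizes are placeholders):

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>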

Regards,
Stefan



RunTimeLib Transformers

2016-04-26 Thread Basel Ariqat
Hi,
I want to have my Solr transformers loaded at run time (using the .system
collection to upload jars), but this feature seems to work only with request
handlers, response writers and other plugins in solrconfig.xml; it doesn't work
with anything in data-config.xml, probably because that depends on the
DataImportHandler plugin.

I'm planning to write a custom DataImportHandler so I can load it with
runtimeLib, but I don't think this is the right way to do it.

If you have any ideas on how to get transformers and other classes in
data-config.xml loaded at runtime, instead of restarting Solr every time we
modify the transformer, please share them.

Thanks in advance :D

Regards,
Basel Ariqat.


Re: concat 2 fields

2016-04-26 Thread vrajesh
Hi,
I have added it to the /update request handler as follows in
solrconfig.xml:
 

 application/json
  concatFields
   
  
  

 application/csv
  concatFields
   
  

but when I query after indexing new files, I don't see any concatenated
field.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/concat-2-fields-tp4271760p4272829.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How can I set the defaultOperator to be AND?

2016-04-26 Thread Bastien Latard - MDPI AG

Thank you Shawn, Jan and Georg for your answers.

Yes, it seems that if I simply remove the defaultOperator it works well 
for "composed queries" like '(a:x AND b:y) OR c:z'.

But I think that the default Operator should/could be the AND.

Because when I add an extra search word, I expect that the results get 
more accurate...

(It seems to be what google is also doing now)
   ||

Otherwise, if you make a search and apply another filter (e.g.: sort by 
publication date, facets, ...) , user can get the less relevant item 
(only 1 word in 4 matches) in first position only because of its date...


What do you think?


Kind regards,
Bastien


On 25/04/2016 14:53, Shawn Heisey wrote:

On 4/25/2016 6:39 AM, Bastien Latard - MDPI AG wrote:

Remember:
If I add the following line to the schema.xml, even if I do a search
'title:"test" OR author:"me"', it will returns documents matching
'title:"test" AND author:"me"':


The settings in the schema for default field and default operator were
deprecated a long time ago.  I actually have no idea whether they are
even supported in newer Solr versions.

The q.op parameter controls the default operator, and the df parameter
controls the default field.  These can be set in the request handler
definition in solrconfig.xml -- usually in "defaults" but there might be
reason to put them in "invariants" instead.
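
A sketch of what that looks like in a request handler definition (the field
name in df is just an example):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">text</str>
    <str name="q.op">AND</str>
  </lst>
</requestHandler>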

If you're using edismax, you'd be better off using the mm parameter
rather than the q.op parameter.  The behavior you have described above
sounds like a change in behavior (some call it a bug) introduced in the
5.5 version:

https://issues.apache.org/jira/browse/SOLR-8812

If you are using edismax, I suspect that if you set mm=100% instead of
q.op=AND (or the schema default operator) that the problem might go away
... but I am not sure.  Someone who is more familiar with SOLR-8812
probably should comment.

Thanks,
Shawn




Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/