Re: Joining across collections with Nested documents

2017-03-02 Thread Mikhail Khludnev
Related docs can be retrieved with
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[subquery]
but searching related docs is less ready.
Here is a patch for query time join across collections
https://issues.apache.org/jira/browse/SOLR-8297.
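For the docs in the question, a [subquery] request might look roughly like
this (field names taken from the sample docs; the collection layout is a
sketch, and fromIndex is subject to the same co-location caveats as
{!join fromIndex=...}):

  fl=*,company_docs:[subquery fromIndex=company]
  &company_docs.q={!terms f=id v=$row.CompanyId}
  &company_docs.rows=1

Here $row.CompanyId takes the CompanyId value from each returned child
document and uses it to fetch the matching company document.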

On Fri, Mar 3, 2017 at 8:55 AM, Preeti Bhat 
wrote:

> Hi All,
>
> I have two collections in SolrCloud, namely contact and company; they are
> in the same Solr instance. Company is a relatively simple document with id,
> Name, address etc. Contact, on the other hand, has nested documents like
> the one below. I would like to get the Company details by joining the
> "CompanyId" field in the child documents to the "Company" collection's id
> field. Is this possible? Could someone please guide me on this?
>
> {
>   id: "1",
>   FirstName: "ABC",
>   LastName: "BCD",
>   ...
>   _childDocuments_: [
>     {
>       id: "123-1",
>       CompanyId: "123",
>       Email: "abc@smd.edu"
>     },
>     {
>       id: "124-1",
>       CompanyId: "124",
>       Email: "abc@smd.edu"
>     }
>   ]
> }
>
>
>
> Thanks and Regards,
> Preeti Bhat
>
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Joining across collections with Nested documents

2017-03-02 Thread Walter Underwood
Make one collection with denormalized data. This looks like a relational, 
multi-table schema in Solr. That will be slow and painful.
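For example (a sketch using the field names from the question, with the
company fields elided), each contact/company pair becomes one flat document:

  {
    id: "123-1",
    FirstName: "ABC",
    LastName: "BCD",
    Email: "abc@smd.edu",
    CompanyId: "123",
    CompanyName: "...",
    CompanyAddress: "..."
  }

One query then filters and returns contact and company fields together, at
the cost of re-indexing contacts whenever a company changes.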

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 2, 2017, at 9:55 PM, Preeti Bhat  wrote:
> 
> Hi All,
> 
> I have two collections in SolrCloud, namely contact and company; they are in
> the same Solr instance. Company is a relatively simple document with id, Name,
> address etc. Contact, on the other hand, has nested documents like the one
> below. I would like to get the Company details by joining the "CompanyId"
> field in the child documents to the "Company" collection's id field.
> Is this possible? Could someone please guide me on this?
> 
> {
>   id: "1",
>   FirstName: "ABC",
>   LastName: "BCD",
>   ...
>   _childDocuments_: [
>     {
>       id: "123-1",
>       CompanyId: "123",
>       Email: "abc@smd.edu"
>     },
>     {
>       id: "124-1",
>       CompanyId: "124",
>       Email: "abc@smd.edu"
>     }
>   ]
> }
> 
> 
> 
> Thanks and Regards,
> Preeti Bhat
> 
> 
> 



Joining across collections with Nested documents

2017-03-02 Thread Preeti Bhat
Hi All,

I have two collections in SolrCloud, namely contact and company; they are in
the same Solr instance. Company is a relatively simple document with id, Name,
address etc. Contact, on the other hand, has nested documents like the one
below. I would like to get the Company details by joining the "CompanyId"
field in the child documents to the "Company" collection's id field. Is this
possible? Could someone please guide me on this?

{
  id: "1",
  FirstName: "ABC",
  LastName: "BCD",
  ...
  _childDocuments_: [
    {
      id: "123-1",
      CompanyId: "123",
      Email: "abc@smd.edu"
    },
    {
      id: "124-1",
      CompanyId: "124",
      Email: "abc@smd.edu"
    }
  ]
}



Thanks and Regards,
Preeti Bhat







Re: How to update index after document expired.

2017-03-02 Thread XuQing Tan
Solr gets the updated content from an external source (by calling a REST API
which returns XML content).
So my question is: how can I plug this logic
into DocExpirationUpdateProcessorFactory, i.e. poll from the external source
and update the index?

For now I'm thinking of using a custom 'autoDeleteChainName'. I'm still
experimenting with this; is it feasible?


  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <str name="autoDeleteChainName">scheduled-delete-and-update</str>
    ...
  </processor>


  Thanks & Best Regards!

  ///
 (. .)
  ooO--(_)--Ooo
  |   Nick Tan   |
  

On Thu, Mar 2, 2017 at 7:36 PM, Alexandre Rafalovitch 
wrote:

> Where would Solr get the updated content? Do you mean would it poll
> from external source to refresh? Then, no. And if it is pushed from
> external sources to Solr, then you just replace it as normal.
>
> Not sure if I understand your use-case exactly.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 2 March 2017 at 22:29, XuQing Tan  wrote:
> > Hi folks
> >
> > in our case, we have contents need to be refreshed periodically according
> > to the TTL of each document.
> >
> > looks like DocExpirationUpdateProcessorFactory is a quite good fit
> except
> > that it does delete the document only, but no way to update the indexing
> > with the new document.
> >
> > I don't see there's a way to hook into DocExpirationUpdateProcessorFactory
> > for custom logic like get the document and update index. and even
> > DocExpirationUpdateProcessorFactory is a final class.
> >
> > so just want to confirm with you, is there any existing solution for
> this?
> >
> > otherwise, I might have our own copy of DocExpirationUpdateProcessorFactory
> > with custom code.
> >
> >
> >   Thanks & Best Regards!
> >
> >   ///
> >  (. .)
> >   ooO--(_)--Ooo
> >   |   Nick Tan   |
> >   
>


Re: How to update index after document expired.

2017-03-02 Thread Alexandre Rafalovitch
Where would Solr get the updated content? Do you mean would it poll
from external source to refresh? Then, no. And if it is pushed from
external sources to Solr, then you just replace it as normal.

Not sure if I understand your use-case exactly.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 2 March 2017 at 22:29, XuQing Tan  wrote:
> Hi folks
>
> in our case, we have contents need to be refreshed periodically according
> to the TTL of each document.
>
> looks like DocExpirationUpdateProcessorFactory is a quite good fit except
> that it does delete the document only, but no way to update the indexing
> with the new document.
>
> I don't see there's a way to hook into DocExpirationUpdateProcessorFactory
> for custom logic like get the document and update index. and even
> DocExpirationUpdateProcessorFactory is a final class.
>
> so just want to confirm with you, is there any existing solution for this?
>
> otherwise, I might have our own copy of DocExpirationUpdateProcessorFactory
> with custom code.
>
>
>   Thanks & Best Regards!
>
>   ///
>  (. .)
>   ooO--(_)--Ooo
>   |   Nick Tan   |
>   


How to update index after document expired.

2017-03-02 Thread XuQing Tan
Hi folks

in our case, we have contents need to be refreshed periodically according
to the TTL of each document.

looks like DocExpirationUpdateProcessorFactory is a quite good fit except
that it does delete the document only, but no way to update the indexing
with the new document.

I don't see there's a way to hook into DocExpirationUpdateProcessorFactory
for custom logic like get the document and update index. and even
DocExpirationUpdateProcessorFactory is a final class.

so just want to confirm with you, is there any existing solution for this?

otherwise, I might have our own copy of DocExpirationUpdateProcessorFactory
with custom code.


  Thanks & Best Regards!

  ///
 (. .)
  ooO--(_)--Ooo
  |   Nick Tan   |
  


Re: Delta Import JDBC connection frame size larger than max length

2017-03-02 Thread Shawn Heisey
On 3/1/2017 8:48 AM, Liu, Daphne wrote:
> Hello Solr experts, Is there a place in Solr (Delta Import
> Datasource?) where I can adjust the JDBC connection frame size to 256
> mb ? I have adjusted the settings in Cassandra but I'm still getting
> this error. NonTransientConnectionException:
> org.apache.thrift.transport.TTransportException: Frame size (17676563)
> larger than max length (16384000). Thank you.

Whatever your JDBC driver can do with JDBC URL parameters, Solr can ask
it to do.  There's probably a URL parameter to change that, but you'll
need to check with the driver docs.

This is the url definition in one of my DIH configs, where I use a URL
parameter to tell the MySQL JDBC driver that I want zero dates to be
eliminated, because Solr can't handle them:

url="jdbc:mysql://${dih.request.dbHost}:3306/${dih.request.dbSchema}?zeroDateTimeBehavior=convertToNull"

JDBC is intended to be a generic database access framework, and DIH
layers an even more generic configuration on top of JDBC.  This falls
into the realm of special configuration.  URL parameters are the best
way to handle special configuration for a specific JDBC driver, and the
ONLY way to do it with DIH, unless you are interested in writing some
custom java code to use with Solr.

Solving this problem may require talking to whoever created the JDBC
driver that you are using.  This is the question I would ask them:

"Are there JDBC URL parameters to configure the frame size and other
similar settings, or another way to change the configuration that does
not involve writing custom Java code?"

I was unable to determine what driver you're using, or to find any kind
of documentation about how you might configure it.

As an alternative, you could try lowering the frame size on the DB server.
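For example, if the backend here is Cassandra over Thrift (a guess based on
the TTransportException in the error), the server-side knob would be the
frame size setting in cassandra.yaml:

  # cassandra.yaml (server side) -- shown only as an illustration
  thrift_framed_transport_size_in_mb: 15

Setting it below the client's 16384000-byte limit would keep the server from
producing frames the driver refuses to read.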

Thanks,
Shawn



Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Shawn Heisey
On 3/2/2017 8:04 AM, Caruana, Matthew wrote:
> I’m currently performing an optimise operation on a ~190GB index with about 4 
> million documents. The process has been running for hours.
>
> This is surprising, because the machine is an EC2 r4.xlarge with four cores 
> and 30GB of RAM, 24GB of which is allocated to the JVM.
>
> The load average has been steady at about 1.3. Memory usage is 25% or less 
> the whole time. iostat reports ~6% util.
>
> What gives?

On one of my systems, which is running 6.3.0, the optimize of a 50GB
index takes 1.73 hours.  This system has very fast storage -- six SATA
drives in RAID10.  The machine has eight 2.8GHz CPU cores (two Intel
E5440), 64GB of memory, a 13GB heap, and over 700GB of total index
data.  I would like more memory, but the machine is maxed, and this is
the dev server, which doesn't need to perform as well as production.

As others have said, an optimize rewrites the whole index.  The optimize
does NOT proceed at full disk I/O rate, though.  The speed of the disks
has very little influence on optimize speed unless they are REALLY
slow.  Any modern disk should be able to keep up easily.

It's not just a straight copy of data.  Lucene must do a very heavy
amount of data processing.  Except for the fact that the source data is
a little bit different and text analysis does not need to be repeated,
what Lucene ends up doing is a lot like the initial indexing process. 
All existing data must be examined, minus deleted documents.  A new term
list for the optimized segment (which covers the ENTIRE index dataset)
must be built.  This will be a significant portion of the total size,
and is likely to be millions or billions of terms for an index that size
-- this takes time to process.  The rest of the files that make up an
index segment also require some significant processing to rewrite to a
new segment.

An optimize is a purely Lucene operation, so I do not know whether the
bug in Solr 6.4.1 that causes high CPU usage affects it directly.  It
CAN definitely affect it indirectly, by making the CPU less available.

https://issues.apache.org/jira/browse/SOLR-10130

The statement you might have heard that an optimize can take 3x the
space is sort of true, but not in the way that most think.  It is true
that an optimize might result in total space consumption that's
temporarily three times the *final* index size, but when looking at the
*starting* index size, the most it should ever take is double.  It is a
good idea to always plan on having 3x the disk space for your index,
though.  There are certain situations that you can experience, even
during normal operation when not running an optimize, where the index
can grow to triple size before it shrinks.

Another issue which might be relevant:  Assuming that this 190GB index
is the only index on the machine, you've only left yourself with 6GB of
RAM to cache 190GB of index data.  This will be even less if the server
has other software running on it.  That's not enough RAM for good
general performance.  If the machine has more indexes than just this
190GB, then the situation is even worse.

General performance will get worse during an optimize on most systems. 
I cannot say for sure that having too little system memory for good
caching will cause an optimize to be very slow.  I think the processing
involved might make it somewhat immune to that particular problem, if
the optimize is the only thing the server is doing.  If the server is
busy with queries and/or indexing during an optimize, I would expect a
very low memory situation like that to slow EVERYTHING down.

https://wiki.apache.org/solr/SolrPerformanceProblems

On my 6.3.0 dev system, optimizing a 190GB index would take more than
six hours.  Running with memory so low and on 6.4.1 with its CPU bug, it
might take even longer.

The 6.4.2 release that fixes the performance bug should be available
sometime in the next week or so.

Thanks,
Shawn



Re: Setting up to index multiple datastores

2017-03-02 Thread Shawn Heisey
On 3/2/2017 6:44 PM, Alexandre Rafalovitch wrote:
> And if you are not using SolrCloud, you can have
> collection=shard=core, so the terminology gets confused. But you can
> definitely have many cores on one mail server. You can also make them
> lazy, so not all cores have to be loaded. That would definitely allow
> you to have a core per user and only searched cores would be loaded.
> And relevance might be a bit better too, as each user will get their
> own term counts.

Yes, definitely.  If you abandon SolrCloud, then you gain some features
that we call "LotsOfCores", and the number of indexes you have is not so
relevant for performance.  There may be a short delay when a user
connects and tries to access their search, while Solr spins the core
back up.  I have no way to estimate how long that delay would be.

https://wiki.apache.org/solr/LotsOfCores
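As a sketch, a per-user core managed this way needs only a couple of flags in
its core.properties (the core name here is invented):

  name=mail_user_jsmith
  transient=true
  loadOnStartup=false

With transient=true and loadOnStartup=false, Solr loads the core on first
access and may unload it again when the transient core cache fills up.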

As Alexandre mentions, there is a little bit of terminology confusion,
because cores will sometimes be referred to as collections, especially
by those who are not as familiar with internal terminology used by the
project.  Until recently, the first core encountered by a new user,
originating before SolrCloud was available, was named "collection1". 
This did not help with the confusion.

Thanks,
Shawn



Re: OR condition between !frange and normal query

2017-03-02 Thread Zheng Lin Edwin Yeo
Hi Emir,

Thanks for your reply.

For the query:

q=_query_:"({!frange l=1}ms(startDate_dt,endDate_dt)" OR
_query_:"startDate:[2000-01-01T00:00:00Z TO *] AND
endDate:[2016-12-31T23:59:59Z]"

Must the _query_ be one of the fields in the index? I do not have any
fields in the index that relate to the output of the query, and if I put
something that is not one of the fields in the index, it does not return
any results.

Regards,
Edwin



On 2 March 2017 at 17:04, Emir Arnautovic 
wrote:

> Hi Edwin,
>
> You can use subqueries:
>
> q=_query_:"({!frange l=1}ms(startDate_dt,endDate_dt)" OR
> _query_:"startDate:[2000-01-01T00:00:00Z TO *] AND
> endDate:[2016-12-31T23:59:59Z]"
>
> HTH,
> Emir
>
>
>
> On 02.03.2017 04:51, Zheng Lin Edwin Yeo wrote:
>
>> Hi,
>>
>> Would like to check, how can we do an OR condition between !frange and
>> normal query?
>>
>> For example, I want to have the following condition in my query:
>>
>> ({!frange l=1}ms(startDate_dt,endDate_dt) OR
>> (startDate:[2000-01-01T00:00:00Z TO *] AND endDate:[2016-12-31T23:59:59Z]
>> ))
>>
>> How can we put it in the Solr query URL for Solr to recognize this
>> condition?
>>
>> I'm using Solr 6.4.1
>>
>> Thank you.
>>
>> Regards,
>> Edwin
>>
>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: Setting up to index multiple datastores

2017-03-02 Thread Alexandre Rafalovitch
And if you are not using SolrCloud, you can have
collection=shard=core, so the terminology gets confused. But you can
definitely have many cores on one mail server. You can also make them
lazy, so not all cores have to be loaded. That would definitely allow
you to have a core per user and only searched cores would be loaded.
And relevance might be a bit better too, as each user will get their
own term counts.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 2 March 2017 at 20:14, Shawn Heisey  wrote:
> On 3/2/2017 2:58 PM, Daniel Miller wrote:
>> One of the many features of the Dovecot IMAP server is Solr support.
>> This obviously provides full-text-searching of stored mails - and it
>> works great.  But...the focus of the Dovecot team and mailing list is
>> Dovecot configuration.  I'm asking for some guidance on how I might
>> optimize Solr.
>
> I use Solr for work.  I use Dovecot for personal domains.  I have not
> used them together.  I probably should -- my personal mailbox is many
> gigabytes and would benefit from a boost in search performance.
>
>> At the moment I have a (I think!) reasonably well-defined schema that
>> seems to perform well.  In my particular use case, I have a single
>> physical server running Linux with available VirtualBox virtual
>> servers.  I am presently running Solr within one of the virtual
>> servers, and I'm running SolrCloud even though I only have one core
>> (it just seemed to work better).
>>
>> Now because I have a single collection/core/shard - all the mail users
>> and all their mail folders are stored/indexed/searched by this single
>> Solr instance.  I'm thinking that I'd like to split the indexing on at
>> least a per-user fashion - possibly also on a per-mailbox fashion.
>> Dovecot does allow for variable substitution in the Solr URL - so I
>> should be able to generate the necessary URL requests on the Dovecot
>> side.  What I don't know is:
>>
>> 1.  Is it possible to split the "indexes" (I'm still learning Solr
>> vocabulary) without creating separate "cores" (which to me means
>> separate Java instances)?
>> 2.  Can these separate "indexes" be created on-demand - or do they
>> need to be explicitly created prior to use?
>
> Here's a paragraph that hopefully clears up most confusion about Solr
> terminology.  This is applicable to SolrCloud:
>
> Collections are made up of one or more shards.  Shards are made up of
> one or more replicas.  Each replica is a core.  One replica from each
> shard is elected as the leader of that shard, and if there are multiple
> replicas, the leader role can move between them in response to a change
> in cluster state.
>
> Further info: One Solr instance (JVM) can handle many cores.  SolrCloud
> allows multiple Solr instances to coordinate with each other (via
> ZooKeeper) and form a whole cluster.  Without SolrCloud, you have cores,
> but no collections and no replicas.  Sharding is possible without
> SolrCloud, but is handled mostly manually.  Replication is possible
> without SolrCloud, but works very differently, and has a single point of
> failure due to the fact that switching master servers isn't something
> that's done easily.  SolrCloud is a true cluster, no masters or slaves.
>
> https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
> https://cwiki.apache.org/confluence/display/solr/Index+Replication
>
> SolrCloud also makes it VERY easy to create new collections (logical
> indexes) if the desired index config is already in the zookeeper
> database.  It can be done entirely with an HTTP request:
>
> https://cwiki.apache.org/confluence/display/solr/Collections+API
>
> One thing to note:  SolrCloud begins to have performance issues when the
> number of collections in the cloud reaches the low hundreds.  It's not
> going to scale very well with a collection per user or per mailbox
> unless there aren't very many users.  There are people looking into how
> to scale better, but this hasn't really gone anywhere yet.  Here's one
> issue about it, with a lot of very dense comments:
>
> https://issues.apache.org/jira/browse/SOLR-7191
>
> Thanks,
> Shawn
>


Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-02 Thread Alexandre Rafalovitch
What do you have for merge configuration in solrconfig.xml? You should
be able to tune it to - approximately - whatever you want without
doing the grand optimize:
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-MergingIndexSegments
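For example (Solr 6.x syntax, values purely illustrative), something like
this in the indexConfig section of solrconfig.xml bounds how many segments
accumulate per tier:

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicyFactory>
  </indexConfig>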

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 2 March 2017 at 16:37, Caruana, Matthew  wrote:
> Yes, we already do it outside Solr. See https://github.com/ICIJ/extract which 
> we developed for this purpose. My guess is that the documents are very large, 
> as you say.
>
> Optimising was always an attempt to bring down the number of segments from 
> 60+. Not sure how else to do that.
>
>> On 2 Mar 2017, at 7:42 pm, Michael Joyner  wrote:
>>
>> You can solve the disk space and time issues by specifying multiple segments 
>> to optimize down to instead of a single segment.
>>
>> When we reindex we have to optimize or we end up with hundreds of segments 
>> and very horrible performance.
>>
>> We optimize down to like 16 segments or so and it doesn't do the 3x disk 
>> space thing and usually runs in a decent amount of time. (we have >50 
>> million articles in one of our solr indexes).
>>
>>
>>> On 03/02/2017 10:20 AM, David Hastings wrote:
>>> Agreed, and since it takes three times the space is part of the reason it
>>> takes so long, so that 190gb index ends up writing another 380 gb until it
>>> compresses down and deletes the two left over files.  its a pretty hefty
>>> operation
>>>
>>> On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch 
>>> wrote:
>>>
>>>> Optimize operation is no longer recommended for Solr, as the
>>>> background merges got a lot smarter.
>>>>
>>>> It is an extremely expensive operation that can require up to 3-times
>>>> amount of disk during the processing.
>>>>
>>>> This is not to say yours is a valid question, which I am leaving to
>>>> others to respond.
>>>>
>>>> Regards,
>>>>    Alex.
>>>>
>>>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>>>
>>>>
>>>> On 2 March 2017 at 10:04, Caruana, Matthew  wrote:
>>>>> I’m currently performing an optimise operation on a ~190GB index with
>>>>> about 4 million documents. The process has been running for hours.
>>>>> This is surprising, because the machine is an EC2 r4.xlarge with four
>>>>> cores and 30GB of RAM, 24GB of which is allocated to the JVM.
>>>>> The load average has been steady at about 1.3. Memory usage is 25% or
>>>>> less the whole time. iostat reports ~6% util.
>>>>> What gives?
>>>>>
>>>>> Running Solr 6.4.1.
>>


Re: Problem with facet and multivalued field

2017-03-02 Thread Sales
Yes, so the Terms component will of course show me the same thing as the facet
query; I am sure the facet query is not wrong. It shows ` in the values, no
matter which unique product key, even though there should be 0 of them since
there is a splitBy. Was there something else you wanted me to look for? So I am
wondering how in the world a backtick can get into the index. After all, there
is a splitBy (and always has been). And there is a transformer of course:

http://ourserver:8080/solr/prod/terms?terms.fl=categories&terms.regex=.*`.*

This should return 0 values since it's split on the `, and if it wasn't split
on it, I wouldn't get 3 facet results, I'd only get 1 in the original example.

So, something else is going on, but I can't seem to find it. Have any other
ideas, or were you thinking of something else?

It clearly does the split, but then additionally adds the non-split value.


> On Mar 2, 2017, at 5:27 PM, Erick Erickson  wrote:
> 
> "should" is the operative term here. My guess is that the data you're putting
> in the index isn't what you think it is.
> 
> I'd suggest you use the TermsComponent to examine the data actually in
> your index.
> 
> Best,
> Erick
> 
> On Thu, Mar 2, 2017 at 3:18 PM, Sales
>  wrote:
>> We are using Solr 4.10.4. I have a few Solr fields in schema.xml defined as 
>> follows:
>> 
>>   <field ... multiValued="true" stored="false" required="true" />
>>   <field ... stored="false" />
>>   <field ... multiValued="true" stored="false" />
>> 
>> Both of them are loaded in via data-config.xml import handler, and they are 
>> defined there as:
>> 
>>   <field ... />
>>   <field ... />
>>   <field ... splitBy="`" />
>> 
>> This has been working for years, but, lately, we have noticed strange 
>> happenings, not sure what triggered it. Note a few things: category and 
>> categories both have the same exact source field. categoryfacet 
>> contains the same data as categories, with an additional piece of data in 
>> each entry.
>> 
>> So, sample data:
>> 
>> category and categories loaded from a mysql database with value:
>> 
>> "Software Maintenance Agreement`Technical Support Services”
>> 
>> So, this should equate to two field values, "Software Maintenance Agreement” 
>> and "Technical Support Services”
>> 
>> categoryfacet for the same product has the following mysql value before 
>> loading:
>> 
>> "Software Maintenance Agreement~60005`Technical Support Services~60184"
>> 
>> So, basically the same just with an extra piece of data used by our system
>> 
>> So, these are bulk loaded via the data import handler, and, I then do a 
>> simple query:
>> 
>> http://ourserver:8080/solr/prod/select?q=10001548&facet.field=categories&facet=true&rows=1
>> 
>> And this results in:
>> 
>> <lst name="categories">
>>   <int name="Software Maintenance Agreement">1</int>
>>   <int name="Software Maintenance Agreement`Technical Support Services">1</int>
>>   <int name="Technical Support Services">1</int>
>> </lst>
>>
>> Note the problem here, there are THREE values, and one of them is the 
>> original non split field.
>> 
>> Let’s do the same query on category since it comes from the same source 
>> field:
>> 
>> <lst name="category">
>>   <int name="agreement">1</int>
>>   <int name="maintenance">1</int>
>>   <int name="services">1</int>
>>   <int name="software">1</int>
>>   <int name="support">1</int>
>>   <int name="technical">1</int>
>> </lst>
>>
>> And let’s do the same query for categoryfacet since it’s almost identical 
>> and not tokenized:
>> 
>> <lst name="categoryfacet">
>>   <int name="Software Maintenance Agreement~60005">1</int>
>>   <int name="Technical Support Services~60184">1</int>
>> </lst>
>>
>> Note it does not have a third value! I can’t seem to figure out what might 
>> be causing three values for the categories facet result. Any ideas?
>> 
>> Steve
>> 



Re: Setting up to index multiple datastores

2017-03-02 Thread Shawn Heisey
On 3/2/2017 2:58 PM, Daniel Miller wrote:
> One of the many features of the Dovecot IMAP server is Solr support. 
> This obviously provides full-text-searching of stored mails - and it
> works great.  But...the focus of the Dovecot team and mailing list is
> Dovecot configuration.  I'm asking for some guidance on how I might
> optimize Solr.

I use Solr for work.  I use Dovecot for personal domains.  I have not
used them together.  I probably should -- my personal mailbox is many
gigabytes and would benefit from a boost in search performance.

> At the moment I have a (I think!) reasonably well-defined schema that
> seems to perform well.  In my particular use case, I have a single
> physical server running Linux with available VirtualBox virtual
> servers.  I am presently running Solr within one of the virtual
> servers, and I'm running SolrCloud even though I only have one core
> (it just seemed to work better).
>
> Now because I have a single collection/core/shard - all the mail users
> and all their mail folders are stored/indexed/searched by this single
> Solr instance.  I'm thinking that I'd like to split the indexing on at
> least a per-user fashion - possibly also on a per-mailbox fashion. 
> Dovecot does allow for variable substitution in the Solr URL - so I
> should be able to generate the necessary URL requests on the Dovecot
> side.  What I don't know is:
>
> 1.  Is it possible to split the "indexes" (I'm still learning Solr
> vocabulary) without creating separate "cores" (which to me means
> separate Java instances)?
> 2.  Can these separate "indexes" be created on-demand - or do they
> need to be explicitly created prior to use?

Here's a paragraph that hopefully clears up most confusion about Solr
terminology.  This is applicable to SolrCloud:

Collections are made up of one or more shards.  Shards are made up of
one or more replicas.  Each replica is a core.  One replica from each
shard is elected as the leader of that shard, and if there are multiple
replicas, the leader role can move between them in response to a change
in cluster state.

Further info: One Solr instance (JVM) can handle many cores.  SolrCloud
allows multiple Solr instances to coordinate with each other (via
ZooKeeper) and form a whole cluster.  Without SolrCloud, you have cores,
but no collections and no replicas.  Sharding is possible without
SolrCloud, but is handled mostly manually.  Replication is possible
without SolrCloud, but works very differently, and has a single point of
failure due to the fact that switching master servers isn't something
that's done easily.  SolrCloud is a true cluster, no masters or slaves.

https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
https://cwiki.apache.org/confluence/display/solr/Index+Replication

SolrCloud also makes it VERY easy to create new collections (logical
indexes) if the desired index config is already in the zookeeper
database.  It can be done entirely with an HTTP request:

> https://cwiki.apache.org/confluence/display/solr/Collections+API
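>
> For example (collection and configset names invented):
>
> curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mail&numShards=1&replicationFactor=2&collection.configName=mailconf"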

One thing to note:  SolrCloud begins to have performance issues when the
number of collections in the cloud reaches the low hundreds.  It's not
going to scale very well with a collection per user or per mailbox
unless there aren't very many users.  There are people looking into how
to scale better, but this hasn't really gone anywhere yet.  Here's one
issue about it, with a lot of very dense comments:

https://issues.apache.org/jira/browse/SOLR-7191

Thanks,
Shawn



Re: Problem with facet and multivalued field

2017-03-02 Thread Erick Erickson
"should" is the operative term here. My guess is that the data you're putting
in the index isn't what you think it is.

I'd suggest you use the TermsComponent to examine the data actually in
your index.

Best,
Erick

On Thu, Mar 2, 2017 at 3:18 PM, Sales
 wrote:
> We are using Solr 4.10.4. I have a few Solr fields in schema.xml defined as 
> follows:
>
> <field ... multiValued="true" stored="false" required="true" />
> <field ... stored="false" />
> <field ... multiValued="true" stored="false" />
>
> Both of them are loaded in via data-config.xml import handler, and they are 
> defined there as:
>
>   <field ... />
>   <field ... />
>   <field ... splitBy="`" />
>
> This has been working for years, but, lately, we have noticed strange 
> happenings, not sure what triggered it. Note a few things: category and 
> categories both have the same exact source field. categoryfacet contains 
> the same data as categories, with an additional piece of data in each entry.
>
> So, sample data:
>
> category and categories loaded from a mysql database with value:
>
> "Software Maintenance Agreement`Technical Support Services”
>
> So, this should equate to two field values, "Software Maintenance Agreement” 
> and "Technical Support Services”
>
> categoryfacet for the same product has the following mysql value before 
> loading:
>
> "Software Maintenance Agreement~60005`Technical Support Services~60184"
>
> So, basically the same just with an extra piece of data used by our system
>
> So, these are bulk loaded via the data import handler, and, I then do a 
> simple query:
>
> http://ourserver:8080/solr/prod/select?q=10001548&facet.field=categories&facet=true&rows=1
>
> And this results in:
>
> <lst name="categories">
>   <int name="Software Maintenance Agreement">1</int>
>   <int name="Software Maintenance Agreement`Technical Support Services">1</int>
>   <int name="Technical Support Services">1</int>
> </lst>
>
>
> Note the problem here, there are THREE values, and one of them is the 
> original non split field.
>
> Let’s do the same query on category since it comes from the same source field:
>
> <lst name="category">
>   <int name="agreement">1</int>
>   <int name="maintenance">1</int>
>   <int name="services">1</int>
>   <int name="software">1</int>
>   <int name="support">1</int>
>   <int name="technical">1</int>
> </lst>
>
>
>
> And let’s do the same query for categoryfacet since it’s almost identical and 
> not tokenized:
>
> <lst name="categoryfacet">
>   <int name="Software Maintenance Agreement~60005">1</int>
>   <int name="Technical Support Services~60184">1</int>
> </lst>
>
>
>
> Note it does not have a third value! I can’t seem to figure out what might be 
> causing three values for the categories facet result. Any ideas?
>
> Steve
>


Problem with facet and multivalued field

2017-03-02 Thread Sales
We are using Solr 4.10.4. I have a few Solr fields in schema.xml defined as 
follows:

   <field ... multiValued="true" stored="false" required="true" />
   <field ... stored="false" />
   <field ... multiValued="true" stored="false" />

Both of them are loaded in via data-config.xml import handler, and they are 
defined there as:

  <field ... />
  <field ... />
  <field ... splitBy="`" />

This has been working for years, but, lately, we have noticed strange 
happenings, not sure what triggered it. Note a few things: category and 
categories both have the same exact source field. categoryfacet contains 
the same data as categories, with an additional piece of data in each entry.

So, sample data:

category and categories loaded from a mysql database with value:

"Software Maintenance Agreement`Technical Support Services”

So, this should equate to two field values, "Software Maintenance Agreement” 
and "Technical Support Services”

categoryfacet for the same product has the following mysql value before loading:

"Software Maintenance Agreement~60005`Technical Support Services~60184"

So, basically the same just with an extra piece of data used by our system

So, these are bulk loaded via the data import handler, and, I then do a simple 
query:

http://ourserver:8080/solr/prod/select?q=10001548&facet.field=categories&facet=true&rows=1

And this results in:

<lst name="categories">
  <int name="Software Maintenance Agreement">1</int>
  <int name="Software Maintenance Agreement`Technical Support Services">1</int>
  <int name="Technical Support Services">1</int>
</lst>

Note the problem here, there are THREE values, and one of them is the original 
non split field.

Let’s do the same query on category since it comes from the same source field:

<lst name="category">
  <int name="agreement">1</int>
  <int name="maintenance">1</int>
  <int name="services">1</int>
  <int name="software">1</int>
  <int name="support">1</int>
  <int name="technical">1</int>
</lst>


And let’s do the same query for categoryfacet since it’s almost identical and 
not tokenized:

<lst name="categoryfacet">
  <int name="Software Maintenance Agreement~60005">1</int>
  <int name="Technical Support Services~60184">1</int>
</lst>


Note it does not have a third value! I can’t seem to figure out what might be 
causing three values for the categories facet result. Any ideas?

Steve



Setting up to index multiple datastores

2017-03-02 Thread Daniel Miller
One of the many features of the Dovecot IMAP server is Solr support.  
This obviously provides full-text-searching of stored mails - and it 
works great.  But...the focus of the Dovecot team and mailing list is 
Dovecot configuration.  I'm asking for some guidance on how I might 
optimize Solr.


At the moment I have a (I think!) reasonably well-defined schema that 
seems to perform well.  In my particular use case, I have a single 
physical server running Linux with available VirtualBox virtual 
servers.  I am presently running Solr within one of the virtual servers, 
and I'm running SolrCloud even though I only have one core (it just 
seemed to work better).


Now because I have a single collection/core/shard - all the mail users 
and all their mail folders are stored/indexed/searched by this single 
Solr instance.  I'm thinking that I'd like to split the indexing on at 
least a per-user fashion - possibly also on a per-mailbox fashion.  
Dovecot does allow for variable substitution in the Solr URL - so I 
should be able to generate the necessary URL requests on the Dovecot 
side.  What I don't know is:


1.  Is it possible to split the "indexes" (I'm still learning Solr 
vocabulary) without creating separate "cores" (which to me means 
separate Java instances)?
2.  Can these separate "indexes" be created on-demand - or do they need 
to be explicitly created prior to use?


--
Daniel



Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-02 Thread Caruana, Matthew
Yes, we already do it outside Solr. See https://github.com/ICIJ/extract which 
we developed for this purpose. My guess is that the documents are very large, 
as you say.

Optimising was always an attempt to bring down the number of segments from 60+. 
Not sure how else to do that.

> On 2 Mar 2017, at 7:42 pm, Michael Joyner  wrote:
> 
> You can solve the disk space and time issues by specifying multiple segments 
> to optimize down to instead of a single segment.
> 
> When we reindex we have to optimize or we end up with hundreds of segments 
> and very horrible performance.
> 
> We optimize down to like 16 segments or so and it doesn't do the 3x disk 
> space thing and usually runs in a decent amount of time. (we have >50 million 
> articles in one of our solr indexes).
> 
> 
>> On 03/02/2017 10:20 AM, David Hastings wrote:
>> Agreed, and since it takes three times the space is part of the reason it
>> takes so long, so that 190gb index ends up writing another 380 gb until it
>> compresses down and deletes the two left over files.  its a pretty hefty
>> operation
>> 
>> On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch 
>> wrote:
>> 
>>> Optimize operation is no longer recommended for Solr, as the
>>> background merges got a lot smarter.
>>> 
>>> It is an extremely expensive operation that can require up to 3-times
>>> amount of disk during the processing.
>>> 
>>> This is not to say yours is a valid question, which I am leaving to
>>> others to respond.
>>> 
>>> Regards,
>>>Alex.
>>> 
>>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>> 
>>> 
>>> On 2 March 2017 at 10:04, Caruana, Matthew  wrote:
>>>> I’m currently performing an optimise operation on a ~190GB index with
>>>> about 4 million documents. The process has been running for hours.
>>>> This is surprising, because the machine is an EC2 r4.xlarge with four
>>>> cores and 30GB of RAM, 24GB of which is allocated to the JVM.
>>>> The load average has been steady at about 1.3. Memory usage is 25% or
>>>> less the whole time. iostat reports ~6% util.
>>>> What gives?
>>>>
>>>> Running Solr 6.4.1.
> 


Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-02 Thread Caruana, Matthew
I typically end up with about 60-70 segments after indexing. What configuration 
do you use to bring it down to 16?

> On 2 Mar 2017, at 7:42 pm, Michael Joyner  wrote:
> 
> You can solve the disk space and time issues by specifying multiple segments 
> to optimize down to instead of a single segment.
> 
> When we reindex we have to optimize or we end up with hundreds of segments 
> and very horrible performance.
> 
> We optimize down to like 16 segments or so and it doesn't do the 3x disk 
> space thing and usually runs in a decent amount of time. (we have >50 million 
> articles in one of our solr indexes).
> 
> 
>> On 03/02/2017 10:20 AM, David Hastings wrote:
>> Agreed, and since it takes three times the space is part of the reason it
>> takes so long, so that 190gb index ends up writing another 380 gb until it
>> compresses down and deletes the two left over files.  its a pretty hefty
>> operation
>> 
>> On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch 
>> wrote:
>> 
>>> Optimize operation is no longer recommended for Solr, as the
>>> background merges got a lot smarter.
>>> 
>>> It is an extremely expensive operation that can require up to 3-times
>>> amount of disk during the processing.
>>> 
>>> This is not to say yours is a valid question, which I am leaving to
>>> others to respond.
>>> 
>>> Regards,
>>>Alex.
>>> 
>>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>> 
>>> 
>>> On 2 March 2017 at 10:04, Caruana, Matthew  wrote:
>>>> I’m currently performing an optimise operation on a ~190GB index with
>>>> about 4 million documents. The process has been running for hours.
>>>> This is surprising, because the machine is an EC2 r4.xlarge with four
>>>> cores and 30GB of RAM, 24GB of which is allocated to the JVM.
>>>> The load average has been steady at about 1.3. Memory usage is 25% or
>>>> less the whole time. iostat reports ~6% util.
>>>> What gives?
>>>>
>>>> Running Solr 6.4.1.
> 


Re: Does {!child} query support nested Queries ("v=")

2017-03-02 Thread Mikhail Khludnev
Hello, Frank!

The closest equivalent would be q=+type:userAccount +givenName:test*
And please make sure that it's parsed correctly with debugQuery=true.
Can you also narrow the query to troubleshoot the difference?
Ahhh, I probably understand now: shard results are merged by uniqueKey. Can
you share your schema and sample docs?
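In other words, something along these lines (a sketch; the inner query must
match only parent documents, hence the +type:userAccount clause, and pq is
just a made-up parameter name used via parameter dereferencing):

  q={!child of="type:userAccount" v=$pq}&pq=+type:userAccount +givenName:test*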

On Thu, Mar 2, 2017 at 5:53 PM, Kelly, Frank  wrote:

> This is Solr Cloud 5.3.1
>
> I have a query like the following
> q={!child of="type:userAccount" v="givenName:test*"}
>
> Intent: Show me all children of the type:userAccount where
> userAccount.givenName:test*
>
> If I run the query multiple times I get a very different numFound
> difference 186,560 to 187,412 (+/0 500).
>
> If I run the “normal” query on just the  parents
> q=type:userAccount givenName:test*
>
> I get a very stable numFound
>
> Reading the docs it’s not documented as supported but neither do I get an
> error
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
>
> Am I using nestedQueries correctly?
>
> -Frank
>
>
>
>
>
>
>
> *Frank Kelly*
>
> *Principal Software Engineer*
>
> Identity Profile Team (SCBE, Traces, CDA)
>
>
> HERE
>
> 5 Wayside Rd, Burlington, MA 01803, USA
>
> *42° 29' 7" N 71° 11' 32" W*
>
>
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Excessive Wire logging while indexing.

2017-03-02 Thread Erick Erickson
Glad to hear it's working. The trick (as you've probably discovered)
is to properly
map the meta-data to Solr fields. The extracting request handler does
this, but the
real underlying issue is that there's no real standard. Word docs
might have "last_editor",
PDFs might have just "author". And on and on and on.

Anyway, sounds like you're on your way. The code snippet Shawn
referenced dumps all
the meta-data Tika finds so you can figure out what you need.

Best,
Erick

On Thu, Mar 2, 2017 at 11:56 AM, Phil Scadden  wrote:
> Got it all working with Tika and SolrJ. (Got the correct artifacts). Much 
> faster now too which is good. Thanks very much for your help.


RE: Excessive Wire logging while indexing.

2017-03-02 Thread Phil Scadden
Got it all working with Tika and SolrJ. (Got the correct artifacts). Much 
faster now too which is good. Thanks very much for your help.


Re: Using SOLR to search for Names from RDBMS

2017-03-02 Thread Alexandre Rafalovitch
You would absolutely want to read the "Relevant Search" book first. It is
based on Elasticsearch examples, but the concepts map to Solr (and
there is an appendix).

(The following is mostly for names and phone numbers; I don't know about addresses.)

The core issue is that you will want to set up a bunch of copyFields to
create different levels of analysis precision for the names, so that more
precise versions of the name match with a higher boost.
Otherwise, you are going to have an issue where a common surname
matches against the name (Smith Jones) and gets really boosted.
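A rough sketch of that layering (field names invented): copyFields feed
fields of decreasing precision, which are then weighted in an edismax qf:

  <copyField source="last_name" dest="last_name_exact"/>
  <copyField source="last_name" dest="last_name_phonetic"/>

  qf=last_name_exact^10 last_name^3 last_name_phonetic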

Then, you will want to have a look at phonetic mapping. There are
several different algorithms, depending on the kinds of names you get.
Some are better for Western, some are better for Eastern European. You
can mix them again with balancing the boosts.

You have to decide whether you are doing one big search box (could get
messy) or one where people can enter different elements in different
boxes. The latter is easier to tune, but you need to pass the data to
Solr from your middleware. This example may help show how to construct
the search line when only some of the search fields are provided:
https://gist.github.com/arafalov/5e04884e5aefaf46678c

Also, if you search phone numbers, you may want to do suffix search
(last n digits of the number). I recommend squishing all non-digits,
reversing the string and doing EdgeNGrams. Makes it a lot easier. I
did a presentation on this a couple of years back, I could dig it out
probably if you need more details.
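A sketch of such a field type (the factories are standard Solr ones; the
gram sizes are arbitrary): strip non-digits and reverse at both index and
query time, and EdgeNGram only at index time, so a "last n digits" query
matches the stored grams as a prefix of the reversed number:

  <fieldType name="phone_suffix" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="[^0-9]" replacement="" replace="all"/>
      <filter class="solr.ReverseStringFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="[^0-9]" replacement="" replace="all"/>
      <filter class="solr.ReverseStringFilterFactory"/>
    </analyzer>
  </fieldType>

A query for the last four digits, say 1234, becomes "4321" internally and
matches the stored edge grams.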

Regards,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 2 March 2017 at 13:11, Bijesh EB  wrote:
> Hi All,
>
> First off, what a fabulous job you all are doing creating and supporting an
> open source solution! Great Work and many thanks for that.
>
>  I am reasonably new to SOLR and our team is trying to integrate SOLR to a
> structured database to help with Searching Person Records (first name, last
> name, address etc).
>  Our developers (who also are currently learning SOLR) have managed to get
> the initial set up done, but when I am trying to test, I do see that there
> are a lot of seemingly unrelated results coming up potentially because of
> the combinations of algorithms used.
> Because of that, I was wondering, is there any subset of algorithms that
> are recommended to be used when working with a relational DB for items such as
> names and addresses as opposed to searching a website for a string etc.
> Also is there any link to algorithms and their behaviours with some
> examples that non technical persons can relate to.
>
> I am not a technical person myself, but I am trying to learn from the
> experts here who might have been there and done that many times over.
>
> Thanks,
>
> Bijesh  EB


Re: OOM

2017-03-02 Thread Erick Erickson
When you restart, there are a bunch of threads that start up that can
chew up stack space.
If the message says something about "unable to start native thread"
then it's not raw memory
but the stack space.

Doesn't really sound like this is your error, but thought I'd mention it.

On Wed, Mar 1, 2017 at 6:37 PM, Rick Leir  wrote:
> Thanks Shawn, of course it must be the -Xmx. It is interesting that we do not 
> see the OOM until restarting.
>
> On March 1, 2017 8:18:11 PM EST, Shawn Heisey  wrote:
>>On 2/27/2017 4:57 PM, Rick Leir wrote:
>>> We get an OOM after stopping then starting Solr (with a tiny index).
>>> Is there something I could check quickly before I break out the Eclipse
>>> debugger? Maybe Marple could tell me about problems in the index?
>>
>>There are exactly two ways of dealing with OOME, assuming that there's
>>not a bug in the software.  1) Increase the heap size.  2) Figure out
>>why the program needs so much memory and change the configuration to
>>reduce the amount required.  Action number 2 may prove to be difficult.
>>
>>There are no other possible solutions.
>>
>>We are not aware of any memory leaks in Solr, Lucene, or any of the
>>other dependencies.  There could be a memory leak, but if there is, it
>>has not yet been discovered.
>>
>>It is likely that even if you have a stacktrace showing the OOME error
>>that the place where the OOME occurred will have absolutely nothing to
>>do with the part of the system that is requiring a lot of memory.  You
>>can share this error if you want, but I will warn you that it is
>>probably not useful information.
>>
>>Here's some generic info about what causes high heap requirements with
>>Solr.  It is not an exhaustive list:
>>
>>https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
>>
>>Further down on the page is a section about *reducing* the amount of
>>heap required.
>>
>>Thanks,
>>Shawn
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.


Using SOLR to search for Names from RDBMS

2017-03-02 Thread Bijesh EB
Hi All,

First off, what a fabulous job you all are doing creating and supporting an
open source solution! Great Work and many thanks for that.

 I am reasonably new to SOLR and our team is trying to integrate SOLR to a
structured database to help with Searching Person Records (first name, last
name, address etc).
 Our developers (who also are currently learning SOLR) have managed to get
the initial set up done, but when I am trying to test, I do see that there
are a lot of seemingly unrelated results coming up potentially because of
the combinations of algorithms used.
Because of that, I was wondering, is there any subset of algorithms that
are recommended to be used when working with a relational DB for items such as
names and addresses as opposed to searching a website for a string etc.
Also is there any link to algorithms and their behaviours with some
examples that non technical persons can relate to.

I am not a technical person myself, but I am trying to learn from the
experts here who might have been there and done that many times over.

Thanks,

Bijesh  EB


Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-02 Thread Michael Joyner
You can solve the disk space and time issues by specifying multiple 
segments to optimize down to instead of a single segment.


When we reindex we have to optimize or we end up with hundreds of 
segments and very horrible performance.


We optimize down to like 16 segments or so and it doesn't do the 3x disk 
space thing and usually runs in a decent amount of time. (we have >50 
million articles in one of our solr indexes).
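For reference, that is just the maxSegments parameter on an optimize
request, e.g. (URL and collection name are illustrative):

  curl "http://localhost:8983/solr/mycollection/update?optimize=true&maxSegments=16"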



On 03/02/2017 10:20 AM, David Hastings wrote:

Agreed, and since it takes three times the space is part of the reason it
takes so long, so that 190gb index ends up writing another 380 gb until it
compresses down and deletes the two left over files.  its a pretty hefty
operation

On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch 
wrote:


Optimize operation is no longer recommended for Solr, as the
background merges got a lot smarter.

It is an extremely expensive operation that can require up to 3-times
amount of disk during the processing.

This is not to say yours is a valid question, which I am leaving to
others to respond.

Regards,
Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 2 March 2017 at 10:04, Caruana, Matthew  wrote:

I’m currently performing an optimise operation on a ~190GB index with
about 4 million documents. The process has been running for hours.

This is surprising, because the machine is an EC2 r4.xlarge with four
cores and 30GB of RAM, 24GB of which is allocated to the JVM.

The load average has been steady at about 1.3. Memory usage is 25% or
less the whole time. iostat reports ~6% util.

What gives?

Running Solr 6.4.1.




Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Erick Erickson
It's _very_ unlikely that optimize will help with OOMs, so that's
very probably a red herring. Likely the document that's causing
the issue is very large or, perhaps, you're using the extracting
processor and it might be a Tika issue, consider doing the Tika
processing outside Solr if so, see:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

But forget optimize for curing this problem ;)

Best,
Erick

On Thu, Mar 2, 2017 at 9:37 AM, Caruana, Matthew  wrote:
> Thank you, these are useful tips.
>
> We were previously working with a 4GB heap and getting OOMs in Solr while 
> updating (probably from the analysers) that would cause the index writer to 
> close with what’s called a “tragic” error in the writer code. Only a hard 
> restart of the service could bring it back. There are certain documents that 
> function like poison and trigger this error every time. Haven’t had time to 
> isolate and create a test case, so throwing RAM at it is a stopgap.
>
> When I do, I’ll file an issue.
>
>> On 2 Mar 2017, at 18:28, Walter Underwood  wrote:
>>
>> 6.4.0 added a lot of metrics to low-level calls. That makes many operations 
>> slow. Go back to 6.3.0 or wait for 6.4.2.
>>
>> Meanwhile, stop running optimize. You almost certainly don’t need it.
>>
>> 24 GB is a huge heap. Do you really need that? We run a 15 million doc index 
>> with an 8 GB heap (Java 8u121, G1 collector). I recommend a smaller heap so 
>> the OS can use that RAM to cache file buffers.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Mar 2, 2017, at 7:04 AM, Caruana, Matthew  wrote:
>>>
>>> I’m currently performing an optimise operation on a ~190GB index with about 
>>> 4 million documents. The process has been running for hours.
>>>
>>> This is surprising, because the machine is an EC2 r4.xlarge with four cores 
>>> and 30GB of RAM, 24GB of which is allocated to the JVM.
>>>
>>> The load average has been steady at about 1.3. Memory usage is 25% or less 
>>> the whole time. iostat reports ~6% util.
>>>
>>> What gives?
>>>
>>> Running Solr 6.4.1.
>>
>


Re: Excessive Wire logging while indexing.

2017-03-02 Thread Shawn Heisey
On 3/1/2017 6:59 PM, Phil Scadden wrote:
> Exceptions never triggered but metadata was essentially empty except
> for contentType, and content was always an empty string. I don’t know
> what the parser was doing, but I gave up and went with the extract handler
> route instead, which did at least build a full index.

With the extracting request handler, Tika is running inside Solr.  The
handler code is customized for Solr's needs, so it usually works.  When
Tika has one of its well-known issues though, the entire JVM (which
includes Solr) suffers as well.

I do not know how to write Tika code, but this blog post covers an
example program that uses Tika with SolrJ, so the processing is outside
Solr:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

It also uses a database, but it should be relatively easy to remove that.

Thanks,
Shawn



Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Caruana, Matthew
Thank you, these are useful tips.

We were previously working with a 4GB heap and getting OOMs in Solr while 
updating (probably from the analysers) that would cause the index writer to 
close with what’s called a “tragic” error in the writer code. Only a hard 
restart of the service could bring it back. There are certain documents that 
function like poison and trigger this error every time. Haven’t had time to 
isolate and create a test case, so throwing RAM at it is a stopgap.

When I do, I’ll file an issue.

> On 2 Mar 2017, at 18:28, Walter Underwood  wrote:
> 
> 6.4.0 added a lot of metrics to low-level calls. That makes many operations 
> slow. Go back to 6.3.0 or wait for 6.4.2.
> 
> Meanwhile, stop running optimize. You almost certainly don’t need it.
> 
> 24 GB is a huge heap. Do you really need that? We run a 15 million doc index 
> with an 8 GB heap (Java 8u121, G1 collector). I recommend a smaller heap so 
> the OS can use that RAM to cache file buffers.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Mar 2, 2017, at 7:04 AM, Caruana, Matthew  wrote:
>> 
>> I’m currently performing an optimise operation on a ~190GB index with about 
>> 4 million documents. The process has been running for hours.
>> 
>> This is surprising, because the machine is an EC2 r4.xlarge with four cores 
>> and 30GB of RAM, 24GB of which is allocated to the JVM.
>> 
>> The load average has been steady at about 1.3. Memory usage is 25% or less 
>> the whole time. iostat reports ~6% util.
>> 
>> What gives?
>> 
>> Running Solr 6.4.1.
> 



Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Walter Underwood
6.4.0 added a lot of metrics to low-level calls. That makes many operations 
slow. Go back to 6.3.0 or wait for 6.4.2.

Meanwhile, stop running optimize. You almost certainly don’t need it.

24 GB is a huge heap. Do you really need that? We run a 15 million doc index 
with an 8 GB heap (Java 8u121, G1 collector). I recommend a smaller heap so the 
OS can use that RAM to cache file buffers.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 2, 2017, at 7:04 AM, Caruana, Matthew  wrote:
> 
> I’m currently performing an optimise operation on a ~190GB index with about 4 
> million documents. The process has been running for hours.
> 
> This is surprising, because the machine is an EC2 r4.xlarge with four cores 
> and 30GB of RAM, 24GB of which is allocated to the JVM.
> 
> The load average has been steady at about 1.3. Memory usage is 25% or less 
> the whole time. iostat reports ~6% util.
> 
> What gives?
> 
> Running Solr 6.4.1.



Re: Distinguish exact match from wildcard match

2017-03-02 Thread Ahmet Arslan
Hi,

how about q=code_text:bolt*=code_text:bolt

Ahmet

On Thursday, March 2, 2017 4:41 PM, Сергей Твердохлеб  
wrote:



Hi,

is there a way to separate exact match from wildcard match in solr response?
e.g. there are two documents: {code_text:bolt} and {code_text:bolter}. When
I search for "bolt" I want to get both results, but somehow grouped, so I
can determine whether it was found with an exact or a non-exact match.

Thanks.

-- 
Regards,
Sergey Tverdokhleb


Solr 5.3.1: child query must only match non-parent docs

2017-03-02 Thread Kelly, Frank
Our customers are running this query where they have a filter on the parent 
objects (givenName, familyName etc) and then request the child objects 
({!parent which etc)

q=+(givenName:(+UserSearchControllerUTFN +1180460672*) 
familyName:(+UserSearchControllerUTFN +1180460672*)) +{!parent 
which="type:userAccount”}hereRealm:Test

We get the following error from Solr/Lucene

java.lang.IllegalStateException: child query must only match non-parent docs, 
but parent docID=2147483647 matched childScorer=class 
org.apache.lucene.search.TermScorer
at 
org.apache.lucene.search.join.ToParentBlockJoinQuery$BlockJoinScorer.nextDoc(ToParentBlockJoinQuery.java:311)
at 
org.apache.lucene.search.join.ToParentBlockJoinQuery$BlockJoinScorer.advance(ToParentBlockJoinQuery.java:384)
at 
org.apache.lucene.search.ConjunctionDISI.doNext(ConjunctionDISI.java:118)
at 
org.apache.lucene.search.ConjunctionDISI.nextDoc(ConjunctionDISI.java:151)
at 
org.apache.lucene.search.ConjunctionScorer.nextDoc(ConjunctionScorer.java:62)
at 
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:216)
at 
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:169)
at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:772)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:486)
at 
org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:200)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1678)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1497)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:555)


Any thoughts?

A previous email 
http://lucene.472066.n3.nabble.com/ToParentBlockJoinQuery-java-td4247115.html 
suggested that we might split the query into a query and a filter query

q=+{!parent which="type:userAccount”}hereRealm:Test
fq=+(givenName:(+UserSearchControllerUTFN +1180460672*) 
familyName:(+UserSearchControllerUTFN +1180460672*))

Is this the same problem?

-Frank





Frank Kelly

Principal Software Engineer

Identity Profile Team (SCBE, Traces, CDA)


HERE

5 Wayside Rd, Burlington, MA 01803, USA

42° 29' 7" N 71° 11' 32" W



Re: Boolean expression for spatial query

2017-03-02 Thread David Smiley
I recommend the MULTIPOINT approach.

BTW if you go the route of multiple OR'ed sub-clauses, I recommend avoiding
the _query_ syntax which predates Solr 4.x's (4.2?) ability to embed fully
the sub-clauses more naturally; though you need to beware of the gotcha of
needing to add a leading space.  If Solr had this feature from the start
then that _query_ hack never would have been added.  For example:
fq=   {!field f=regionGeometry v="Intersects(POINT(x1, y1))"} OR
{!field f=regionGeometry v="Intersects(POINT(x2, y2))"}

Anyway, MULTIPOINT is probably going to be much faster, plus it's more
intuitive to understand.  And avoid the "Contains" predicate when point
data is involved, as it's slower yet semantically equivalent to "Intersects"
(for a single non-multi point, anyway).
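
Concretely, the recommended form (per Emir's suggestion quoted below, with
placeholder coordinates) would look like:

    fq={!field f=regionGeometry}Intersects(MULTIPOINT((x1 y1), (x2 y2)))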

On Mon, Feb 27, 2017 at 4:12 AM Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Michael,
>
> I haven't been playing with spatial for a while, but if it fully
> supports WKT, you could use Intersects instead of Contains and
> MULTIPOINT instead of POINT. Something like:
>
> fq={!field f=regionGeometry}Intersects(MULTIPOINT((x1 y1), (x2 y2)))
>
> In any case you can use OR-ed _query_:
>
> fq=_query_:"{!field f=regionGeometry}Contains(POINT(x1, y1))" OR
> _query_:"{!field f=regionGeometry}Contains(POINT(x2, y2))"
>
>
> HTH
> Emir
>
>
> On 26.02.2017 07:08, Michael Dürr wrote:
> > Hi all,
> >
> > I index documents containing a spatial field (rpt) that holds a wkt
> > multipolygon. In order to retrieve all documents for which a certain point
> > is contained within a polygon I issue the following query:
> >
> > q=*:*&fq={!field f=regionGeometry}Contains(POINT(<x> <y>))
> >
> > This works pretty good.
> >
> > My question is: Is there any syntax to issue this query for multiple points
> > (i.e. return all documents for which at least one of the points is within
> > the document's polygon)?
> >
> > E.g. something like this:
> >
> > q=*:*&fq={!field f=regionGeometry}ContainsOR(POINT(<x> <y>), POINT(<x> <y>), ...)
> >
> > If not - what other efficient options do you suggest to do such a query?
> >
> > Best regards,
> > Michael
> >
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
> --
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Otis Gospodnetić
Hi,

It's simply expensive.  You are rewriting your whole index.

Why are you running optimize?  Are you seeing performance problems you are
trying to fix with optimize?

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Thu, Mar 2, 2017 at 10:39 AM, Caruana, Matthew  wrote:

> Thank you. The question remains however, if this is such a hefty operation
> then why is it walking to the destination instead of running, so to speak?
>
> Is the process throttled in some way?
>
> > On 2 Mar 2017, at 16:20, David Hastings 
> wrote:
> >
> > Agreed, and that it takes three times the space is part of the reason it
> > takes so long: that 190GB index ends up writing another 380GB until it
> > compresses down and deletes the two left-over files.  It's a pretty hefty
> > operation.
> >
> > On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> >> Optimize operation is no longer recommended for Solr, as the
> >> background merges got a lot smarter.
> >>
> >> It is an extremely expensive operation that can require up to 3-times
> >> amount of disk during the processing.
> >>
> >> This is not to say yours is not a valid question, which I am leaving to
> >> others to respond to.
> >>
> >> Regards,
> >>   Alex.
> >> 
> >> http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
> >>
> >>
> >> On 2 March 2017 at 10:04, Caruana, Matthew  wrote:
> >>> I’m currently performing an optimise operation on a ~190GB index with
> >> about 4 million documents. The process has been running for hours.
> >>>
> >>> This is surprising, because the machine is an EC2 r4.xlarge with four
> >> cores and 30GB of RAM, 24GB of which is allocated to the JVM.
> >>>
> >>> The load average has been steady at about 1.3. Memory usage is 25% or
> >> less the whole time. iostat reports ~6% util.
> >>>
> >>> What gives?
> >>>
> >>> Running Solr 6.4.1.
> >>
>
>


Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Caruana, Matthew
Thank you. The question remains however, if this is such a hefty operation then 
why is it walking to the destination instead of running, so to speak?

Is the process throttled in some way?
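
For reference, merge concurrency can be tuned in solrconfig.xml's
<indexConfig> block - a sketch with illustrative values, not a claim about
why this particular merge is slow:

    <indexConfig>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxMergeCount">6</int>
        <int name="maxThreadCount">3</int>
      </mergeScheduler>
    </indexConfig>

ConcurrentMergeScheduler also applies its own I/O throttling heuristics,
which may be part of why iostat stays low during a forced merge.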

> On 2 Mar 2017, at 16:20, David Hastings  wrote:
> 
> Agreed, and that it takes three times the space is part of the reason it
> takes so long: that 190GB index ends up writing another 380GB until it
> compresses down and deletes the two left-over files.  It's a pretty hefty
> operation.
> 
> On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch 
> wrote:
> 
>> Optimize operation is no longer recommended for Solr, as the
>> background merges got a lot smarter.
>> 
>> It is an extremely expensive operation that can require up to 3-times
>> amount of disk during the processing.
>> 
>> This is not to say yours is not a valid question, which I am leaving to
>> others to respond to.
>> 
>> Regards,
>>   Alex.
>> 
>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>> 
>> 
>> On 2 March 2017 at 10:04, Caruana, Matthew  wrote:
>>> I’m currently performing an optimise operation on a ~190GB index with
>> about 4 million documents. The process has been running for hours.
>>> 
>>> This is surprising, because the machine is an EC2 r4.xlarge with four
>> cores and 30GB of RAM, 24GB of which is allocated to the JVM.
>>> 
>>> The load average has been steady at about 1.3. Memory usage is 25% or
>> less the whole time. iostat reports ~6% util.
>>> 
>>> What gives?
>>> 
>>> Running Solr 6.4.1.
>> 



Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Otis Gospodnetić
Hi Matthew,

I'm guessing it's the EBS.  With EBS we've seen:
* cpu.system going up in some kernels
* low read/write speeds and maxed out IO at times

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Thu, Mar 2, 2017 at 10:04 AM, Caruana, Matthew  wrote:

> I’m currently performing an optimise operation on a ~190GB index with
> about 4 million documents. The process has been running for hours.
>
> This is surprising, because the machine is an EC2 r4.xlarge with four
> cores and 30GB of RAM, 24GB of which is allocated to the JVM.
>
> The load average has been steady at about 1.3. Memory usage is 25% or less
> the whole time. iostat reports ~6% util.
>
> What gives?
>
> Running Solr 6.4.1.


Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread David Hastings
Agreed, and that it takes three times the space is part of the reason it
takes so long: that 190GB index ends up writing another 380GB until it
compresses down and deletes the two left-over files.  It's a pretty hefty
operation.

On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch 
wrote:

> Optimize operation is no longer recommended for Solr, as the
> background merges got a lot smarter.
>
> It is an extremely expensive operation that can require up to 3-times
> amount of disk during the processing.
>
> This is not to say yours is a valid question, which I am leaving to
> others to respond.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 2 March 2017 at 10:04, Caruana, Matthew  wrote:
> > I’m currently performing an optimise operation on a ~190GB index with
> about 4 million documents. The process has been running for hours.
> >
> > This is surprising, because the machine is an EC2 r4.xlarge with four
> cores and 30GB of RAM, 24GB of which is allocated to the JVM.
> >
> > The load average has been steady at about 1.3. Memory usage is 25% or
> less the whole time. iostat reports ~6% util.
> >
> > What gives?
> >
> > Running Solr 6.4.1.
>


Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Alexandre Rafalovitch
Optimize operation is no longer recommended for Solr, as the
background merges got a lot smarter.

It is an extremely expensive operation that can require up to 3-times
amount of disk during the processing.

This is not to say yours is not a valid question, which I am leaving to
others to respond to.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 2 March 2017 at 10:04, Caruana, Matthew  wrote:
> I’m currently performing an optimise operation on a ~190GB index with about 4 
> million documents. The process has been running for hours.
>
> This is surprising, because the machine is an EC2 r4.xlarge with four cores 
> and 30GB of RAM, 24GB of which is allocated to the JVM.
>
> The load average has been steady at about 1.3. Memory usage is 25% or less 
> the whole time. iostat reports ~6% util.
>
> What gives?
>
> Running Solr 6.4.1.


What is the bottleneck for an optimise operation?

2017-03-02 Thread Caruana, Matthew
I’m currently performing an optimise operation on a ~190GB index with about 4 
million documents. The process has been running for hours.

This is surprising, because the machine is an EC2 r4.xlarge with four cores and 
30GB of RAM, 24GB of which is allocated to the JVM.

The load average has been steady at about 1.3. Memory usage is 25% or less the 
whole time. iostat reports ~6% util.

What gives?

Running Solr 6.4.1.

Re: Distinguish exact match from wildcard match

2017-03-02 Thread Emir Arnautovic
Again, depending on your case, you can use functions in fl to return an
additional indicator of whether a doc is an exact match or not:


q=code_text:bolt OR whatever&fl=*,isExact:tf('code_text_exact', 'bolt')

It will return an isExact field with a value >0 for any doc that has the term
'bolt' in the code_text_exact field. Note that I used a different field to
make sure the term 'bolt' is not in the field when a document only has
'bolter' because of the analysis chain.
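
With that fl, each returned document carries the indicator, e.g.
(illustrative values only):

    { "code_text": "bolt",   "isExact": 1.0 }
    { "code_text": "bolter", "isExact": 0.0 }

so the client can split the result set into the two groups without a second
query.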


Regards,
Emir

On 02.03.2017 15:51, Alexandre Rafalovitch wrote:

You could still use scoring with distinct bands of values and include the
score field to see the assigned score. Then, on the client, you do
rough grouping.

You could try looking at highlighting, but that's probably
computationally irrational for this purpose.

You could try enabling debugging and see if information present in
there is sufficient.

All of these imply client-side post-processing.

Any other option would be possible only if that keyword was the only
content of the field. Then you could group or facet on it or
something. But not if it is one word of many.

Otherwise, you just do two queries and sort/merge yourself.

Regards,
Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 2 March 2017 at 09:09, Сергей Твердохлеб  wrote:

Hi Emir,

Thanks for your answer.
However in my case I really need to separate results, because I need to
treat those resultsets differently.

Thanks.

2017-03-02 15:57 GMT+02:00 Emir Arnautovic :


Hi Sergei,

Usually you don't want to know which is which, but you do want to have
exact matches first. In the case of simple queries, and depending on your
use case, you can use the score to make a distinction. If "bolter" matches "bolt"
because of some filters, you will need to index it in two fields and boost
the fields differently to get a different score for different matches:

code_text_exact:bolt^1 OR code_text:bolt

If you want to use wildcards, you can use a similar approach:

code_text:bolt^1 OR code_text:bolt*

HTH,
Emir


On 02.03.2017 14:41, Сергей Твердохлеб wrote:


Hi,

is there a way to separate exact match from wildcard match in solr response?
e.g. there are two documents: {code_text:bolt} and {code_text:bolter}. When
I search for "bolt" I want to get both results, but somehow grouped, so I
can determine whether it was found with an exact or a non-exact match.

Thanks.



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Regards,
Sergey Tverdokhleb


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Arabic words search in solr

2017-03-02 Thread Steve Rowe
Hi Mohan,

> On Feb 26, 2017, at 1:37 AM, mohanmca01  wrote:
> 
> I searched with (bizNameAr: شرطة ازكي), and am getting:
> […]
> 
> the expected result is:   "id": "82",
>  "bizNameAr": "شرطة عمان السلطانية - قيادة
> شرطة محافظة الداخلية - - مركز *شرطة إزكي*",
> 
> as the above has both the words mentioned in the query (marked as Bold),
> whereas the rest have the following:
> 
>"id": "63",
>"bizNameAr": "شركة ظفار للتأمين ش.م.ع.ع - فرع ازكي"
> 
> it has only one word of the query (ازكي)
> 
>"id": "56",
>"bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية 
> -  - مركز شرطة إبراء"
> 
> it has only one word of the query (شرطة)
> 
> "id": "79",
> "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز
> شرطة إبراء"
> 
> It has only one word of the query (شرطة)
> 
> where the above 3 records should not come in the result, since the query
> mentions two words and only one record has both of them.

Solr's standard query language includes two mechanisms for requiring terms: ‘+’
before a required term, and ‘AND’ between two required terms.  ‘+’ is better -
see the standard query parser documentation for more information.

You can also set the default operator to ‘AND’, e.g. via the request parameter
“q.op=AND” (if this is always what you want, you can include this in the
/select request handler’s definition in solrconfig.xml).  See the Solr
Reference Guide for more information.
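
For example, with the bizNameAr field from earlier in this thread (a sketch;
the analysis chain still decides what counts as a term match):

    q=bizNameAr:(+شرطة +ازكي)

or, equivalently, q=bizNameAr:(شرطة ازكي) together with q.op=AND. Either way,
only documents containing both words should be returned.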

> I would really suggest if we can give you a real-time demo on our system
> with my Arab colleague so it can be more clear for you. let us know if we
> can do that.

I prefer to keep discussion on this public mailing list so that others can 
benefit.  If you find that you need faster or more interactive help, you can 
check out the list of people who have indicated that they provide Solr support.

--
Steve
www.lucidworks.com



Does {!child} query support nested Queries ("v=")

2017-03-02 Thread Kelly, Frank
This is Solr Cloud 5.3.1

I have a query like the following
q={!child of="type:userAccount" v="givenName:test*”}

Intent: Show me all children of the type:userAccount where 
userAccount.givenName:test*

If I run the query multiple times I get a noticeably different numFound each
time: 186,560 to 187,412 (+/- 500).

If I run the “normal” query on just the  parents
q=type:userAccount givenName:test*

I get a very stable numFound

Reading the docs, it’s not documented as supported, but neither do I get an error:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers

Am I using nestedQueries correctly?

-Frank







Frank Kelly

Principal Software Engineer

Identity Profile Team (SCBE, Traces, CDA)


HERE

5 Wayside Rd, Burlington, MA 01803, USA

42° 29' 7" N 71° 11' 32" W



Re: Distinguish exact match from wildcard match

2017-03-02 Thread Alexandre Rafalovitch
You could still use scoring with distinct bands of values and include the
score field to see the assigned score. Then, on the client, you do
rough grouping.

You could try looking at highlighting, but that's probably
computationally irrational for this purpose.

You could try enabling debugging and see if information present in
there is sufficient.

All of these imply client-side post-processing.

Any other option would be possible only if that keyword was the only
content of the field. Then you could group or facet on it or
something. But not if it is one word of many.

Otherwise, you just do two queries and sort/merge yourself.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 2 March 2017 at 09:09, Сергей Твердохлеб  wrote:
> Hi Emir,
>
> Thanks for your answer.
> However in my case I really need to separate results, because I need to
> treat those resultsets differently.
>
> Thanks.
>
> 2017-03-02 15:57 GMT+02:00 Emir Arnautovic :
>
>> Hi Sergei,
>>
>> Usually you don't want to know which is which, but you do want to have
>> exact matches first. In the case of simple queries, and depending on your
>> use case, you can use the score to make a distinction. If "bolter" matches "bolt"
>> because of some filters, you will need to index it in two fields and boost
>> the fields differently to get a different score for different matches:
>>
>> code_text_exact:bolt^1 OR code_text:bolt
>>
>> If you want to use wildcards, you can use a similar approach:
>>
>> code_text:bolt^1 OR code_text:bolt*
>>
>> HTH,
>> Emir
>>
>>
>> On 02.03.2017 14:41, Сергей Твердохлеб wrote:
>>
>>> Hi,
>>>
>>> is there a way to separate exact match from wildcard match in solr response?
>>> e.g. there are two documents: {code_text:bolt} and {code_text:bolter}. When
>>> I search for "bolt" I want to get both results, but somehow grouped, so I
>>> can determine whether it was found with an exact or a non-exact match.
>>>
>>> Thanks.
>>>
>>>
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>
>
> --
> Regards,
> Sergey Tverdokhleb


Re: minimal solrconfig example

2017-03-02 Thread Alexandre Rafalovitch
If you liked my minimal config, you may also appreciate the last
presentation I did at the Lucene/Solr Revolution on deconstructing the
examples.

The slides are 
https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016
(the video is embedded at the end, slide 64).

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 2 March 2017 at 04:09, David Michael Gang  wrote:
> Thanks Charly. This is what I looked for.
>
> On Thu, Mar 2, 2017 at 11:07 AM David Michael Gang 
> wrote:
>
> I use the latest version. Solr 6.4.1
>
>
> On Thu, Mar 2, 2017 at 9:15 AM Aravind Durvasula 
> wrote:
>
> Hi David,
>
> What is the solr version you are using?
> To get started, it's better to use the config file that comes out of the
> box.
>
> Thanks,
> Aravind
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/minimal-solrconfig-example-tp4322977p4322978.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Distinguish exact match from wildcard match

2017-03-02 Thread Сергей Твердохлеб
Hi Emir,

Thanks for your answer.
However in my case I really need to separate results, because I need to
treat those resultsets differently.

Thanks.

2017-03-02 15:57 GMT+02:00 Emir Arnautovic :

> Hi Sergei,
>
> Usually you don't want to know which is which, but you do want to have
> exact matches first. In the case of simple queries, and depending on your
> use case, you can use the score to make a distinction. If "bolter" matches "bolt"
> because of some filters, you will need to index it in two fields and boost
> the fields differently to get a different score for different matches:
>
> code_text_exact:bolt^1 OR code_text:bolt
>
> If you want to use wildcards, you can use a similar approach:
>
> code_text:bolt^1 OR code_text:bolt*
>
> HTH,
> Emir
>
>
> On 02.03.2017 14:41, Сергей Твердохлеб wrote:
>
>> Hi,
>>
>> is there a way to separate exact match from wildcard match in solr response?
>> e.g. there are two documents: {code_text:bolt} and {code_text:bolter}. When
>> I search for "bolt" I want to get both results, but somehow grouped, so I
>> can determine whether it was found with an exact or a non-exact match.
>>
>> Thanks.
>>
>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


-- 
Regards,
Sergey Tverdokhleb


Re: Distinguish exact match from wildcard match

2017-03-02 Thread Emir Arnautovic

Hi Sergei,

Usually you don't want to know which is which, but you do want to have
exact matches first. In the case of simple queries, and depending on your
use case, you can use the score to make a distinction. If "bolter" matches
"bolt" because of some filters, you will need to index it in two fields
and boost the fields differently to get a different score for different matches:


code_text_exact:bolt^1 OR code_text:bolt

If you want to use wildcards, you can use a similar approach:

code_text:bolt^1 OR code_text:bolt*

HTH,
Emir


On 02.03.2017 14:41, Сергей Твердохлеб wrote:

Hi,

is there a way to separate exact match from wildcard match in solr response?
e.g. there are two documents: {code_text:bolt} and {code_text:bolter}. When
I search for "bolt" I want to get both results, but somehow grouped, so I
can determine whether it was found with an exact or a non-exact match.

Thanks.



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Distinguish exact match from wildcard match

2017-03-02 Thread Сергей Твердохлеб
Hi,

is there a way to separate exact match from wildcard match in solr response?
e.g. there are two documents: {code_text:bolt} and {code_text:bolter}. When
I search for "bolt" I want to get both results, but somehow grouped, so I
can determine whether it was found with an exact or a non-exact match.

Thanks.

-- 
Regards,
Sergey Tverdokhleb


messages in gc log not connected to gcs in indexing time

2017-03-02 Thread David Michael Gang
Hi all,

When indexing data I get messages like the following in the GC log:

2017-03-02T10:43:17.872+: 1088.957: Total time for which application
threads were stopped: 0.0002071 seconds, Stopping threads took: 0.888
seconds
2017-03-02T10:43:17.885+: 1088.970: Total time for which application
threads were stopped: 0.0005521 seconds, Stopping threads took: 0.0003334
seconds
2017-03-02T10:43:18.886+: 1089.970: Total time for which application
threads were stopped: 0.0003282 seconds, Stopping threads took: 0.0001287
seconds
2017-03-02T10:43:19.887+: 1090.972: Total time for which application
threads were stopped: 0.0012858 seconds, Stopping threads took: 0.0010998
seconds
2017-03-02T10:43:20.204+: 1091.289: Total time for which application
threads were stopped: 0.0005383 seconds, Stopping threads took: 0.0002524
seconds
2017-03-02T10:43:20.209+: 1091.294: Total time for which application
threads were stopped: 0.0045521 seconds, Stopping threads took: 0.0044508
seconds
2017-03-02T10:43:20.210+: 1091.295: Total time for which application
threads were stopped: 0.0005231 seconds, Stopping threads took: 0.0002992
seconds
2017-03-02T10:43:20.211+: 1091.295: Total time for which application
threads were stopped: 0.0003368 seconds, Stopping threads took: 0.0002406
seconds
2017-03-02T10:43:20.211+: 1091.296: Total time for which application
threads were stopped: 0.0003751 seconds, Stopping threads took: 0.0002424
seconds
2017-03-02T10:43:20.212+: 1091.297: Total time for which application
threads were stopped: 0.0003600 seconds, Stopping threads took: 0.0002684
seconds
2017-03-02T10:43:20.212+: 1091.297: Total time for which application
threads were stopped: 0.0004536 seconds, Stopping threads took: 0.0003420
seconds
2017-03-02T10:43:20.213+: 1091.298: Total time for which application
threads were stopped: 0.0002759 seconds, Stopping threads took: 0.0001949
seconds
2017-03-02T10:43:20.213+: 1091.298: Total time for which application
threads were stopped: 0.0001114 seconds, Stopping threads took: 0.558
seconds
2017-03-02T10:43:20.214+: 1091.299: Total time for which application
threads were stopped: 0.0004253 seconds, Stopping threads took: 0.0002801
seconds
2017-03-02T10:43:20.214+: 1091.299: Total time for which application
threads were stopped: 0.0002987 seconds, Stopping threads took: 0.0002093
seconds
2017-03-02T10:43:20.675+: 1091.760: Total time for which application
threads were stopped: 0.0004817 seconds, Stopping threads took: 0.0002180
seconds
2017-03-02T10:43:20.676+: 1091.760: Total time for which application
threads were stopped: 0.0003925 seconds, Stopping threads took: 0.0002443
seconds
2017-03-02T10:43:20.676+: 1091.761: Total time for which application
threads were stopped: 0.0002973 seconds, Stopping threads took: 0.0001974
seconds
2017-03-02T10:43:20.704+: 1091.788: Total time for which application
threads were stopped: 0.0002341 seconds, Stopping threads took: 0.771
seconds


They are not connected to GCs, but to safepoints:
http://blog.ragozin.info/2012/10/safepoints-in-hotspot-jvm.html

What are the reasons in Solr for these safepoints? Do I have to take care
about this? If yes, what can I do about it?
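
For what it's worth, the "Total time for which application threads were
stopped" lines come from -XX:+PrintGCApplicationStoppedTime and cover all
safepoint pauses, not only GC. On JDK 8, these flags (worth double-checking
for your exact JVM version) make the JVM name the operation behind each pause:

    -XX:+PrintGCApplicationStoppedTime
    -XX:+PrintSafepointStatistics
    -XX:PrintSafepointStatisticsCount=1

The safepoint statistics list the VM operation (e.g. RevokeBias or Deoptimize)
for each stop, which narrows down whether anything on the Solr side needs
attention.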

Thanks,
David


How to expose new Lucene field type to Solr

2017-03-02 Thread Mike Thomsen
Found this project and I'd like to know what would be involved with
exposing its RestrictedField type through Solr for indexing and querying as
a Solr field type.

https://github.com/roshanp/lucure-core
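
On the Solr side, the usual route is a custom FieldType subclass registered
in the schema. A sketch of the schema wiring only (class, package, and field
names are hypothetical):

    <fieldType name="restricted" class="com.example.solr.RestrictedFieldType"/>
    <field name="secured_body" type="restricted" indexed="true" stored="true"/>

The Java class would extend org.apache.solr.schema.FieldType (or TextField)
and produce lucure-core RestrictedField instances when creating index fields;
those details depend on the lucure-core API, so check the project's examples.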

Thanks,

Mike


Re: Arabic words search in solr

2017-03-02 Thread mohanmca01
Hi Steve,

Any update on this.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4323005.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: bin/solr -a doesn't work?

2017-03-02 Thread Markus Jelsma
Hi - don't bother anymore, it seems to work fine now. I don't know why, but it
kept hanging without an error message.
Thanks,
Markus 
 
-Original message-
> From:Zheng Lin Edwin Yeo 
> Sent: Thursday 2nd March 2017 4:55
> To: solr-user@lucene.apache.org
> Subject: Re: bin/solr -a doesn't work?
> 
> Hi Markus,
> 
> Maybe you can post the script or error message here, so we can have a
> better understanding of the situation.
> 
> Regards,
> Edwin
> 
> 
> On 1 March 2017 at 19:53, Markus Jelsma  wrote:
> 
> > Hello,
> >
> > Because we upload large files to Zookeeper, I tried:
> >
> > bin/solr restart -c -m 1500m  -a "-Djute.maxbuffer=0xF2"
> >
> > But the script keeps hanging, and no Solr is started. The -a parameter
> > doesn't seem to work. I am missing something very obvious?
> >
> > Thanks,
> > Markus
> >
> 


Re: minimal solrconfig example

2017-03-02 Thread David Michael Gang
Thanks Charly. This is what I looked for.

On Thu, Mar 2, 2017 at 11:07 AM David Michael Gang 
wrote:

I use the latest version. Solr 6.4.1


On Thu, Mar 2, 2017 at 9:15 AM Aravind Durvasula 
wrote:

Hi David,

What is the solr version you are using?
To get started, it's better to use the config file that comes out of the
box.

Thanks,
Aravind



--
View this message in context:
http://lucene.472066.n3.nabble.com/minimal-solrconfig-example-tp4322977p4322978.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: minimal solrconfig example

2017-03-02 Thread David Michael Gang
I use the latest version. Solr 6.4.1

On Thu, Mar 2, 2017 at 9:15 AM Aravind Durvasula 
wrote:

> Hi David,
>
> What is the solr version you are using?
> To get started, it's better to use the config file that comes out of the
> box.
>
> Thanks,
> Aravind
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/minimal-solrconfig-example-tp4322977p4322978.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: OR condition between !frange and normal query

2017-03-02 Thread Emir Arnautovic

Hi Edwin,

You can use subqueries:

q=_query_:"({!frange l=1}ms(startDate_dt,endDate_dt)" OR 
_query_:"startDate:[2000-01-01T00:00:00Z TO *] AND endDate:[2016-12-31T23:59:59Z]"

HTH,
Emir


On 02.03.2017 04:51, Zheng Lin Edwin Yeo wrote:

Hi,

Would like to check, how can we do an OR condition between !frange and
normal query?

For example, I want to have the following condition in my query:

({!frange l=1}ms(startDate_dt,endDate_dt) OR
(startDate:[2000-01-01T00:00:00Z TO *] AND endDate:[2016-12-31T23:59:59Z]))

How can we put it in the Solr query URL for Solr to recognize this
condition?

I'm using Solr 6.4.1

Thank you.

Regards,
Edwin



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: minimal solrconfig example

2017-03-02 Thread Charlie Hull

On 02/03/2017 06:58, David Michael Gang wrote:

Hi all,

I want to create my first solr collection
I found an example of solrconfig here.
https://github.com/apache/lucene-solr/blob/master/solr/example/files/conf/solrconfig.xml
This is a file of more than a thousand lines.
As I understand it, this file shows all the possible configurations.
What I miss is the most minimal file.
Where can I find a minimal solrconfig.xml file with just the required
options?

Thanks,
David

We worked on the idea of a minimal Solr config at our London Lucene 
Hackday last year (see item 2): 
https://github.com/flaxsearch/london-hackday-2016


As I recall we managed to get Solr going with a config file that was 
shorter than the Apache license statement!


I'm generally of the opinion that it would be better to learn how Solr 
works by adding things to a minimal config file, rather than the usual 
method of hacking large chunks of it out and seeing what breaks what.
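
For illustration, a minimal solrconfig.xml in that spirit can be very short
(a sketch, assuming Solr 6.x; everything omitted falls back to built-in
defaults or implicitly registered handlers):

    <?xml version="1.0" encoding="UTF-8"?>
    <config>
      <luceneMatchVersion>6.4.1</luceneMatchVersion>
      <requestHandler name="/select" class="solr.SearchHandler"/>
    </config>

From there you add caches, update processors and the rest only as you need
them.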


Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk