Re: SolrJ and autoscaling

2018-06-07 Thread Shalin Shekhar Mangar
Yes, we don't have SolrJ support for changing the autoscaling configuration
today. It'd be nice to have, for sure. Can you please file a JIRA? Patches
are welcome too!
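
In the meantime, a hedged sketch of a possible workaround: SolrJ's generic
V2Request can POST to the autoscaling endpoint even though there is no typed
API for it. The path and the JSON command payload below are assumptions taken
from the autoscaling REST docs; the policy itself is only an example.

    // Hedged sketch: write the autoscaling config via the generic v2 API,
    // since SolrJ has no dedicated method for this yet.
    V2Request req = new V2Request.Builder("/cluster/autoscaling")
        .withMethod(SolrRequest.METHOD.POST)
        .withPayload("{\"set-cluster-policy\": ["
            + "{\"replica\": \"<2\", \"shard\": \"#EACH\", \"node\": \"#ANY\"}]}")
        .build();
    req.process(cloudSolrClient);  // an existing CloudSolrClient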

On Wed, Jun 6, 2018 at 8:33 PM, Hendrik Haddorp 
wrote:

> Hi,
>
> I'm trying to read and modify the autoscaling config. The API on
> https://lucene.apache.org/solr/guide/7_3/solrcloud-autoscaling-api.html
> only mentions the REST API. The read part does, however, also work via
> SolrJ:
>
> cloudSolrClient.getZkStateReader().getAutoScalingConfig()
>
> Just for the write part I could not find anything in the API. Is this
> still a gap?
>
> regards,
> Hendrik
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Setting preferred replica for query/read

2018-06-07 Thread Shawn Heisey

On 6/7/2018 9:17 PM, Zheng Lin Edwin Yeo wrote:

Thanks for your reply.

As we are currently looking at having one replica do the indexing and another
replica be used for searching, these 2 requests look like they can achieve
this purpose.

Will this be implemented in the Solr 7.4 release?


SOLR-11982 will be in version 7.4.0.

Having redundancy for indexing requires either two NRT replicas or two 
TLOG replicas per shard.  You could do one of each, but I think it's 
better to have them both the same type.


Then you can set up one PULL replica (or more if you wish), and use the 
SOLR-11982 feature to indicate that you want to prefer PULL replicas. 
You won't lose redundancy if there's only one PULL replica and it goes 
down. In that situation, the NRT or TLOG replicas will be used.
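
As an illustration, a hedged SolrJ sketch of using the SOLR-11982 parameter
once you're on 7.4 (the collection name is hypothetical):

    // Hedged sketch: route queries to PULL replicas when available (Solr 7.4+).
    SolrQuery query = new SolrQuery("*:*");
    query.set("shards.preference", "replica.type:PULL");
    QueryResponse rsp = cloudSolrClient.query("mycollection", query);

The same parameter can instead be set as a default parameter in solrconfig.xml
so clients don't have to send it.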


Thanks,
Shawn



Re: Setting preferred replica for query/read

2018-06-07 Thread Zheng Lin Edwin Yeo
Hi Ere,

Thanks for your reply.

As we are currently looking at having one replica do the indexing and another
replica be used for searching, these 2 requests look like they can achieve
this purpose.

Will this be implemented in the Solr 7.4 release?

Regards,
Edwin


On 7 June 2018 at 16:00, Ere Maijala  wrote:

> Hi,
>
> What I did in SOLR-11982 was meant to be used with replica types. The idea
> is that you could have a set of NRT replicas used for indexing and a set of
> PULL replicas used for queries. That's the easiest way to split the work
> since PULL replicas never do indexing work, and then you can say in the
> queries that "shards.preference=replica.type:PULL" or have that as a
> default parameter in solrconfig. SOLR-8146 is not needed for this. I
> suppose now that SOLR-11982 is done, SOLR-8146 would only be needed to make
> it easier to set the preferred replica type etc.
>
> SOLR-11982 also allows you to use replica location in node preference. The
> nodes to use could be deduced from the cluster state and then you could use
> shards.preference with replica.location. But that means the client has to
> know which replicas to prefer.
>
> Regards,
> Ere
>
>
> Zheng Lin Edwin Yeo kirjoitti 4.6.2018 klo 19.09:
>
>> Hi,
>>
>> SOLR-8146 has not been updated since January last year, but I have just
>> commented on it.
>>
>> So we need both to be updated in order to achieve the full functionality
>> of
>> setting preferred replica for query/read? Currently, is there a way to
>> achieve this by other means?
>>
>> Regards,
>> Edwin
>>
>> On 4 June 2018 at 19:43, Ere Maijala  wrote:
>>
>> Hi,
>>>
>>> Well, SOLR-11982 adds server-side support for part of what SOLR-8146 aims
>>> to do (shards.preference=replica.location:[something]). It doesn't do
>>> regular expressions or snitches at the moment, though it would be easy to
>>> add. So, it looks to me like SOLR-8146 would need to be updated in this
>>> regard.
>>>
>>> --Ere
>>>
>>>
>>> Zheng Lin Edwin Yeo kirjoitti 4.6.2018 klo 12.45:
>>>
>>> Hi,

 Are there any similarities between these two requests in the JIRA
 regarding the setting of the preferred-replica function?

 (SOLR-11982) Add support for preferReplicaTypes parameter

 (SOLR-8146) Allowing SolrJ CloudSolrClient to have preferred replica for
 query/read

 I am looking at setting one of the replicas to be the preferred replica
 for query/read, and another replica to be used for indexing.

 I am using Solr 7.3.1 currently.

 Regards,
 Edwin


 --
>>> Ere Maijala
>>> Kansalliskirjasto / The National Library of Finland
>>>
>>>
>>
> --
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland
>


Collections unable to load after setting up SSL

2018-06-07 Thread Zheng Lin Edwin Yeo
Hi,

I am running SolrCloud on Solr 7.3.1 on External ZooKeeper 3.4.11, and I am
setting up the security aspect of Solr.

After setting up SSL based on the steps from
https://lucene.apache.org/solr/guide/7_3/enabling-ssl.html, the collections
that have 2 replicas can no longer be loaded.

What could be causing the issue?

I remember this wasn't a problem when I tried the same thing in Solr 6 and
even Solr 7.1.

Regards,
Edwin


Re: Streaming Expression intersect() behaviour

2018-06-07 Thread Joel Bernstein
This expression works as expected:

intersect(
cartesianProduct(tuple(fieldA=array(a,b,c,c)), fieldA, productSort="fieldA
asc"),
cartesianProduct(tuple(fieldA=array(a,c)), fieldA, productSort="fieldA
asc"),
on="fieldA"
)

And when you transpose the "on" fields like this:

intersect(
 cartesianProduct(tuple(fieldA=array(a,b,c,c)), fieldA, productSort="fieldA
asc"),
 cartesianProduct(tuple(fieldB=array(a,c)), fieldB, productSort="fieldB
asc"),
 on="fieldB=fieldA"
 )

It also works.

So, yes, there is a bug where the fields are being transposed in the
intersect function's "on" fields. The same issue was happening with joins
and may have been resolved there. I'll do a little more research into this.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jun 7, 2018 at 9:29 AM, Christian Spitzlay <
christian.spitz...@biologis.com> wrote:

>
>
> > Am 07.06.2018 um 11:34 schrieb Christian Spitzlay <
> christian.spitz...@biologis.com>:
> >
> > intersect(
> > cartesianProduct(tuple(fieldA=array(a,b,c,c)), fieldA,
> productSort="fieldA asc"),
> > cartesianProduct(tuple(fieldB=array(a,c)), fieldB, productSort="fieldB
> asc"),
> > on="fieldA=fieldB"
> > )
> >
> > I simplified it a bit, too. I still get one document with fieldA == a.
> > I would have expected three documents in the output, one with fieldA ==
> a and two with fieldB == c.
>
> That should have read „… and two with fieldA == c“ of course.
>
>
>
>


Re: Different solr score between stand alone vs cloud mode solr

2018-06-07 Thread Erick Erickson
Wei:

That is odd. These should be the same so I'm puzzled too.

I'm assuming that you're using the exact same schema on both, with each
field having the exact same definitions. And since you say it's the same
release of Solr, it's not like some default changed.

Here's an idea (and I'm shooting in the dark here).

Copy the index from one place to another and see if what you're seeing
is still true. Assuming the schema is the same, you should be able to:
1> shut down all your, say, SolrCloud instances.
2> copy the stand-alone index to each of those instances. Verify that
there is exactly one segment since you said it's optimized.
3> start the SolrCloud instances back up.

Are the scores still different?

Let's claim they're the same. In that case, use the schema from your
stand-alone Solr for SolrCloud, then delete the index and re-index
from scratch.

Best,
Erick

On Thu, Jun 7, 2018 at 2:28 PM, Wei  wrote:
> Thanks Erick. However our indexes on stand alone and cloud are both static
> -- we indexed them from the same source xmls, optimize and have no updates
> after it is done. Also in cloud there is only one single shard( with
> multiple replicas ). I assume distributed stats doesn't have effect in this
> case?
>
> Thanks,
> Wei
>
> On Thu, Jun 7, 2018 at 12:18 PM, Erick Erickson 
> wrote:
>
>> Short form:
>>
>> As docs are updated, they're marked as deleted until the segment is
>> merged. This affects things like term frequency and doc frequency
>> which in turn influences the score.
>>
>> Due to how commits happen, i.e. autocommit will hit at slightly skewed
>> wall-clock time, different segments are merged on different replicas
>> of the same shard. Thus the scores can be slightly different
>>
>> You can turn on distributed stats which will help with this:
>> https://issues.apache.org/jira/browse/SOLR-1632
>>
>> Best,
>> Erick
>>
>> On Thu, Jun 7, 2018 at 12:07 PM, Wei  wrote:
>> > Hi,
>> >
>> > Recently we have an observation that really puzzled us.  We have two
>> > instances of Solr,  one in stand alone mode and one is a single-shard
>> solr
>> > cloud with a couple of replicas.  Both are indexed with the same
>> documents
>> > and have same solr version 6.6.2.  When issue the same query, the solr
>> > score from stand alone and cloud are different.  How could this happen?
>> > With the same data, software version and query,  should solr score be
>> > exactly same regardless of cloud mode or not?
>> >
>> > Thanks,
>> > Wei
>>


Re: Different solr score between stand alone vs cloud mode solr

2018-06-07 Thread Wei
Thanks Erick. However, our indexes on stand-alone and cloud are both static
-- we indexed them from the same source XMLs, optimized, and have had no
updates since. Also, in cloud there is only a single shard (with multiple
replicas). I assume distributed stats doesn't have an effect in this case?

Thanks,
Wei

On Thu, Jun 7, 2018 at 12:18 PM, Erick Erickson 
wrote:

> Short form:
>
> As docs are updated, they're marked as deleted until the segment is
> merged. This affects things like term frequency and doc frequency
> which in turn influences the score.
>
> Due to how commits happen, i.e. autocommit will hit at slightly skewed
> wall-clock time, different segments are merged on different replicas
> of the same shard. Thus the scores can be slightly different
>
> You can turn on distributed stats which will help with this:
> https://issues.apache.org/jira/browse/SOLR-1632
>
> Best,
> Erick
>
> On Thu, Jun 7, 2018 at 12:07 PM, Wei  wrote:
> > Hi,
> >
> > Recently we have an observation that really puzzled us.  We have two
> > instances of Solr,  one in stand alone mode and one is a single-shard
> solr
> > cloud with a couple of replicas.  Both are indexed with the same
> documents
> > and have same solr version 6.6.2.  When issue the same query, the solr
> > score from stand alone and cloud are different.  How could this happen?
> > With the same data, software version and query,  should solr score be
> > exactly same regardless of cloud mode or not?
> >
> > Thanks,
> > Wei
>


RE: Different solr score between stand alone vs cloud mode solr

2018-06-07 Thread Markus Jelsma
To add to that, keep in mind that you need to disable the queryResultCache
or distributed stats won't work.

And to add to that, I do not think distributed stats will work for a
single-shard index anyway.
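
For reference, a hedged sketch of what that looks like: the cache is declared
in solrconfig.xml (the element below uses the stock example values), and
removing or commenting it out disables it.

    <!-- commented out to disable the query result cache
    <queryResultCache class="solr.LRUCache"
                      size="512"
                      initialSize="512"
                      autowarmCount="0"/>
    -->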

Regards,
Markus

 
 
-Original message-
> From:Erick Erickson 
> Sent: Thursday 7th June 2018 21:19
> To: solr-user 
> Subject: Re: Different solr score between stand alone vs cloud mode solr
> 
> Short form:
> 
> As docs are updated, they're marked as deleted until the segment is
> merged. This affects things like term frequency and doc frequency
> which in turn influences the score.
> 
> Due to how commits happen, i.e. autocommit will hit at slightly skewed
> wall-clock time, different segments are merged on different replicas
> of the same shard. Thus the scores can be slightly different
> 
> You can turn on distributed stats which will help with this:
> https://issues.apache.org/jira/browse/SOLR-1632
> 
> Best,
> Erick
> 
> On Thu, Jun 7, 2018 at 12:07 PM, Wei  wrote:
> > Hi,
> >
> > Recently we have an observation that really puzzled us.  We have two
> > instances of Solr,  one in stand alone mode and one is a single-shard solr
> > cloud with a couple of replicas.  Both are indexed with the same documents
> > and have same solr version 6.6.2.  When issue the same query, the solr
> > score from stand alone and cloud are different.  How could this happen?
> > With the same data, software version and query,  should solr score be
> > exactly same regardless of cloud mode or not?
> >
> > Thanks,
> > Wei
> 


Re: Running Solr on HDFS - Disk space

2018-06-07 Thread Hendrik Haddorp
The only option should be to configure Solr to have a replication factor of
just 1, or HDFS to have no replication. I would go for the middle ground and
configure both to use a factor of 2. This way a single failure in HDFS or
Solr is not a problem, whereas with a 1/3 or 3/1 setup a single server error
would bring the collection down.


Setting the HDFS replication factor is a bit tricky, as in some places Solr
takes the default replication factor set on HDFS and sometimes takes a
default from the client side. HDFS allows you to set a replication factor
for every file individually.
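
As a hedged illustration of that last point (the path is hypothetical), the
stock HDFS shell can change the factor of already-written index files:

    # set replication factor 2 for everything under the collection's
    # index directory and wait for re-replication to finish
    hdfs dfs -setrep -w 2 /user/solr/mycollection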


regards,
Hendrik

On 07.06.2018 15:30, Shawn Heisey wrote:

On 6/7/2018 6:41 AM, Greenhorn Techie wrote:

As HDFS has got its own replication mechanism, with an HDFS replication
factor of 3, and then a SolrCloud replication factor of 3, does that mean
each document will probably have around 9 copies replicated underneath in
HDFS? If so, is there a way to configure HDFS or Solr such that only three
copies are maintained overall?


Yes, that is exactly what happens.

SolrCloud replication assumes that each of its replicas is a 
completely independent index.  I am not aware of anything in Solr's 
HDFS support that can use one HDFS index directory for multiple 
replicas.  At the most basic level, a Solr index is a Lucene index.  
Lucene goes to great lengths to make sure that an index *CANNOT* be 
used in more than one place.


Perhaps somebody who is more familiar with HDFSDirectoryFactory can 
offer you a solution.  But as far as I know, there isn't one.


Thanks,
Shawn





Re: Solr for Content Management

2018-06-07 Thread David Hastings
When you send updates you change the segments, which takes them out of
memory, and the index becomes "cold" until it receives enough searches to
cache the various aspects of the index again.
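
One common mitigation, sketched here with the listener syntax from the stock
solrconfig.xml (the query itself is hypothetical): register warming queries
so a new searcher is not completely cold after a commit.

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <!-- hypothetical warming query; use queries typical of your traffic -->
        <lst><str name="q">content:email</str><str name="sort">id asc</str></lst>
      </arr>
    </listener>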

On Thu, Jun 7, 2018 at 2:10 PM, Moenieb Davids 
wrote:

> Hi All,
>
> Background:
> I am currently testing a deployment of a content management framework where
> I am trying to punt Solr as the tool of choice for ingestion and searching.
>
> Current status:
> I have deployed SolrCloud across multiple servers with multiple shards and
> a replication factor of 2.
> In terms of collections, I have a person collection that contains details
> of individuals, including address and high-level portfolio info.
> Structurally, this collection contains great-grandchildren.
> Then I have a few collections that deal with content. For now, content is
> just emails and documents with a max size of 2MB, with certain user
> exceptions that can go higher than 2MB.
> Content is indexed twice in terms of the actual content, firstly as
> binary/stream and then as readable text. Metadata is negligible.
>
>
> Challenges:
> When performing full-text searches without concurrently executing updates,
> Solr seems to be doing well. Running updates also does OK-ish given the
> nature of the transaction. However, when I run searches and updates
> simultaneously, performance drops quite significantly. I have played with
> field properties, analyzers, tokenizers, sharding sizes, etc.
> Any advice?
> I would like to know if anyone has done something similar. Please excuse
> the long-winded message.
>
>
> --
> Sent from Gmail Mobile
>
>


Re: Different solr score between stand alone vs cloud mode solr

2018-06-07 Thread David Hastings
Also, the score is a fluid number; you shouldn't use the score for any real
purpose aside from seeing that the documents are in the right order relative
to the scores of the other documents in the result set, or the occasional
case where two results swap places because they have the same score.

On Thu, Jun 7, 2018 at 3:18 PM, Erick Erickson 
wrote:

> Short form:
>
> As docs are updated, they're marked as deleted until the segment is
> merged. This affects things like term frequency and doc frequency
> which in turn influences the score.
>
> Due to how commits happen, i.e. autocommit will hit at slightly skewed
> wall-clock time, different segments are merged on different replicas
> of the same shard. Thus the scores can be slightly different
>
> You can turn on distributed stats which will help with this:
> https://issues.apache.org/jira/browse/SOLR-1632
>
> Best,
> Erick
>
> On Thu, Jun 7, 2018 at 12:07 PM, Wei  wrote:
> > Hi,
> >
> > Recently we have an observation that really puzzled us.  We have two
> > instances of Solr,  one in stand alone mode and one is a single-shard
> solr
> > cloud with a couple of replicas.  Both are indexed with the same
> documents
> > and have same solr version 6.6.2.  When issue the same query, the solr
> > score from stand alone and cloud are different.  How could this happen?
> > With the same data, software version and query,  should solr score be
> > exactly same regardless of cloud mode or not?
> >
> > Thanks,
> > Wei
>


Re: Different solr score between stand alone vs cloud mode solr

2018-06-07 Thread Erick Erickson
Short form:

As docs are updated, they're marked as deleted until the segment is
merged. This affects things like term frequency and doc frequency
which in turn influences the score.

Due to how commits happen (i.e. autocommit will hit at slightly skewed
wall-clock times), different segments are merged on different replicas
of the same shard. Thus the scores can be slightly different.

You can turn on distributed stats which will help with this:
https://issues.apache.org/jira/browse/SOLR-1632
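
For reference, a hedged sketch of enabling it: distributed IDF is configured
with a statsCache element in solrconfig.xml. ExactStatsCache is one of the
available implementations; see the ref guide for the trade-offs of the others.

    <!-- score with global collection stats instead of per-replica stats -->
    <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>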

Best,
Erick

On Thu, Jun 7, 2018 at 12:07 PM, Wei  wrote:
> Hi,
>
> Recently we have an observation that really puzzled us.  We have two
> instances of Solr,  one in stand alone mode and one is a single-shard solr
> cloud with a couple of replicas.  Both are indexed with the same documents
> and have same solr version 6.6.2.  When issue the same query, the solr
> score from stand alone and cloud are different.  How could this happen?
> With the same data, software version and query,  should solr score be
> exactly same regardless of cloud mode or not?
>
> Thanks,
> Wei


Different solr score between stand alone vs cloud mode solr

2018-06-07 Thread Wei
Hi,

Recently we made an observation that really puzzled us.  We have two
instances of Solr, one in stand-alone mode and one a single-shard SolrCloud
with a couple of replicas.  Both are indexed with the same documents and
have the same Solr version, 6.6.2.  When issuing the same query, the Solr
scores from stand-alone and cloud are different.  How could this happen?
With the same data, software version and query, shouldn't the Solr score be
exactly the same regardless of cloud mode?

Thanks,
Wei


Re: Apache and Apache Solr together

2018-06-07 Thread Shawn Heisey
On 6/6/2018 12:57 AM, azharuddin wrote:
> I've got a question: I came across Apache Solr as a requirement for a module
> I'm installing, and even after reading the documentation on Apache Solr's
> official homepage I'm still not sure whether Apache Solr runs alongside
> regular Apache or requires its own server. If it does work alongside Apache,
> are there any known issues/problems that I should be aware of? How would
> this architecture (Apache and Apache Solr) work in terms of file system and
> serving pages? I'm sorry if the question might sound silly but I'm very new
> to the whole server-side programming/setup world.

Solr is completely separate software from the Apache webserver (httpd). 
It can run on the same server, or different servers.

Solr should not be accessible by untrusted parties, especially the
Internet.  Another way to think about it: When you put an application on
the Internet that uses a database, you might run the database software
on the same machine as Apache ... but I really doubt that you would
allow the Internet to get directly to the database.  Solr offers a
service that can be used by a website, just like a database server
does.  Access to it should be controlled in a similar way.

Thanks,
Shawn



Solr for Content Management

2018-06-07 Thread Moenieb Davids
Hi All,

Background:
I am currently testing a deployment of a content management framework where
I am trying to punt Solr as the tool of choice for ingestion and searching.

Current status:
I have deployed SolrCloud across multiple servers with multiple shards and
a replication factor of 2.
In terms of collections, I have a person collection that contains details
of individuals, including address and high-level portfolio info.
Structurally, this collection contains great-grandchildren.
Then I have a few collections that deal with content. For now, content is
just emails and documents with a max size of 2MB, with certain user
exceptions that can go higher than 2MB.
Content is indexed twice in terms of the actual content, firstly as
binary/stream and then as readable text. Metadata is negligible.


Challenges:
When performing full-text searches without concurrently executing updates,
Solr seems to be doing well. Running updates also does OK-ish given the
nature of the transaction. However, when I run searches and updates
simultaneously, performance drops quite significantly. I have played with
field properties, analyzers, tokenizers, sharding sizes, etc.
Any advice?
I would like to know if anyone has done something similar. Please excuse
the long-winded message.


-- 
Sent from Gmail Mobile



-- 
Sent from Gmail Mobile


Re: Query a particular index from a multivalued field.

2018-06-07 Thread Erick Erickson
There's no such syntax OOB.

You could append an index to it. So your input doc would look something like:

 doc 1= {
"id": "1",
"status": [
  "b1",
  "a2"
]
 }

and search appropriately.

Perhaps this would be a duplicated field used only when you wanted to
search by position.
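
As a hedged illustration of that idea (the field name and position scheme
are made up): if each value is stored with its position prepended, a plain
term query can pin the position:

    doc 1:  status_pos = ["0_b", "1_a"]
    doc 2:  status_pos = ["0_c", "1_b"]

    q=status_pos:0_b   ->  matches doc 1 only ("b" at index 0)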

Best,
Erick

On Thu, Jun 7, 2018 at 8:36 AM, root23  wrote:
> Hi all,
> is there a way i can query a particular index of a multivalued field.
> e.g lets say i have a document like this
>  doc 1= {
> "id": "1",
> "status": [
>   "b",
>   "a"
> ]
>  }
>
> doc2= {
> "id": "1",
> "status": [
>   "c",
>   "b"
> ]
>  }
>
> can i query like give me the document which has status = b at index 0. which
> should only return doc 1.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Query a particular index from a multivalued field.

2018-06-07 Thread root23
Hi all,
Is there a way I can query a particular index of a multivalued field?
E.g. let's say I have a document like this:
 doc 1= {
"id": "1",
"status": [
  "b",
  "a"
]
 }

doc2= {
"id": "1",
"status": [
  "c",
  "b"
]
 }

Can I query something like "give me the documents which have status = b at
index 0", which should only return doc 1?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Delete then re-add a core

2018-06-07 Thread Erick Erickson
Amanda:

Your Solr log will record each update that comes through. It's a
little opaque, by default it'll show you the first 10 IDs of each
batch it receives.

Guesses:
- you're somehow having the same ID () assigned to multiple documents
- your schemas are a bit different and the docs can't be indexed
(undefined field for instance).


Best,
Erick


On Thu, Jun 7, 2018 at 7:49 AM, Amanda Shuman  wrote:
> Thanks, Shawn, that is a remarkably clear description.
>
> I am able to create the core and all appears fine, but when I go to index I
> am unfortunately running into a new problem. I am indexing from the same
> site content as before (it's just an Omeka install with a solr plug-in that
> reindexes the site), but now it only indexes 3 (!) records out of 3000+ and
> then stops. I have no idea why. The old core - with a different name -
> still works, even I choose to reindex it. Now I have to figure out which
> error logs to check -- Solr or Omeka.
>
> Amanda
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Thu, Jun 7, 2018 at 3:08 PM, Shawn Heisey  wrote:
>
>> On 6/7/2018 4:12 AM, Amanda Shuman wrote:
>>
>>> Definitely not a permissions problem - everything is run by the solr user,
>>> which owns everything in the directories. I just can't figure out why the
>>> default working directory is in opt rather than var (which is where it
>>> should be according to a previous chain I was in).
>>>
>>> But at this point I'm at a total loss, so maybe a fresh install wouldn't
>>> hurt.
>>>
>>
>> The "bin/solr" script, which is ultimately how Solr is started even when
>> it is installed as a service, initially sets the current working directory
>> to a directory that it knows as SOLR_TIP.  This is the directory containing
>> bin, server, and others.  It defaults to /opt/solr when Solr is installed
>> as a service.
>>
>> Then just before Solr is started, the script will change the current
>> working directory to the server directory, which is a subdirectory of
>> SOLR_TIP.
>>
>> So when Solr starts, the current working directory is $SOLR_TIP/server.
>>
>> The service installer sets the owner of everything in SOLR_TIP to root.
>> The solr user has absolutely no reason to write to that directory at all.
>> Everything that Solr writes will be to an absolute path under the "var dir"
>> given during service install, which defaults to /var/solr.  THAT directory
>> and all its contents will be owned by the user specified during install,
>> which defaults to solr.
>>
>> The current working directory is where the developers want it, and will
>> not be in the "var dir".  Its location is critical for correct Jetty
>> operation.  When Solr is configured in the expected way for a service
>> install, it does not use the current working directory.
>>
>> Thanks,
>> Shawn
>>
>>


Re: Solr start script

2018-06-07 Thread Cassandra Targett
The reason why you pass the DirectoryFactory at startup is so that every
collection/core that's created is automatically stored in HDFS, without each
solrconfig.xml having to be edited to say that's where it should be stored.

If you prefer to only store certain collections/cores in HDFS, you would
only set those properties in the solrconfig.xml files for the collection.

The properties do still need to be defined in solrconfig.xml, which the
documentation you pointed to says - make the change in solrconfig.xml, then
pass the properties at startup.

On Thu, Jun 7, 2018 at 9:25 AM Greenhorn Techie 
wrote:

> Shawn, Thanks for your response. Please find my follow-up questions:
>
> 1. My understanding is that Directory Factory settings are typically at a
> collection / core level. If thats the case, what is the advantage of
> passing it along with the start script?
> 2. In your below response, did you mean that even though I pass the
> settings as part of start script, they dont have any value unless they are
> mentioned as part of the solrconfig.xml file?
> 3. As per my previous email, what does Solr do if my solfconfig.xml contain
> NRTDirectoryFactory setting while the solr script is started with HDFS
> settings?
>
> Thanks
>
>
> On 7 June 2018 at 15:08:02, Shawn Heisey (apa...@elyograg.org) wrote:
>
> On 6/7/2018 7:37 AM, Greenhorn Techie wrote:
> > When the above settings are passed as part of start script, does that
> mean
> > whenever a new collection is created, Solr is going to store the indexes
> in
> > HDFS? But what if I upload my solrconfig.xml to ZK which contradicts with
> > this and contains NRTDirectoryFactory setting? Given the above start
> > script, should / could I skip the directory factory setting section in my
> > solrconfig.xml with the assumption that the collections are going to be
> > stored on HDFS *by default*?
>
> Those commandline options are Java system properties.  It looks like the
> example configs DO have settings in them that would use the
> solr.directoryFactory and solr.lock.type properties.  But if your
> solrconfig.xml file doesn't reference those properties, then they
> wouldn't make any difference.  The last one is probably a setting that
> HdfsDirectoryFactory uses that doesn't need to be explicitly referenced
> in a config file.
>
> Thanks,
> Shawn
>


Re: Delete then re-add a core

2018-06-07 Thread Amanda Shuman
Thanks, Shawn, that is a remarkably clear description.

I am able to create the core and all appears fine, but when I go to index I
am unfortunately running into a new problem. I am indexing from the same
site content as before (it's just an Omeka install with a solr plug-in that
reindexes the site), but now it only indexes 3 (!) records out of 3000+ and
then stops. I have no idea why. The old core - with a different name -
still works, even I choose to reindex it. Now I have to figure out which
error logs to check -- Solr or Omeka.

Amanda

--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Thu, Jun 7, 2018 at 3:08 PM, Shawn Heisey  wrote:

> On 6/7/2018 4:12 AM, Amanda Shuman wrote:
>
>> Definitely not a permissions problem - everything is run by the solr user,
>> which owns everything in the directories. I just can't figure out why the
>> default working directory is in opt rather than var (which is where it
>> should be according to a previous chain I was in).
>>
>> But at this point I'm at a total loss, so maybe a fresh install wouldn't
>> hurt.
>>
>
> The "bin/solr" script, which is ultimately how Solr is started even when
> it is installed as a service, initially sets the current working directory
> to a directory that it knows as SOLR_TIP.  This is the directory containing
> bin, server, and others.  It defaults to /opt/solr when Solr is installed
> as a service.
>
> Then just before Solr is started, the script will change the current
> working directory to the server directory, which is a subdirectory of
> SOLR_TIP.
>
> So when Solr starts, the current working directory is $SOLR_TIP/server.
>
> The service installer sets the owner of everything in SOLR_TIP to root.
> The solr user has absolutely no reason to write to that directory at all.
> Everything that Solr writes will be to an absolute path under the "var dir"
> given during service install, which defaults to /var/solr.  THAT directory
> and all its contents will be owned by the user specified during install,
> which defaults to solr.
>
> The current working directory is where the developers want it, and will
> not be in the "var dir".  Its location is critical for correct Jetty
> operation.  When Solr is configured in the expected way for a service
> install, it does not use the current working directory.
>
> Thanks,
> Shawn
>
>


Re: HDP Search - Configuration & Data Directories

2018-06-07 Thread Cassandra Targett
The documentation for HDP Search is online (and included in the package
actually). This page has the descriptions for the Ambari parameters:
https://doc.lucidworks.com/lucidworks-hdpsearch/3.0.0/Guide-Install-Ambari.html
.

HDP Search is a package developed by Lucidworks but distributed by
Hortonworks, so Shawn is right, you should go through them for further
questions.

On Thu, Jun 7, 2018 at 8:39 AM Greenhorn Techie 
wrote:

> Thanks Shawn. Will check with Hortonworks!
>
>
> On 7 June 2018 at 14:19:43, Shawn Heisey (apa...@elyograg.org) wrote:
>
> On 6/7/2018 6:35 AM, Greenhorn Techie wrote:
> > A quick question on configuring Solr with Hortonworks HDP. I have
> installed
> > HDP and then installed HDP Search using the steps described under the
> link
>
> 
>
> > - Within the various Solr config settings on Ambari, I am a bit confused
> > on the role of "solr_config_conf_dir" parameter. At the moment, it only
> > contains log4j.properties file. As HDPSearch is mainly meant to be used
> > with SolrCloud, wondering what is the significance of this directory as
> the
> > configurations are always maintained on ZooKeeper.
>
> The text strings "solr_config_conf_dir" and "solr_config_data_dir" do
> not appear anywhere in the Lucene/Solr source code, even if I use a
> case-insensitive grep. Which must mean that it is specific to the
> third-party software you are using.  You'll need to ask your question to
> the people who make that third-party software.
>
> The log4j config is not in zookeeper.  That will be found on each
> server.  That file configures the logging framework at the JVM level, it
> is not specifically for Solr.
>
> Thanks,
> Shawn
>


Re: Solr start script

2018-06-07 Thread Greenhorn Techie
Shawn, Thanks for your response. Please find my follow-up questions:

1. My understanding is that Directory Factory settings are typically at a
collection / core level. If that's the case, what is the advantage of
passing them along with the start script?
2. In your response below, did you mean that even though I pass the settings
as part of the start script, they don't have any effect unless they are
referenced in the solrconfig.xml file?
3. As per my previous email, what does Solr do if my solrconfig.xml contains
the NRTDirectoryFactory setting while the Solr script is started with HDFS
settings?

Thanks


On 7 June 2018 at 15:08:02, Shawn Heisey (apa...@elyograg.org) wrote:

On 6/7/2018 7:37 AM, Greenhorn Techie wrote:
> When the above settings are passed as part of start script, does that
mean
> whenever a new collection is created, Solr is going to store the indexes
in
> HDFS? But what if I upload my solrconfig.xml to ZK which contradicts with
> this and contains NRTDirectoryFactory setting? Given the above start
> script, should / could I skip the directory factory setting section in my
> solrconfig.xml with the assumption that the collections are going to be
> stored on HDFS *by default*?

Those commandline options are Java system properties.  It looks like the
example configs DO have settings in them that would use the
solr.directoryFactory and solr.lock.type properties.  But if your
solrconfig.xml file doesn't reference those properties, then they
wouldn't make any difference.  The last one is probably a setting that
HdfsDirectoryFactory uses that doesn't need to be explicitly referenced
in a config file.

Thanks,
Shawn


"ADDREPLICA failed to create replica"

2018-06-07 Thread solrnoobie
So we have Solr 6.6.3 deployed on an AWS EC2 instance (dockerized), and
during our load testing our script for some reason removed 1 replica. So I
decided to add 1 replica to the shard that had only 1 replica left, and it
returned the error message in the title.


What can cause this? We have another collection in the cluster and it can
easily add a replica, but this particular collection proved to be
problematic.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr start script

2018-06-07 Thread Shawn Heisey

On 6/7/2018 7:37 AM, Greenhorn Techie wrote:

When the above settings are passed as part of start script, does that mean
whenever a new collection is created, Solr is going to store the indexes in
HDFS? But what if I upload my solrconfig.xml to ZK which contradicts with
this and contains NRTDirectoryFactory setting? Given the above start
script, should / could I skip the directory factory setting section in my
solrconfig.xml with the assumption that the collections are going to be
stored on HDFS *by default*?


Those commandline options are Java system properties.  It looks like the 
example configs DO have settings in them that would use the 
solr.directoryFactory and solr.lock.type properties.  But if your 
solrconfig.xml file doesn't reference those properties, then they 
wouldn't make any difference.  The last one is probably a setting that 
HdfsDirectoryFactory uses that doesn't need to be explicitly referenced 
in a config file.
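
For reference, a hedged sketch of what that property plumbing typically looks
like in solrconfig.xml. The first element is the stock example (placeholder
substitution, so the -D options at startup override the defaults); the second
is the HDFS variant adapted from the ref guide, with a hypothetical namenode
address, for when you'd rather hardcode it per collection.

    <!-- stock example: falls back to NRTCachingDirectoryFactory unless
         -Dsolr.directoryFactory=... is passed at startup -->
    <directoryFactory name="DirectoryFactory"
                      class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

    <!-- HDFS variant, hardcoded instead of driven by system properties -->
    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/user/solr</str>
    </directoryFactory>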


Thanks,
Shawn



Re[2]: Sort hits in the order of subqueries

2018-06-07 Thread Robert K .
Hello,

I had a look at the Constant Score approach suggested by Emir: (q0)^=100 OR
(q1)^=90 ...

As observed by Alexandre, it seems to introduce stratification at the cost
of the intra-query ranking, which is not satisfactory.

So if I imagine Constant Score as a function f(x) = C operating on a
document score and constrained to a subquery, then what I would like is a
shifted sigmoid F(x, C) = C + 1 / (1 + exp(-x)) applied to the document
scores of the individual subqueries.

Instead of:

ConstantScore(q0, 100) OR ConstantScore(q1, 90) ...

then:

SigmoidScore(q0, 100) OR SigmoidScore(q1, 90) ...

I'm pretty sure it is possible to take the ConstantScore class and end up
with a sigmoid version as a custom extension. Still hoping for a hint on the
simplest approach to achieve the stratification.


Next question in this context: in some cases we happen to sort some
subqueries by different fields. It looks like:

(q0 sorted by date) OR (q1 sorted by relevancy)


Wondering if you have any idea how that could be formulated in Solr.

Regards,

Robert


>Thursday, 7 June 2018, 15:20 +02:00 from Alexandre Rafalovitch 
>:
>
>I think this solution will destroy intra-query ranking. So all results in
>q0 come before q1 but would be random within q0 results.
>
>Would instead just a bunch of boost queries with different weights
>(additive probably) be a better way to introduce stratification?
>
>Regards,
>   Alex
>
>On Thu, Jun 7, 2018, 13:19 Emir Arnautović, < emir.arnauto...@sematext.com >
>wrote:
>
>> Hi Robert,
>> If I get your requirement right, you can solve it with following:
>> (q0)^=100 OR (q1)^=90….
>>
>> Assuming there are no overlaps - otherwise, one matching multiple
>> conditions can change the ordering.
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training -  http://sematext.com/
>>
>>
>>
>> > On 7 Jun 2018, at 11:53, Robert K. < wk.rk.sk...@mail.ru.INVALID > wrote:
>> >
>> > Hello,
>> >
>> > I am investigating the following use case.
>> >
>> > Suppose I have a list of queries q_0, q_1, ..., q_n which I combine to a
>> boolean query using 'SHOULD'-clauses.
>> > The requirement for the hits sorting is that the results of q_0 precede
>> the results of q_1, the results of q_1 precede the
>> > results of q_2 and so on. If a hit occurs in the results of more than one
>> query, then we should see it only once in the results
>> > of the query with the smallest index.
>> >
>> > I have searched for some solutions but didn't find anything useful so
>> far.
>> >
>> > I have considered following approaches:
>> >
>> > 1. Reformulate: q0 & (q_1 & !q_0) & (q2 & !q_0 & !q1) & ...
>> >
>> > While possible, seems to have a potential negative impact on performance
>> due to multiple evaluations on the same queries.
>> > I didn't do any measurements, though. It is technically possible to
>> optimize the execution of this query to evaluate the subqueries
>> > q_i only once, but I don't know, whether this kind of optimizations is
>> implemented in the current Lucene/Solr. (?)
>> >
>> > 2. Implement CustomScoreQuery. General idea: Take a list of queries and
>> execute them in the context of a BooleanQuery mapping
>> > the scores of the corresponding subqueries to disjunct score ranges,
>> like q_n -> [0,1), q_(n-1) -> [1,2) and so on.
>> >
>> > Problem: CustomScoreQuery is deprecated, FunctionQuery is the recommeded
>> approach. Still I didn't see any obvious solution
>> > how I can use FunctionQuery to implement the idea. Is it possible,
>> should I dive in and try to do it with FunctionQuery.
>> >
>> > 3. Assuming there is some possibility to solve the task with the
>> FunctionQuery (or anything within the out-of-the-box Solr). My questions
>> > are: Is there any solution without having to write our own extension to
>> Solr? Using only what is delivered in the standard distribution of Solr?
>> >
>> >
>> > Note: In the past we solved the problem within our legacy application
>> with a modified BooleanQuery/BooleanScorer. We could migrate
>> > (=rewrite) this extension to the current Solr/Lucene, but it may be not
>> the best option, so I am exploring all the other possibilities.
>> >
>> > Thank you all & Best regards,
>> >
>> > Robert
>>
>>





Difference in fieldLength and avgFieldLength in Solr 6.6 vs Solr 7.1

2018-06-07 Thread rupali pol
Hi all,

We are doing an upgrade from Solr 6.6 to Solr 7.1, and we are seeing a lot
of differences in the ranking and scores between the Solr 6.6 and Solr 7.1
results.

The major differences we observed are in the fieldLength and avgFieldLength
parameters, which are calculated per field, per document, per search term.

*Calculation of tfNorm in Solr 7.1.0 -*
tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b *
fieldLength / avgFieldLength)) from:
  fieldLength    = 53272.0
  avgFieldLength = 7284.331
  termFreq       = 10.0
  parameter k1   = 1.2
  parameter b    = 0.75

*Calculation of tfNorm for the same query in Solr 6.6.0 -*
tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b *
fieldLength / avgFieldLength)) from:
  fieldLength    = 65536.0
  avgFieldLength = 7284.8306
  termFreq       = 10.0
  parameter k1   = 1.2
  parameter b    = 0.75


Can someone please elaborate on what changes Solr 7.1 introduced in the
fieldLength calculation?

Thanks in advance.

Best,
Rups


Re: HDP Search - Configuration & Data Directories

2018-06-07 Thread Greenhorn Techie
Thanks Shawn. Will check with Hortonworks!


On 7 June 2018 at 14:19:43, Shawn Heisey (apa...@elyograg.org) wrote:

On 6/7/2018 6:35 AM, Greenhorn Techie wrote:
> A quick question on configuring Solr with Hortonworks HDP. I have
installed
> HDP and then installed HDP Search using the steps described under the
link



> - Within the various Solr config settings on Ambari, I am a bit confused
> on the role of "solr_config_conf_dir" parameter. At the moment, it only
> contains log4j.properties file. As HDPSearch is mainly meant to be used
> with SolrCloud, wondering what is the significance of this directory as
the
> configurations are always maintained on ZooKeeper.

The text strings "solr_config_conf_dir" and "solr_config_data_dir" do
not appear anywhere in the Lucene/Solr source code, even if I use a
case-insensitive grep. Which must mean that it is specific to the
third-party software you are using.  You'll need to ask your question to
the people who make that third-party software.

The log4j config is not in zookeeper.  That will be found on each
server.  That file configures the logging framework at the JVM level, it
is not specifically for Solr.

Thanks,
Shawn


Solr start script

2018-06-07 Thread Greenhorn Techie
Hi,

For our project purposes, we need to store Solr collections on HDFS.  While
exploring the documentation for the same, I have found lucidworks
documentation (
https://doc.lucidworks.com/lucidworks-hdpsearch/3.0.0/Guide-Install-Manual.html#hdfs-specific-changes)
, where it has been mentioned that solr start script can be passed many
arguments while starting. The example provided is as below:

bin/solr start -c
   -z 10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/solr
   -Dsolr.directoryFactory=HdfsDirectoryFactory
   -Dsolr.lock.type=hdfs
   -Dsolr.hdfs.home=hdfs://sandbox.hortonworks.com:8020/user/solr


What does it actually mean to pass directoryFactory settings to the Solr
start script? I was thinking the Directory Factory setting is something
that applies only at the collection level, i.e. something we need to specify
within the solrconfig.xml file *only*.

When the above settings are passed as part of the start script, does that
mean that whenever a new collection is created, Solr is going to store the
indexes in HDFS? But what if I upload to ZK a solrconfig.xml which
contradicts this and contains the NRTDirectoryFactory setting? Given the
above start script, should/could I skip the directory factory section in my
solrconfig.xml on the assumption that the collections are going to be stored
on HDFS *by default*?

This is confusing to me and hence need the expert advice of the community.

Thanks


Re: Running Solr on HDFS - Disk space

2018-06-07 Thread Shawn Heisey

On 6/7/2018 6:41 AM, Greenhorn Techie wrote:

As HDFS has got its own replication mechanism, with an HDFS replication
factor of 3, and then a SolrCloud replication factor of 3, does that mean
each document will probably have around 9 copies replicated underneath in
HDFS? If so, is there a way to configure HDFS or Solr such that only three
copies are maintained overall?


Yes, that is exactly what happens.

SolrCloud replication assumes that each of its replicas is a completely 
independent index.  I am not aware of anything in Solr's HDFS support 
that can use one HDFS index directory for multiple replicas.  At the 
most basic level, a Solr index is a Lucene index.  Lucene goes to great 
lengths to make sure that an index *CANNOT* be used in more than one place.


Perhaps somebody who is more familiar with HDFSDirectoryFactory can 
offer you a solution.  But as far as I know, there isn't one.


Thanks,
Shawn



Re: Streaming Expression intersect() behaviour

2018-06-07 Thread Christian Spitzlay



> Am 07.06.2018 um 11:34 schrieb Christian Spitzlay 
> :
> 
> intersect(
> cartesianProduct(tuple(fieldA=array(a,b,c,c)), fieldA, productSort="fieldA 
> asc"),
> cartesianProduct(tuple(fieldB=array(a,c)), fieldB, productSort="fieldB asc"),
> on="fieldA=fieldB"
> )
> 
> I simplified it a bit, too. I still get one document with fieldA == a.
> I would have expected three documents in the output, one with fieldA == a and 
> two with fieldB == c.

That should have read „… and two with fieldA == c“ of course.





Re: Graph traversal: Bypass cycle detection?

2018-06-07 Thread Joel Bernstein
Ah. I'll do some testing to see exactly how the nodes function behaves when
a node links to itself.
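
For context, a minimal sketch of the kind of breadth-first traversal being
discussed (the collection and field names are hypothetical):

    nodes(links,
          walk="root_1->from_id",
          gather="to_id",
          scatter="branches,leaves")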

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jun 7, 2018 at 5:06 AM, Christian Spitzlay <
christian.spitz...@biologis.com> wrote:

> Hi,
>
>
> > Am 07.06.2018 um 03:20 schrieb Joel Bernstein :
> >
> > Hi,
> >
> > At this time cycle detection is built into the nodes expression and
> cannot
> > be turned off. The nodes expression is really designed to do a
> traditional
> > breadth first search through a graph where cycle detection is needed so
> you
> > don't continually walk the same nodes.
> >
> > Are you looking to do random walk analysis?
> > I've been meaning to add a
> > function that supports random walks on a graph that would not do cycle
> > detection.
>
>
> No, this is not about random walks. We have an application that knows
> different types of entities and links between them. Both entities and
> links
> are indexed and we create additional documents to represent relations
> between the entities to prepare a network we can search on.
> A regular walk with nodes() is part of that.
>
> There is an issue in a situation where one of the entities in the original
> system is linked to itself.  I haven't finished analysing the problem yet
> but I wondered whether there was an easy way to rule out that
> cycle detection is causing it.
>
>
> Best regards,
> Christian Spitzlay
>
>
>
>
>
>
> > christian.spitz...@biologis.com> wrote:
> >
> >> Hi,
> >>
> >> is it possible to bypass the cycle detection so a traversal
> >> can revisit nodes?
> >>
> >> The documentation at
> >> https://lucene.apache.org/solr/guide/7_3/graph-
> >> traversal.html#cycle-detection
> >> does not mention any and lists reasons why the cycle detection is in
> place.
> >> But if I were willing to live with the consequences would it be
> possible?
> >>
> >>
> >> Best regards
> >> Christian Spitzlay
> >>
> >>
>
>


Re: Sort hits in the order of subqueries

2018-06-07 Thread Alexandre Rafalovitch
I think this solution will destroy intra-query ranking. So all results in
q0 come before q1's, but would be random within the q0 results.

Would a bunch of boost queries with different weights (probably additive)
instead be a better way to introduce stratification?

Regards,
   Alex

On Thu, Jun 7, 2018, 13:19 Emir Arnautović, 
wrote:

> Hi Robert,
> If I get your requirement right, you can solve it with following:
> (q0)^=100 OR (q1)^=90….
>
> Assuming there are no overlaps - otherwise, one matching multiple
> conditions can change the ordering.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 7 Jun 2018, at 11:53, Robert K.  wrote:
> >
> > Hello,
> >
> > I am investigating the following use case.
> >
> > Suppose I have a list of queries q_0, q_1, ..., q_n which I combine to a
> boolean query using 'SHOULD'-clauses.
> > The requirement for the hits sorting is that the results of q_0 precede
> the results of q_1, the results of q_1 precede the
> > results of q_2 and so on. If a hit occurs in the results of more than one
> query, then we should see it only once in the results
> > of the query with the smallest index.
> >
> > I have searched for some solutions but didn't find anything useful so
> far.
> >
> > I have considered following approaches:
> >
> > 1. Reformulate: q0 & (q_1 & !q_0) & (q2 & !q_0 & !q1) & ...
> >
> > While possible, seems to have a potential negative impact on performance
> due to multiple evaluations on the same queries.
> > I didn't do any measurements, though. It is technically possible to
> optimize the execution of this query to evaluate the subqueries
> > q_i only once, but I don't know, whether this kind of optimizations is
> implemented in the current Lucene/Solr. (?)
> >
> > 2. Implement CustomScoreQuery. General idea: Take a list of queries and
> execute them in the context of a BooleanQuery mapping
> > the scores of the corresponding subqueries to disjunct score ranges,
> like q_n -> [0,1), q_(n-1) -> [1,2) and so on.
> >
> > Problem: CustomScoreQuery is deprecated, FunctionQuery is the recommeded
> approach. Still I didn't see any obvious solution
> > how I can use FunctionQuery to implement the idea. Is it possible,
> should I dive in and try to do it with FunctionQuery.
> >
> > 3. Assuming there is some possibility to solve the task with the
> FunctionQuery (or anything within the out-of-the-box Solr). My questions
> > are: Is there any solution without having to write our own extension to
> Solr? Using only what is delivered in the standard distribution of Solr?
> >
> >
> > Note: In the past we solved the problem within our legacy application
> with a modified BooleanQuery/BooleanScorer. We could migrate
> > (=rewrite) this extension to the current Solr/Lucene, but it may be not
> the best option, so I am exploring all the other possibilities.
> >
> > Thank you all & Best regards,
> >
> > Robert
>
>


Re: HDP Search - Configuration & Data Directories

2018-06-07 Thread Shawn Heisey

On 6/7/2018 6:35 AM, Greenhorn Techie wrote:

A quick question on configuring Solr with Hortonworks HDP. I have installed
HDP and then installed HDP Search using the steps described under the link





- Within the various Solr config settings on Ambari, I am a bit confused
on the role of "solr_config_conf_dir" parameter. At the moment, it only
contains log4j.properties file. As HDPSearch is mainly meant to be used
with SolrCloud, wondering what is the significance of this directory as the
configurations are always maintained on ZooKeeper.


The text strings "solr_config_conf_dir" and "solr_config_data_dir" do 
not appear anywhere in the Lucene/Solr source code, even if I use a 
case-insensitive grep. Which must mean that it is specific to the 
third-party software you are using.  You'll need to ask your question to 
the people who make that third-party software.


The log4j config is not in zookeeper.  That will be found on each 
server.  That file configures the logging framework at the JVM level, it 
is not specifically for Solr.


Thanks,
Shawn



Re: Streaming Expression intersect() behaviour

2018-06-07 Thread Joel Bernstein
Nice example!

I'll take a look at this today. I believe there was/is a bug with some of
the joins where the "on" parameter is transposing the fields. It's possible
that is the case here as well.



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jun 7, 2018 at 5:34 AM, Christian Spitzlay <
christian.spitz...@biologis.com> wrote:

> Hi,
>
> I noticed that my mail program broke the test case by replacing a double
> quote with a different UTF-8 character.
>
> Here is the test case again and I hope it will work this time:
>
> intersect(
> cartesianProduct(tuple(fieldA=array(a,b,c,c)), fieldA,
> productSort="fieldA asc"),
> cartesianProduct(tuple(fieldB=array(a,c)), fieldB, productSort="fieldB
> asc"),
> on="fieldA=fieldB"
> )
>
> I simplified it a bit, too. I still get one document with fieldA == a.
> I would have expected three documents in the output, one with fieldA == a
> and two with fieldB == c.
> Did I misunderstand the docs of the intersect decorator or have I come
> across a bug?
>
>
> Best regards,
> Christian Spitzlay
>
>
>
> > Am 06.06.2018 um 10:18 schrieb Christian Spitzlay <
> christian.spitz...@biologis.com>:
> >
> > Hi,
> >
> > I don’t seem to get the behaviour of the intersect() stream decorator.
> > I only ever get one doc from the left stream when I would have expected
> > more than one.
> >
> > I constructed a test case that does not depend on my concrete index:
> >
> > intersect(
> > cartesianProduct(tuple(fieldA=array(c,c,a,b,d,d)), fieldA,
> productSort="fieldA asc"),
> > cartesianProduct(tuple(fieldB=array(c,c,a,d,d)), fieldB,
> productSort="fieldB asc"),
> > on="fieldA=fieldB“
> > )
> >
> >
> > The result:
> >
> > {
> >  "result-set": {
> >"docs": [
> >  {
> >"fieldA": "a"
> >  },
> >  {
> >"EOF": true,
> >"RESPONSE_TIME": 0
> >  }
> >]
> >  }
> > }
> >
> >
> > I would have expected all the docs from the left stream with fieldA
> values a, c, d
> > and only the docs with fieldA == b missing.  Do I have a fundamental
> misunderstanding?
> >
> >
> > Best regards
> > Christian Spitzlay
> >
> >
>
>


Re: Delete then re-add a core

2018-06-07 Thread Shawn Heisey

On 6/7/2018 4:12 AM, Amanda Shuman wrote:

Definitely not a permissions problem - everything is run by the solr user,
which owns everything in the directories. I just can't figure out why the
default working directory is in opt rather than var (which is where it
should be according to a previous chain I was in).

But at this point I'm at a total loss, so maybe a fresh install wouldn't
hurt.


The "bin/solr" script, which is ultimately how Solr is started even when 
it is installed as a service, initially sets the current working 
directory to a directory that it knows as SOLR_TIP.  This is the 
directory containing bin, server, and others.  It defaults to /opt/solr 
when Solr is installed as a service.


Then just before Solr is started, the script will change the current 
working directory to the server directory, which is a subdirectory of 
SOLR_TIP.


So when Solr starts, the current working directory is $SOLR_TIP/server.

The service installer sets the owner of everything in SOLR_TIP to root.  
The solr user has absolutely no reason to write to that directory at 
all.  Everything that Solr writes will be to an absolute path under the 
"var dir" given during service install, which defaults to /var/solr.  
THAT directory and all its contents will be owned by the user specified 
during install, which defaults to solr.


The current working directory is where the developers want it, and will 
not be in the "var dir".  Its location is critical for correct Jetty 
operation.  When Solr is configured in the expected way for a service 
install, it does not use the current working directory.


Thanks,
Shawn



Re: Dataimport performance

2018-06-07 Thread Shawn Heisey

On 6/7/2018 12:19 AM, kotekaman wrote:

sorry. may i know how to code it?


Code *what*?

Here's the same wiki page that I gave you for your last message:

https://wiki.apache.org/solr/UsingMailingLists

Even if I go to the Nabble website and discover that you've replied to a 
topic that's SEVEN AND A HALF YEARS OLD, that information doesn't help 
me understand exactly what it is you want to know.  The previous 
information in the topic is a question and answer about what kind of 
performance can be expected from the dataimport handler.  There's 
nothing about coding in it.


Thanks,
Shawn



Re: Delta Import Configuration

2018-06-07 Thread Shawn Heisey

On 6/7/2018 12:22 AM, kotekaman wrote:

Should the delta-import use the timestamp in the SQL table?


The text above, and the subject, are the ONLY things I can see in this 
message.  Which makes this an extremely vague question.  This wiki page 
may be relevant:


https://wiki.apache.org/solr/UsingMailingLists

Given that you posted your question on a forum, you might be wondering 
why I'm talking about mailing lists.  There's a simple reason for that 
-- your view might be a forum, but this is in fact a mailing list that 
Nabble just happens to MIRROR as a forum.  This is the info about Solr's 
mailing lists:


http://lucene.apache.org/solr/community.html#mailing-lists-irc

A delta import in Solr's dataimport handler can use anything you give 
it.  The only information that Solr actually records from a previous 
import is the time of the import, so if you haven't provided any extra 
info, then that might be the only information available.  But if you 
want to give it more information when you ask it to do a delta import, 
Solr will be able to use that information.
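
For example, here is a minimal SolrJ sketch that triggers a delta import and
passes extra information along. The core URL, the /dataimport handler path
and the "since" parameter are all assumptions; on the DIH side such a request
parameter can be referenced as ${dataimporter.request.since}:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DeltaImportTrigger {
    public static void main(String[] args) throws Exception {
        try (SolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "delta-import");  // run a delta rather than a full import
            params.set("clean", "false");           // keep existing documents
            // Any extra parameter is visible to the DIH config
            // as ${dataimporter.request.since}
            params.set("since", "2018-06-01 00:00:00");
            QueryRequest request = new QueryRequest(params);
            request.setPath("/dataimport");         // path the DIH handler is registered under
            client.request(request);
        }
    }
}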


Thanks,
Shawn



Running Solr on HDFS - Disk space

2018-06-07 Thread Greenhorn Techie
Hi,

As HDFS has its own replication mechanism, with an HDFS replication
factor of 3 and a SolrCloud replication factor of 3, does that mean
each document will end up with around 9 copies stored underneath in
HDFS? If so, is there a way to configure HDFS or Solr such that only
three copies are maintained overall?

Thanks


HDP Search - Configuration & Data Directories

2018-06-07 Thread Greenhorn Techie
Hi,

A quick question on configuring Solr with Hortonworks HDP. I have installed
HDP and then installed HDP Search using the steps described under the link
-
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_solr-search-installation/content/hdp-search30-install-mpack.html


I have used the link from lucidworks to configure various parameters
exposed on Ambari -
https://doc.lucidworks.com/lucidworks-hdpsearch/3.0.0/Guide-Install-Ambari.html#_startup-option-reference


   - Within the various Solr config settings on Ambari, I am a bit confused
   about the role of the "solr_config_conf_dir" parameter. At the moment, it
   only contains the log4j.properties file. As HDP Search is mainly meant to
   be used with SolrCloud, I am wondering about the significance of this
   directory, given that the configurations are always maintained in ZooKeeper.
   - Another question: when the indexes for SolrCloud collections are stored
   on HDFS, what is the significance of "solr_config_data_dir"? Is the
   solr_config_data_dir directory used ONLY for collections whose directory
   factory is set to local storage? If so, is it safe to assume that this
   directory is not needed when HDFS is being used?

Thanks


Re: Sort hits in the order of subqueries

2018-06-07 Thread Emir Arnautović
Hi Robert,
If I get your requirement right, you can solve it with the following:
(q_0)^=100 OR (q_1)^=90 OR ...

Assuming there are no overlaps - otherwise, a document matching multiple
conditions can change the ordering (e.g. with additive scoring, a document
matching both q_1 and q_2 would score 170 and outrank q_0-only documents
that score 100).
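
On the SolrJ side that could look like this - a minimal sketch; the field
names and weights are made up, and the constant-score operator ^= requires
Solr 5.1 or later:

import org.apache.solr.client.solrj.SolrQuery;

public class OrderedSubqueries {
    public static void main(String[] args) {
        // Each clause contributes a fixed score, so (absent overlaps) all
        // title matches sort before body-only matches, which sort before
        // tags-only matches.
        SolrQuery q = new SolrQuery(
            "(title:solr)^=100 OR (body:solr)^=90 OR (tags:solr)^=80");
        System.out.println(q);
    }
}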

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Jun 2018, at 11:53, Robert K.  wrote:
> 
> Hello,
> 
> I am investigating the following use case.
> 
> Suppose I have a list of queries q_0, q_1, ..., q_n which I combine into a
> boolean query using 'SHOULD'-clauses.
> The requirement for the hits sorting is that the results of q_0 precede the
> results of q_1, the results of q_1 precede the results of q_2, and so on.
> If a hit occurs in the results of more than one query, then we should see
> it only once in the results of the query with the smallest index.
> 
> I have searched for some solutions but didn't find anything useful so far.
> 
> I have considered the following approaches:
> 
> 1. Reformulate: q_0 | (q_1 & !q_0) | (q_2 & !q_0 & !q_1) | ...
> 
> While possible, this seems to have a potential negative impact on
> performance due to multiple evaluations of the same queries.
> I didn't do any measurements, though. It is technically possible to optimize
> the execution of this query to evaluate the subqueries q_i only once, but I
> don't know whether this kind of optimization is implemented in the current
> Lucene/Solr. (?)
> 
> 2. Implement CustomScoreQuery. General idea: Take a list of queries and
> execute them in the context of a BooleanQuery, mapping the scores of the
> corresponding subqueries to disjoint score ranges, like q_n -> [0,1),
> q_(n-1) -> [1,2), and so on.
> 
> Problem: CustomScoreQuery is deprecated; FunctionQuery is the recommended
> approach. Still, I didn't see any obvious solution for how to use
> FunctionQuery to implement the idea. Is it possible? Should I dive in and
> try to do it with FunctionQuery?
> 
> 3. Assuming there is some possibility to solve the task with FunctionQuery
> (or anything within out-of-the-box Solr), my question is: Is there any
> solution without having to write our own extension to Solr, using only what
> is delivered in the standard distribution of Solr?
> 
> 
> Note: In the past we solved the problem within our legacy application with
> a modified BooleanQuery/BooleanScorer. We could migrate (=rewrite) this
> extension to the current Solr/Lucene, but it may not be the best option, so
> I am exploring all the other possibilities.
> 
> Thank you all & Best regards,
> 
> Robert



Re: Delete then re-add a core

2018-06-07 Thread Amanda Shuman
Definitely not a permissions problem - everything is run by the solr user,
which owns everything in the directories. I just can't figure out why the
default working directory is in opt rather than var (which is where it
should be according to a previous chain I was in).

But at this point I'm at a total loss, so maybe a fresh install wouldn't
hurt.


--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Wed, Jun 6, 2018 at 11:09 PM, BlackIce  wrote:

> One of the issues with the install script is that when it's run by any user
> other than "solr" and installed into the default directories,
> one might get ownership/permission problems.
>
> The easiest way to avoid these is to create the "solr" user as a regular
> login user BEFORE installing Solr, log into that account (or use sudo,
> etc.), and then install Solr with non-default values for the directories,
> so that everything is installed within the "solr" user's home directory
> space. That way everything belongs to the solr user and is easily modified
> by just logging into the solr account, and one doesn't have to worry about
> ownership/permissions. And if one makes a mistake it only affects the
> "solr" user...
>
> Anyway, just my 2 cents
>
> On Wed, Jun 6, 2018 at 9:41 PM, Amanda Shuman 
> wrote:
>
> > Thanks, I was able to do most of that but didn't reinstall... Still
> > running into an issue I think is related to the current working
> > directory. I guess reinstalling might fix that?
> > Amanda
> >
> > On Wed, Jun 6, 2018, 17:27 Erick Erickson 
> wrote:
> >
> > > Assuming this is stand-alone:
> > > > find the data dir for the core (parent of the index dir)
> > > > find the config dir for the core
> > > > shut down Solr
> > > > "rm -rf data"
> > > > make any changes to the configs you want
> > > > start Solr
> > >
> > > As BlackIce said, reinstalling works too.
> > >
> > > If it's SolrCloud delete and recreate the collection, your configs
> > > will be in ZooKeeper. Of course update your configs with your changes
> > > before creating the new collection.
> > >
> > > Best,
> > > Erick
> > >
> > >
> > > On Wed, Jun 6, 2018 at 7:09 AM, BlackIce 
> wrote:
> > > > I'm not a Solr guru
> > > >
> > > > I take it that you installed Solr with the install script;
> > > > it installs into a dir where normal users have no right to
> > > > access the necessary files...
> > > >
> > > > One way to circumvent this is to un-install Solr and then re-install
> > > > without using the defaults, having it install into a directory that
> > > > both the solr user and the login user have access to.
> > > >
> > > > Deleting a Core is as simple as deleting its directory...
> > > >
> > > > Hope this helps - good luck
> > > >
> > > > On Wed, Jun 6, 2018 at 3:59 PM, Amanda Shuman <
> amanda.shu...@gmail.com
> > >
> > > > wrote:
> > > >
> > > >> Oh, and I also have a related question - how can I change my CWD
> > > >> (current working directory)? It is set to the /opt/ folder and not
> > > >> /var/ and I think that's screwing things up...
> > > >> Thanks!
> > > >> Amanda
> > > >>
> > > >> --
> > > >> Dr. Amanda Shuman
> > > >> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > >> PhD, University of California, Santa Cruz
> > > >> http://www.amandashuman.net/
> > > >> http://www.prchistoryresources.org/
> > > >> Office: +49 (0) 761 203 4925
> > > >>
> > > >>
> > > >> On Wed, Jun 6, 2018 at 3:35 PM, Amanda Shuman <
> > amanda.shu...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Hi all, I'm a bit of a newbie still but have clearly screwed
> > something
> > > >> > up... so I think what I need to do now is to delete a core (saving
> > > >> current
> > > >> > conf files as-is) then re-add/re-create the core and re-index.
> (It's
> > > not
> > > >> a
> > > >> > big site and it's not public yet, so I'm not concerned about
> taking
> > > >> > anything down during this process.)
> > > >> >
> > > >> > So what's the quickest way to do this:
> > > >> >
> > > >> > 1. Create a new core at command line with different name, move all
> > > conf
> > > >> > files into that (?)
> > > >> > 2. Delete the current core at command line, but what's the script
> > for
> > > >> > doing that to make sure it's totally gone? I see different
> responses
> > > >> > online... not sure what's the "best practice" for this...
> > > >> >
> > > >> > Thanks!
> > > >> > Amanda
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Dr. Amanda Shuman
> > > >> > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > >> > PhD, University of California, Santa Cruz
> > > >> > 

Sort hits in the order of subqueries

2018-06-07 Thread Robert K .
Hello,

I am investigating the following use case.

Suppose I have a list of queries q_0, q_1, ..., q_n which I combine into a
boolean query using 'SHOULD'-clauses.
The requirement for the hits sorting is that the results of q_0 precede the
results of q_1, the results of q_1 precede the results of q_2, and so on.
If a hit occurs in the results of more than one query, then we should see it
only once in the results of the query with the smallest index.

I have searched for some solutions but didn't find anything useful so far.

I have considered the following approaches:

1. Reformulate: q_0 | (q_1 & !q_0) | (q_2 & !q_0 & !q_1) | ...

While possible, this seems to have a potential negative impact on performance
due to multiple evaluations of the same queries.
I didn't do any measurements, though. It is technically possible to optimize
the execution of this query to evaluate the subqueries q_i only once, but I
don't know whether this kind of optimization is implemented in the current
Lucene/Solr. (?)

2. Implement CustomScoreQuery. General idea: Take a list of queries and
execute them in the context of a BooleanQuery, mapping the scores of the
corresponding subqueries to disjoint score ranges, like q_n -> [0,1),
q_(n-1) -> [1,2), and so on.

Problem: CustomScoreQuery is deprecated; FunctionQuery is the recommended
approach. Still, I didn't see any obvious solution for how to use
FunctionQuery to implement the idea. Is it possible? Should I dive in and
try to do it with FunctionQuery?

3. Assuming there is some possibility to solve the task with FunctionQuery
(or anything within out-of-the-box Solr), my question is: Is there any
solution without having to write our own extension to Solr, using only what
is delivered in the standard distribution of Solr?


Note: In the past we solved the problem within our legacy application with a
modified BooleanQuery/BooleanScorer. We could migrate (=rewrite) this
extension to the current Solr/Lucene, but it may not be the best option, so
I am exploring all the other possibilities.

Thank you all & Best regards,

Robert


Re: Dataimport performance

2018-06-07 Thread kotekaman
Sorry, may I know how to code it?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Delta Import Configuration

2018-06-07 Thread kotekaman
Hi all,

Does the delta import have to use the timestamp in the SQL table?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Streaming Expression intersect() behaviour

2018-06-07 Thread Christian Spitzlay
Hi,

I noticed that my mail program broke the test case by replacing a double
quote with a different UTF-8 character.

Here is the test case again and I hope it will work this time:

intersect(
cartesianProduct(tuple(fieldA=array(a,b,c,c)), fieldA, productSort="fieldA 
asc"),
cartesianProduct(tuple(fieldB=array(a,c)), fieldB, productSort="fieldB asc"),
on="fieldA=fieldB"
)

I simplified it a bit, too. I still get one document with fieldA == a.
I would have expected three documents in the output, one with fieldA == a and
two with fieldA == c.
Did I misunderstand the docs of the intersect decorator or have I come across a 
bug?
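
In case it helps with reproducing: a minimal SolrJ sketch that runs the
expression (the base URL and collection name are placeholders; any collection
should work since the expression does not read the index):

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class IntersectRepro {
    public static void main(String[] args) throws Exception {
        String expr = "intersect("
            + "cartesianProduct(tuple(fieldA=array(a,b,c,c)), fieldA, productSort=\"fieldA asc\"),"
            + "cartesianProduct(tuple(fieldB=array(a,c)), fieldB, productSort=\"fieldB asc\"),"
            + "on=\"fieldA=fieldB\")";
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("expr", expr);
        params.set("qt", "/stream");
        SolrStream stream = new SolrStream("http://localhost:8983/solr/mycollection", params);
        stream.open();
        try {
            // Expected: a, c, c -- observed: only a
            for (Tuple t = stream.read(); !t.EOF; t = stream.read()) {
                System.out.println(t.getString("fieldA"));
            }
        } finally {
            stream.close();
        }
    }
}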


Best regards,
Christian Spitzlay



> On 06.06.2018 at 10:18, Christian Spitzlay wrote:
> 
> Hi,
> 
> I don’t seem to get the behaviour of the intersect() stream decorator.
> I only ever get one doc from the left stream when I would have expected 
> more than one.
> 
> I constructed a test case that does not depend on my concrete index:
> 
> intersect(
> cartesianProduct(tuple(fieldA=array(c,c,a,b,d,d)), fieldA, 
> productSort="fieldA asc"),
> cartesianProduct(tuple(fieldB=array(c,c,a,d,d)), fieldB, productSort="fieldB 
> asc"),
> on="fieldA=fieldB“
> )
> 
> 
> The result:
> 
> {
>  "result-set": {
>"docs": [
>  {
>"fieldA": "a"
>  },
>  {
>"EOF": true,
>"RESPONSE_TIME": 0
>  }
>]
>  }
> }
> 
> 
> I would have expected all the docs from the left stream with fieldA values a, 
> c, d
> and only the docs with fieldA == b missing.  Do I have a fundamental 
> misunderstanding?
> 
> 
> Best regards
> Christian Spitzlay
> 
> 



Re: Graph traversal: Bypass cycle detection?

2018-06-07 Thread Christian Spitzlay
Hi,


> On 07.06.2018 at 03:20, Joel Bernstein wrote:
> 
> Hi,
> 
> At this time cycle detection is built into the nodes expression and cannot
> be turned off. The nodes expression is really designed to do a traditional
> breadth first search through a graph where cycle detection is needed so you
> don't continually walk the same nodes.
> 
> Are you looking to do random walk analysis?
> I've been meaning to add a
> function that supports random walks on a graph that would not do cycle
> detection.


No, this is not about random walks. We have an application that knows
different types of entities and links between them. Both entities and links 
are indexed and we create additional documents to represent relations 
between the entities to prepare a network we can search on.
A regular walk with nodes() is part of that.

There is an issue in a situation where one of the entities in the original
system is linked to itself.  I haven't finished analysing the problem yet,
but I wondered whether there was an easy way to rule out cycle detection as
the cause.


Best regards,
Christian Spitzlay






> christian.spitz...@biologis.com> wrote:
> 
>> Hi,
>> 
>> is it possible to bypass the cycle detection so a traversal
>> can revisit nodes?
>> 
>> The documentation at
>> https://lucene.apache.org/solr/guide/7_3/graph-
>> traversal.html#cycle-detection
>> does not mention any and lists reasons why the cycle detection is in place.
>> But if I were willing to live with the consequences would it be possible?
>> 
>> 
>> Best regards
>> Christian Spitzlay
>> 
>> 



Re: Setting preferred replica for query/read

2018-06-07 Thread Ere Maijala

Hi,

What I did in SOLR-11982 was meant to be used with replica types. The 
idea is that you could have a set of NRT replicas used for indexing and 
a set of PULL replicas used for queries. That's the easiest way to split 
the work since PULL replicas never do indexing work, and then you can 
say in the queries that "shards.preference=replica.type:PULL" or have 
that as a default parameter in solrconfig. SOLR-8146 is not needed for 
this. I suppose now that SOLR-11982 is done, SOLR-8146 would only be 
needed to make it easier to set the preferred replica type etc.
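
For example, a minimal SolrJ sketch (assuming a Solr version with SOLR-11982,
i.e. 7.4 or later; the ZooKeeper address and the collection name
"mycollection" are placeholders):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PreferPullReplicas {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
            SolrQuery query = new SolrQuery("*:*");
            // Prefer PULL replicas; Solr falls back to other types if none are available.
            query.set("shards.preference", "replica.type:PULL");
            QueryResponse rsp = client.query("mycollection", query);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}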


SOLR-11982 also allows you to use replica location in node preference. 
The nodes to use could be deduced from the cluster state and then you 
could use shards.preference with replica.location. But that means the 
client has to know which replicas to prefer.


Regards,
Ere

Zheng Lin Edwin Yeo wrote on 4.6.2018 at 19.09:

Hi,

SOLR-8146 has not been updated since January last year, but I have just
commented it.

So we need both to be updated in order to achieve the full functionality of
setting preferred replica for query/read? Currently, is there a way to
achieve this by other means?

Regards,
Edwin

On 4 June 2018 at 19:43, Ere Maijala  wrote:


Hi,

Well, SOLR-11982 adds server-side support for part of what SOLR-8146 aims
to do (shards.preference=replica.location:[something]). It doesn't do
regular expressions or snitches at the moment, though it would be easy to
add. So, it looks to me like SOLR-8146 would need to be updated in this
regard.

--Ere


Zheng Lin Edwin Yeo wrote on 4.6.2018 at 12.45:


Hi,

Are there any similarities between these two requests in JIRA regarding
setting a preferred replica?

(SOLR-11982) Add support for preferReplicaTypes parameter

(SOLR-8146) Allowing SolrJ CloudSolrClient to have preferred replica for
query/read

I am looking at setting one of the replicas to be the preferred replica for
query/read, and another replica to be used for indexing.

I am using Solr 7.3.1 currently.

Regards,
Edwin



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Issues in Solr-7.3

2018-06-07 Thread tapan1707
Hello Shawn,
Thanks for the detailed explanation.
> That would depend on the specific issues that concern you.
Totally agree; most of the issues I saw on the mailing lists were quite
subjective and might not affect us. But I thought it would be better to
ask 7.3 users and the Solr team directly.

> https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=solr/CHANGES.txt;hb=refs/heads/branch_7x
Thanks for the link; glad to see the issue I solved listed under new
features, although most of the credit goes to David.





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Issues in Solr-7.3

2018-06-07 Thread Shawn Heisey

On 6/6/2018 7:38 PM, tapan1707 wrote:

We are planning to upgrade our Solr-6.4 to Solr-7.x. While considering the
appropriate minor version, I saw that there are many ongoing issues for
Solr-7.3 users on the mailing list.
Just wanted to get an expert opinion on whether it's *safe* to just upgrade
to 7.3 without worrying about creating (or adding) patches for too many
issues, or whether we should go for Solr-7.2 for the time being.


That would depend on the specific issues that concern you. A lot of 
issues only affect users with specific configurations. Issues affecting 
a majority of users with no possible workaround HAVE happened, but that 
sort of thing is not common.


The changelog for 7.3.0 lists 44 new features, 35 bug fixes, 5 
optimizations, and 45 other changes, none of which will be in 7.2.x.


My opinion -- run 7.3.1.

Any Solr release you choose will have bugs.  Sometimes many bugs. The 
idea is to find one where the bugs don't affect YOU.  If you wait for a 
more stable release, you may be waiting forever.


The 7.3.0 version was announced on April 4th.  The 7.3.1 version didn't 
hit until May 15th.  That release fixed 9 additional bugs beyond 7.3.0.  
Usually less time passes between X.Y.0 and X.Y.1; the longer gap here 
suggests there were no urgent showstoppers, so to me that's a very good 
sign for the overall stability of the 7.3.x line.


http://mail-archives.apache.org/mod_mbox/lucene-dev/201805.mbox/%3ccagov8j9ejvb-w8r7l9xhrwj12tlzlewfnsk5_0t8hgjtzfv...@mail.gmail.com%3E

I don't recall hearing about any showstopper problems in 7.3.x.  Usually 
a really bad problem will be mentioned loudly on at least one of the 
project mailing lists.


It is possible to see a list of issues (including bugs) where the fix 
will be in the next release of Solr.  Just look at the CHANGES.txt file 
for branch_7x.  It's the first section, which right now is 7.4.0.  You 
can see it here:


https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=solr/CHANGES.txt;hb=refs/heads/branch_7x

This also lists the changelogs for previous versions.  All in one 
place.  This file is also in every Solr download.


It has been nearly a month since the 7.3.1 release and so far, nobody 
has created a 7.3.2 section in the CHANGES.txt file found in 
branch_7_3.  This doesn't mean that 7.3.1 is perfect, just that nobody 
has found a big enough reason to make changes in that branch beyond 7.3.1.


Thanks,
Shawn