Spark-Solr connector

2019-07-11 Thread Dwane Hall
Hey guys,

I’ve just started looking at the excellent spark-solr project (thanks to Tim
Potter, Kiran Chitturi, Kevin Risden and Jason Gerlowski for their efforts on
this project; it looks really neat!).

I’m only at the initial stages of my exploration, but I’m running into a class
not found exception when connecting to a secure Solr Cloud instance (basic
auth, SSL). Everything works as expected on a non-secure Solr Cloud
instance.

The process looks pretty straightforward according to the doco, so I’m wondering
if I’m missing anything obvious or if I need to bring any extra classes onto the
classpath when using this project.

Any advice would be greatly appreciated.

Thanks,

Dwane

Environments tried:

Solr Cloud 7.6 and 8.1.1
SSL, Basic Auth Plugin, Rules Based Authorisation Plugin enabled
Spark v 2.4.3
Spark-Solr build spark-solr-3.7.0-20190619.153847-16-shaded.jar

./spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[*] \
  --jars spark-solr-3.7.0-20190619.153847-16-shaded.jar \
  --conf 'spark.driver.extraJavaOptions=-Dbasicauth=solr:SolrRocks'

val options = Map(
  "collection" -> "My_Collection",
  "zkhost" -> "zkn1:2181,zkn2:2181,zkn3:2181/solr/SPARKTEST"
)

val df = spark.read.format("solr").options(options).load

com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: org/eclipse/jetty/client/api/Authentication
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
  at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
  at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
  at com.lucidworks.spark.util.SolrSupport$.getCachedCloudClient(SolrSupport.scala:244)
  at com.lucidworks.spark.util.SolrSupport$.getSolrBaseUrl(SolrSupport.scala:248)
  at com.lucidworks.spark.SolrRelation.dynamicSuffixes$lzycompute(SolrRelation.scala:100)
  at com.lucidworks.spark.SolrRelation.dynamicSuffixes(SolrRelation.scala:98)
  at com.lucidworks.spark.SolrRelation.getBaseSchemaFromConfig(SolrRelation.scala:299)
  at com.lucidworks.spark.SolrRelation.querySchema$lzycompute(SolrRelation.scala:239)
  at com.lucidworks.spark.SolrRelation.querySchema(SolrRelation.scala:108)
  at com.lucidworks.spark.SolrRelation.schema(SolrRelation.scala:428)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:403)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
  ... 49 elided
Caused by: java.lang.NoClassDefFoundError: org/eclipse/jetty/client/api/Authentication
  at com.lucidworks.spark.util.SolrSupport$.getSolrCloudClient(SolrSupport.scala:214)
  at com.lucidworks.spark.util.SolrSupport$.getNewSolrCloudClient(SolrSupport.scala:240)
  at com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:38)
  at com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:36)
  at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
  ... 64 more
Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.client.api.Authentication
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 72 more
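
One quick way to narrow down a NoClassDefFoundError like the one above is to check whether the jar passed with --jars actually contains the missing class. The following is only a diagnostic sketch in Python, not part of spark-solr; the helper names `jar_has_class` and `find_relocated` are made up for illustration. A shaded jar may also have relocated the class under a different package prefix, which is what the suffix search is for.

```python
import zipfile


def jar_has_class(jar_path, class_name):
    """Return True if the jar holds an entry for the given dotted class name."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


def find_relocated(jar_path, path_suffix):
    """List jar entries ending with the given path suffix (e.g. a relocated copy)."""
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist() if name.endswith(path_suffix)]
```

For the case in this message, the call would be `jar_has_class("spark-solr-3.7.0-20190619.153847-16-shaded.jar", "org.eclipse.jetty.client.api.Authentication")`. A False result would suggest the class has to come from another jar on the classpath.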


Re: QTime

2019-07-11 Thread Erick Erickson
True, although there’s still network time that can’t be included.

> On Jul 11, 2019, at 5:55 PM, Edward Ribeiro  wrote:
> 
> Wouldn't be the case of using =0 parameter on those requests? Wdyt?
> 
> Edward
> 
> On Thu, Jul 11, 2019, 14:24, Erick Erickson wrote:
> 
>> Not only does Qtime not include network latency, it also doesn't include
>> the time it takes to assemble the docs for return, which can be lengthy
>> when rows is large..
>> 
>> On Wed, Jul 10, 2019, 14:39 Shawn Heisey  wrote:
>> 
>>> On 7/10/2019 3:17 PM, Lucky Sharma wrote:
>>>> I am seeing one very weird behaviour of QTime of SOLR.
>>>>
>>>> Scenario is :
>>>> When I am hitting the Solr Cloud Instance, situated at a DC with my local
>>>> machine while load test I was seeing 400ms Qtime response and 1sec Http
>>>> Response time.
>>> 
>>> How much data was in the response?  If it's large, I can see it taking
>>> that long to transfer.  This is even more likely if there is a lot of
>>> network latency in the network between the client and the server.
>>> 
>>>> While I am trying to do the same process within the same DC location, I am
>>>> getting 100 ms Solr QTime and 130ms Response Time.
>>>>
>>>> Does QTime counts network latency too??
>>> 
>>> There's no way Solr can include the time to send the response over the
>>> network in QTime.  The value is calculated and put into the response
>>> before Solr starts sending.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>> 



Re: QTime

2019-07-11 Thread Edward Ribeiro
Wouldn't be the case of using =0 parameter on those requests? Wdyt?

Edward

On Thu, Jul 11, 2019, 14:24, Erick Erickson wrote:

> Not only does Qtime not include network latency, it also doesn't include
> the time it takes to assemble the docs for return, which can be lengthy
> when rows is large..
>
> On Wed, Jul 10, 2019, 14:39 Shawn Heisey  wrote:
>
> > On 7/10/2019 3:17 PM, Lucky Sharma wrote:
> > > I am seeing one very weird behaviour of QTime of SOLR.
> > >
> > > Scenario is :
> > > When I am hitting the Solr Cloud Instance, situated at a DC with my local
> > > machine while load test I was seeing 400ms Qtime response and 1sec Http
> > > Response time.
> >
> > How much data was in the response?  If it's large, I can see it taking
> > that long to transfer.  This is even more likely if there is a lot of
> > network latency in the network between the client and the server.
> >
> > > While I am trying to do the same process within the same DC location, I am
> > > getting 100 ms Solr QTime and 130ms Response Time.
> > >
> > > Does QTime counts network latency too??
> >
> > There's no way Solr can include the time to send the response over the
> > network in QTime.  The value is calculated and put into the response
> > before Solr starts sending.
> >
> > Thanks,
> > Shawn
> >
>


Get cluster information using JMX

2019-07-11 Thread sharayu shenoy
Hi,

I am running SolrCloud version 6.6.6 with JMX enabled, and I am
interested in finding out, via JMX, which ZooKeeper a node is using.
For higher versions of Solr, I am able to get this information from the
JVM system properties, but I cannot find it for version 6.6.6.

Is there a common ID across all nodes of a solr cluster using which I can
know which cluster a node belongs to?

Thanks,
S


Function Query with multi-value field

2019-07-11 Thread Wei
Hi,

I have a question regarding function query that operates on multi-value
fields. For the following field:

[field definition not preserved in the archive]

Each value is a hex string representation of an RGB value. For example, there
are 3 values indexed:

#FF00FF - C1
#EE82EE - C2
#DA70D6 - C3

How would I write a function query that operates on all values of the
field?  Given color S in query, how to calculate the similarities between S
and C1/C2/C3 and find which one is the closest?
I checked https://lucene.apache.org/solr/guide/6_6/function-queries.html but
didn't see an example.

Thanks,
Wei
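
To make the question concrete, here is a small client-side sketch of one possible meaning of "closest". It assumes plain Euclidean distance in RGB space, which is only one of several reasonable similarity measures. It is ordinary Python, not a Solr function query, and the helper names are made up for illustration.

```python
def hex_to_rgb(color):
    """Convert a '#RRGGBB' string to an (r, g, b) tuple of ints."""
    value = color.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def squared_distance(a, b):
    """Squared Euclidean distance between two '#RRGGBB' colors."""
    return sum((x - y) ** 2 for x, y in zip(hex_to_rgb(a), hex_to_rgb(b)))


def closest_color(query, candidates):
    """Return the candidate with the smallest RGB distance to the query color."""
    return min(candidates, key=lambda c: squared_distance(query, c))


values = ["#FF00FF", "#EE82EE", "#DA70D6"]  # C1, C2, C3 from the message
print(closest_color("#DD70D0", values))  # prints #DA70D6
```

The query color `#DD70D0` is an arbitrary example for S. The open question in the thread is how to express the same per-value comparison inside Solr against a multi-valued field.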


SSL certificate automated rotation/renewal?

2019-07-11 Thread Jamie Gruener
Folks,

I've done plenty of searching, but haven't found anything addressing this 
issue. I have an existing three-server SolrCloud cluster in production. We need to 
enable SSL/TLS encryption, both for clients and between the 3 servers. I've 
read through the documentation, and while I've not done it yet, it all makes 
sense.

Related, we're also using Consul and working up the infrastructure to use 
Consul Connect with sidecar proxies for client-to-service end-to-end TLS 
encryption. That's great because it automatically handles SSL/TLS certificate 
rotation without any manual interaction. But that doesn't help me with the 
intra-cluster SolrCloud communication.

So here's my question. How do folks handle SSL/TLS certificate rotation on 
SolrCloud instances in production? Update the certificate and restart solr on 
each box, one at a time? Just use extra long-lasting certificates? Or is there 
another way, like using an external truststore/keystore in Vault? I'm assuming 
that wouldn't work because you have to restart Solr to get the new cert, but 
maybe there's something I don't know?

Any thoughts welcome,

--Jamie



Re: QTime

2019-07-11 Thread Erick Erickson
Not only does Qtime not include network latency, it also doesn't include
the time it takes to assemble the docs for return, which can be lengthy
when rows is large.

On Wed, Jul 10, 2019, 14:39 Shawn Heisey  wrote:

> On 7/10/2019 3:17 PM, Lucky Sharma wrote:
> > I am seeing one very weird behaviour of QTime of SOLR.
> >
> > Scenario is :
> > When I am hitting the Solr Cloud Instance, situated at a DC with my local
> > machine while load test I was seeing 400ms Qtime response and 1sec Http
> > Response time.
>
> How much data was in the response?  If it's large, I can see it taking
> that long to transfer.  This is even more likely if there is a lot of
> network latency in the network between the client and the server.
>
> > While I am trying to do the same process within the same DC location, I am
> > getting 100 ms Solr QTime and 130ms Response Time.
> >
> > Does QTime counts network latency too??
>
> There's no way Solr can include the time to send the response over the
> network in QTime.  The value is calculated and put into the response
> before Solr starts sending.
>
> Thanks,
> Shawn
>
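
The point made in this thread can be illustrated by subtracting QTime from the wall-clock time measured at the client. The sketch below is in Python and assumes the response was requested as JSON and parsed into a dict with the usual `responseHeader` carrying `QTime` in milliseconds. The function name is made up, and the numbers are the ones reported in the original question.

```python
def time_outside_qtime(elapsed_ms, response):
    """Return (qtime_ms, other_ms) for one request.

    elapsed_ms is the wall-clock time measured by the client; response is the
    parsed JSON body, whose responseHeader carries QTime in milliseconds.
    """
    qtime_ms = response["responseHeader"]["QTime"]
    return qtime_ms, elapsed_ms - qtime_ms


# Figures reported in the thread: remote client vs. client in the same DC.
remote = time_outside_qtime(1000, {"responseHeader": {"status": 0, "QTime": 400}})
local = time_outside_qtime(130, {"responseHeader": {"status": 0, "QTime": 100}})
print(remote)  # (400, 600)
print(local)   # (100, 30)
```

The second number in each pair is everything QTime does not cover: assembling the documents for return, writing the response, and the network transfer.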


Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

2019-07-11 Thread Shawn Heisey

On 7/11/2019 9:04 AM, Joseph_Tucker wrote:

> Looks like I've managed to get some semblance of this working.
> The indexes are much faster, but the RAM usage by SolrJ is quite high. Is it
> normal to see around 6GB of RAM usage?
> (My test is indexing 250,000 records with the 50 child entities)


Whatever max heap value you tell Java it can have, it will eventually 
use.  That's how Java's memory model works.  You can try lowering the 
max heap, to see whether it actually needs that much memory.  If the 
program really does require all the heap it's allowed, reducing the max 
heap size will cause the program to throw errors and probably behave in 
an unpredictable manner.


Many JDBC drivers will load the entire result set from a database query 
into memory by default, which can explain very high memory use.  You 
would need to research your specific JDBC driver to see if it does this, 
and if so, learn how to have it stream the results instead of storing them.


Thanks,
Shawn
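
Shawn's point about buffering a whole result set versus streaming it can be sketched with Python's DB-API, used here only as an analogy: this is not JDBC or SolrJ, and whether a given JDBC driver streams is driver-specific. The table, its columns, and the `iter_batches` helper are hypothetical.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of rows, fetching at most batch_size rows at a time."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows


# Hypothetical table standing in for the real source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO records (id, title) VALUES (?, ?)",
    [(i, "title %d" % i) for i in range(1, 11)],
)

cursor = conn.execute("SELECT id, title FROM records ORDER BY id")
batch_sizes = []
for batch in iter_batches(cursor, 4):
    # In a real indexer, this is where each batch would be sent to Solr
    # and then dropped, so only one batch is held in memory at a time.
    batch_sizes.append(len(batch))

print(batch_sizes)  # [4, 4, 2]
```

Collecting every row into one large map before indexing, as described in the quoted message, keeps the whole data set on the heap at once. Processing in bounded batches keeps peak memory roughly proportional to the batch size.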


Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

2019-07-11 Thread Joseph_Tucker
Thanks for the help.

Looks like I've managed to get some semblance of this working. 
The indexes are much faster, but the RAM usage by SolrJ is quite high. Is it
normal to see around 6GB of RAM usage?
(My test is indexing 250,000 records with the 50 child entities)

In short, I'm running through a loop against a DB 50 times (to mimic 50
entities) and adding the results to a Map, then using that map to loop
through and commit values to Solr.


Jörn Franke wrote
> Ideally you use scripts that can use JVM/Java - in this way you can always
> use the latest SolrJ client library but also other libraries that are
> relevant (eg Tika for unstructured content).
> This does not have to be Java directly but can be based also on Scala or
> JVM script languages, such as Groovy.
> 
> There are also wrappers for Python etc, but those may not always leverage
> the latest version of the library.





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Sudden I/O spike

2019-07-11 Thread Shawn Heisey

On 6/14/2019 5:53 AM, Sripra deep wrote:

>   Any help would be appreciated, I am using solr 7.1.0, Suddenly we got a
> high I/O even with a very low request rate and the core went down. Did
> anybody experience the same or root cause of this.
>
> Below are the log error msg that we got from solr.log
>
> org.eclipse.jetty.io.EofException
>
> Caused by: java.io.IOException: Broken pipe


This exception means that the client making the http request 
disconnected before Solr was able to respond.  When Solr finishes 
processing and tries to respond, it finds that it can't, because the 
connection is gone.  The client probably has a timeout that it reached.


The other reply you got mentioned segment merging.  That would cause an 
I/O spike, but queries would still execute during the merge, so I think 
it's more likely that what's happening is your OS is swapping memory 
from running programs out to disk, which can cause those programs to 
stop responding until that memory is swapped back in.  A system that is 
using swap or paging space will typically run VERY slowly.


If that's what is happening, the fix for that problem is to either 
adjust what's running on the server so it needs less memory, or to add 
memory to the server.


Thanks,
Shawn
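
A rough way to test the swapping theory on a Linux host is to compare SwapTotal and SwapFree in /proc/meminfo. The Python sketch below parses text in that format; the function name is made up and the sample values are invented for illustration, not taken from the affected server.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB) from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


# Invented sample values in the /proc/meminfo layout.
sample = """MemTotal:       16384000 kB
MemFree:          204800 kB
SwapTotal:       2097148 kB
SwapFree:        1048574 kB
"""
print(swap_used_kb(sample))  # 1048574
```

On the server itself the input would be `open("/proc/meminfo").read()`. A large and growing value while Solr is unresponsive would be consistent with the explanation above, though it does not prove it on its own.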


Re: Solr Sudden I/O spike

2019-07-11 Thread kshitij tyagi
Hi,

Can you check and report whether any indexing is going on in the core, or
whether a merge or an optimise has been triggered on it? There might be
high I/O if background merging is triggered while serving query
requests.

Regards,
kshitij

On Fri, Jun 14, 2019 at 5:23 PM Sripra deep 
wrote:

> Hi,
>   Any help would be appreciated, I am using solr 7.1.0, Suddenly we got a
> high I/O even with a very low request rate and the core went down. Did
> anybody experience the same or root cause of this.
>
> Below are the log error msg that we got from solr.log
>
> 2019-06-06 10:37:14.490 INFO  (qtp761960786-8618) [   ] o.a.s.s.HttpSolrCall Unable to write response, client closed connection or we are shutting down
> org.eclipse.jetty.io.EofException
> at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:199)
> at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:420)
> at org.eclipse.jetty.io.WriteFlusher.completeWrite(WriteFlusher.java:375)
> at org.eclipse.jetty.io.SelectChannelEndPoint$3.run(SelectChannelEndPoint.java:107)
> at org.eclipse.jetty.io.SelectChannelEndPoint.onSelected(SelectChannelEndPoint.java:193)
> at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.processSelected(ManagedSelector.java:283)
> at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:181)
> at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceExecuteConsume(ExecuteProduceConsume.java:169)
> at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:145)
> at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:51)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:177)
> ... 12 more
>
> Thanks,
> Sripradeep P
>


Re: Indexing nested document: Solr 8.1.1

2019-07-11 Thread sreejith.variyath
Hi, I was using the URL
*http://localhost:8983/solr/my-core/update/json/docs*, which was wrong. I
should have used *http://localhost:8983/solr/my-core/update*, and with that it worked.

Thanks
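
As a sketch of the working request, the nested document from the original question can be posted as a JSON array to the core's /update endpoint. The Python below only builds the request; the `commit=true` parameter and the helper name are assumptions added for illustration, and the actual send is left commented out so the code runs without a Solr instance.

```python
import json
import urllib.request

# Nested document from the original question.
doc = {
    "id": "1",
    "title": "Solr adds block join support",
    "content_type": "parentDocument",
    "_childDocuments_": [
        {"id": "2", "comments": "SolrCloud supports it too!"},
        {"id": "3", "comments": "SolrCloud supports it too!3"},
    ],
}


def build_update_request(base_url, docs):
    """Build a POST to <core>/update carrying a JSON array of documents."""
    return urllib.request.Request(
        base_url + "/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request("http://localhost:8983/solr/my-core", [doc])
# Sending is left commented out so the sketch runs without a Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```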





Indexing nested document: Solr 8.1.1

2019-07-11 Thread Sreejith Variyath
Hi, I am trying to index a sample nested document in Solr, but I am getting
the error

"ERROR: [doc=1] multiple values encountered for non multiValued field
_childDocuments_.id: [2, 3]"

I am using ClassicIndexSchemaFactory, so I have defined all the fields in
schema.xml. Below are my field settings in schema.xml.

[field definitions not preserved in the archive]

Below is the JSON document which I am trying to index.

{
  "id": "1",
  "title": "Solr adds block join support",
  "content_type": "parentDocument",
  "_childDocuments_": [
    {
      "id": "2",
      "comments": "SolrCloud supports it too!"
    },
    {
      "id": "3",
      "comments": "SolrCloud supports it too!3"
    }
  ]
}

Could someone please help me figure out the issue? Do I need to make
the inner fields *multiValued="true"*?

-- 
Best Regards,
*Sreejith *
