Multiple Queries using Spark and Solr

2017-03-25 Thread Matt Magnusson
Hello,
I'm interested in querying Solr as a Spark RDD.  Has anyone used the
Lucidworks spark-solr library (https://github.com/lucidworks/spark-solr)
to issue multiple queries?  I'd like to execute several queries and
combine the top-n results of each into one Spark RDD for further
analysis.  The examples I can find all use a single query as the
originating data source.  I'm looking for something similar to the
executor function in Solr streaming expressions, but driven from Spark.
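Roughly, here's the kind of thing I'm imagining, as an untested sketch
against spark-solr's DataFrame reader (the zkhost/collection/query option
names are from its README; the collection, queries, and top-n cutoff are
placeholders):

import org.apache.spark.sql.SparkSession

object MultiQuerySolr {
  def main(args: Array[String]): Unit = {
    // Assumes the spark-solr jar is on the classpath so format("solr") resolves.
    val spark = SparkSession.builder().appName("multi-query-solr").getOrCreate()

    // Each query becomes its own DataFrame via the spark-solr data source.
    val queries = Seq("content:cat", "content:dog")
    val frames = queries.map { q =>
      spark.read.format("solr")
        .option("zkhost", "localhost:9983") // placeholder ZooKeeper connect string
        .option("collection", "prod")       // placeholder collection name
        .option("query", q)
        .load()
        .limit(100) // crude per-query cap; not sure how to get score-ordered top-n here
    }

    // Concatenate the per-query frames, then drop down to an RDD for analysis.
    val combined = frames.reduce(_ union _)
    val rdd = combined.rdd
    println(rdd.count())
  }
}

Thanks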

Matt


Re: unable to get more throughput with more threads

2017-03-23 Thread Matt Magnusson
Out of curiosity, what is your index size? I'm trying to do something
similar to maximize throughput. I'm currently looking at streaming
expressions, which are showing some interesting results, and I'm finding
that the direct mass-query route seems to hit a wall performance-wise.
I'm also finding that about 10 threads seems to be the optimum number.
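
For reference, a minimal SolrJ harness along these lines is what I mean
by the direct mass-query route (sketch only; the URL and query are
placeholders, and per-request latency collection is elided):

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

object QueryLoadTest {
  def main(args: Array[String]): Unit = {
    val threads = 10
    val requestsPerThread = 1000
    // HttpSolrClient is thread-safe, so one instance is shared by all workers.
    val client = new HttpSolrClient.Builder("http://localhost:8983/solr/prod").build()
    val query = new SolrQuery("content:cat")

    val pool = Executors.newFixedThreadPool(threads)
    val start = System.nanoTime()
    (1 to threads).foreach { _ =>
      pool.submit(new Runnable {
        override def run(): Unit =
          (1 to requestsPerThread).foreach(_ => client.query(query))
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.HOURS)

    val secs = (System.nanoTime() - start) / 1e9
    println(f"${threads * requestsPerThread / secs}%.0f reqs/sec")
    client.close()
  }
}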

On Thu, Mar 23, 2017 at 8:10 PM, Suresh Pendap  wrote:
> Hi,
> I am new to the Solr search engine and I am trying to get some performance
> numbers for the maximum throughput of a Solr cluster of a given size.
> I am currently doing query-only load testing, in which I randomly fire a
> bunch of queries at the Solr cluster. I understand this is not an ideal
> workload, since ongoing ingestion and commits invalidate the Solr caches,
> so it is advisable to run the query load while documents are also being
> ingested.
>
> The Solr cluster was made up of 2 shards with 2 replicas each, so 4
> replicas in total were serving queries. The Solr nodes were running in an
> LXD container with 12 cores and 88 GB RAM.
> The heap size was 8g min and 8g max. All other Solr configuration was
> left at the defaults.
>
> The client node was running on an 8-core VM.
>
> I performed the test with 1, 10, and 50 client threads. I noticed that as
> I increased the number of threads, the query latency kept increasing
> drastically, which I was not expecting.
>
> Since my initial test was randomly picking queries from a file, I decided
> to keep things constant and ran a program that fired the same query again
> and again. Since it is the same query, all the documents should be in the
> cache and the query response time should be very fast. I was also
> expecting that with 10 or 50 client threads the query latency would not
> increase.
>
> The throughput increased only up to 10 client threads and then stayed the
> same at 50 and 100 threads, while the query latency kept increasing as I
> added threads.
> The query was returning only 2 documents.
>
> The table below summarizes the numbers that I was seeing with a single query.
>
> #Client Nodes  #Threads  99 pct latency  95 pct latency  Throughput    CPU utilization
> 1              1         9 ms            7 ms            180 reqs/sec  8%
> 1              10        400 ms          360 ms          360 reqs/sec  10%
>
> (Server configuration for both runs: heap size ms=8g, mx=8g, default Solr
> configuration.)
>
> I also ran the client program on the Solr server node itself in order to
> rule out the network latency factor. On the server node the response time
> was also higher with 10 threads, but the amplification was smaller.
>
> I am getting the impression that my query requests are being queued up,
> probably limited by some thread pool size on the server side. However, I
> saw that the default jetty.xml has a thread pool with a min size of 10
> and a max of 10000.
>
> Is there any other internal Solr thread pool configuration that might be
> limiting the query response time?
>
> I wanted to check with the community: is what I am seeing abnormal
> behavior, and if so, what could be the issue? Is there any configuration
> I can tweak to get better query response times under more load?
>
> Regards
> Suresh
>


Concatenating streams in streaming expressions

2017-03-22 Thread Matt Magnusson
Hello,

Does anyone know of a way to concatenate source streams?

For example, if I have two searches:
search(prod,q="content:cat",fl="id,score",sort="score desc")
search(prod,q="content:dog",fl="id,score",sort="score desc")


Is there a way to have these come out as one stream? I've been trying to
use the executor function by storing these searches in an expr_s field.
However, I can't figure out how to merge their output back into one
stream.  If I run the following,

executor(search(queries, q="*:*", fl="id, expr_s", sort="id asc",
qt="/export"))

it gives this output:

{
  "result-set": {
    "docs": [
      {
        "EOF": true,
        "RESPONSE_TIME": 32
      }
    ]
  }
}

So the underlying tuples are not returned.

I want the return to be like this, with the results of all the individual
searches combined into one stream:

{
  "result-set": {
    "docs": [
      {
        "score": 12.340755,
        "id": "9a49d7d6f5b3cc597f8e55e66bb6d96438b670d1"
      },
      {
        "score": 11.879734,
        "id": "887d349fc9390a87ac7fd4209af59af61531ad06"
      },
      {
        "score": 11.82577,
        "id": "c91971049ab95cb32dc2d0f8d616aad25ee04bb7"
      },
      ...

I know the searches themselves run correctly under the executor function,
because I can have them save their output back to Solr if I also include
the update and commit functions in the expr_s field of my source queries
collection.
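
The closest decorator I've spotted is merge(), but as I read the docs it
interleaves streams on a shared sort rather than simply chaining them,
and I'm not sure it accepts score as the merge key. Something like:

merge(
  search(prod, q="content:cat", fl="id,score", sort="score desc"),
  search(prod, q="content:dog", fl="id,score", sort="score desc"),
  on="score desc")

Thanks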

Matt