Re: solr 8.6.3 and noggit

2020-11-20 Thread Susmit Shukla
Thanks Mike,
That explains it; just removing the noggit-0.6 jar should fix it. The error
depended on class-loading order, so it didn't show up on Mac but was a
problem on Linux.
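A quick way to confirm which jar a class is actually loaded from on a given
machine is shown below (a small diagnostic sketch; the class name is the one
from the stack trace further down, everything else is illustrative):

public class WhichJar {
  public static void main(String[] args) {
    // Prints the jar that supplied org.noggit.ObjectBuilder, which makes
    // classloading-order differences between machines visible.
    System.out.println(org.noggit.ObjectBuilder.class
        .getProtectionDomain().getCodeSource().getLocation());
  }
}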



On Fri, Nov 20, 2020 at 2:54 PM Mike Drob  wrote:

> Noggit code was forked into Solr, see SOLR-13427
>
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.6.3/solr/solrj/src/java/org/noggit/ObjectBuilder.java
>
> It looks like that particular method was added in 8.4 via
> https://issues.apache.org/jira/browse/SOLR-13824
>
> Is it possible you're using an older SolrJ against a newer Solr server (or
> vice versa)?
>
> Mike
>
> On Fri, Nov 20, 2020 at 2:25 PM Susmit Shukla 
> wrote:
>
> > Hi,
> > I got this error using streaming with SolrJ 8.6.3. Does it use noggit-0.8?
> > It was not mentioned in the dependencies:
> > https://github.com/apache/lucene-solr/blob/branch_8_6/solr/solrj/ivy.xml
> >
> > Caused by: java.lang.NoSuchMethodError: 'java.lang.Object
> > org.noggit.ObjectBuilder.getValStrict()'
> >
> > at org.apache.solr.common.util.Utils.fromJSON(Utils.java:284)
> > ~[solr-solrj-8.6.3.jar:8.6.3 e001c2221812a0ba9e9378855040ce72f93eced4 -
> > jasongerlowski - 2020-10-03 18:12:06]
> >
>


solr 8.6.3 and noggit

2020-11-20 Thread Susmit Shukla
Hi,
I got this error using streaming with SolrJ 8.6.3. Does it use noggit-0.8?
It was not mentioned in the dependencies:
https://github.com/apache/lucene-solr/blob/branch_8_6/solr/solrj/ivy.xml

Caused by: java.lang.NoSuchMethodError: 'java.lang.Object
org.noggit.ObjectBuilder.getValStrict()'

at org.apache.solr.common.util.Utils.fromJSON(Utils.java:284)
~[solr-solrj-8.6.3.jar:8.6.3 e001c2221812a0ba9e9378855040ce72f93eced4 -
jasongerlowski - 2020-10-03 18:12:06]


Gather Nodes Streaming

2019-03-20 Thread Susmit Shukla
Hi,

I am trying to use the Solr streaming 'gatherNodes' function to extract an
email graph based on the 'from' and 'to' fields.
It requires the 'to' field to be a single-valued field with docValues enabled,
since it is used internally for sorting and unique streams.

The 'to' field can contain multiple email addresses, each being a node.
How can multiple comma-separated email addresses from the 'to' field be mapped
to separate graph nodes?

Thanks





Re: Streaming and large resultsets

2017-11-11 Thread Susmit Shukla
Hi Lanny,

For long-running streaming queries with many shards and huge result sets,
SolrJ's default settings for HTTP max connections / connections per host may
not be enough. If you are using the worker collection (/stream), it depends on
SolrClientCache dispensing HTTP clients with the default limits. It could be
useful to turn on debug logging and check.
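A rough sketch of raising those limits, assuming the SolrClientCache
constructor that accepts a pre-built HttpClient (mentioned elsewhere in this
archive as available on newer branches); the numbers are illustrative:

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.solr.client.solrj.io.SolrClientCache;
import org.apache.solr.client.solrj.io.stream.StreamContext;

public class StreamingClientSetup {
  public static StreamContext buildContext() {
    // Raise the HttpClient limits, which can be too low for long-running
    // streams that fan out to many shards.
    CloseableHttpClient httpClient = HttpClientBuilder.create()
        .setMaxConnTotal(500)       // total connections across all hosts
        .setMaxConnPerRoute(100)    // connections per Solr node
        .build();
    StreamContext context = new StreamContext();
    context.setSolrClientCache(new SolrClientCache(httpClient));
    return context;                 // set this context on the streams before open()
  }
}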

Thanks,
Susmit

On Thu, Nov 9, 2017 at 8:35 PM, Lanny Ripple  wrote:

> First, Joel, thanks for your help on this.
>
> 1) I have to admit we really haven't played with a lot of system tuning
> recently (before DocValues for sure).   We'll go through another tuning
> round.
>
> 2) At the time I ran these numbers this morning we were not indexing.  We
> build this collection once a month and then client jobs can update it.  I
> was watching our job queue and there were no jobs running at that time.
> It's possible someone else was querying against other collections but they
> wouldn't have been updating this collection.
>
> 3) I'll try /export on each node.  We're pretty cookie-cutter with all
> nodes being the same and configuration controlled with puppet.  We collect
> system metrics to a Graphite display panel and no host looks out of sorts
> relative to the others.  That said I wouldn't be surprised if a node was
> out of whack.
>
> Thanks again.
>   -ljr
>
> On Thu, Nov 9, 2017 at 2:34 PM Joel Bernstein  wrote:
>
> > In my experience this should be very fast:
> >
> >  search(graph-october,
> > q="outs:tokenA",
> > fl="id,token",
> > sort="id asc",
> > qt="/export",
> > wt=javabin)
> >
> >
> > When the DocValues cache is statically warmed for the two output fields I
> > would see somewhere around 500,000 docs per second exported from a single
> > node.
> >
> > You have sixteen shards which would give you 16 times the throughput. But
> > of course the data is being sent back through the single aggregator node
> > so your throughput is only as fast as the aggregator node can process the
> > results.
> >
> > This does not explain the slowness that you are seeing. I see a couple of
> > possible reasons:
> >
> > 1) The memory on the system is not tuned optimally. You allocated a large
> > amount of memory to the heap and are not providing enough memory to OS
> > filesystem. Lucene DocValues use the OS filesystem cache for the
> DocValues
> > caches. So I would bump down the size of heap considerably.
> >
> > 2) Are you indexing while querying at all? If you are you would need to
> be
> > statically warming the DocValues caches for the id field which is used
> for
> > sorting. Following each commit there is a top level docvalues cache that
> is
> > rebuilt for sorting on string fields. If you use a static warming query
> it
> > will warm the cache before making the new searcher live for searches. I
> > would also pause indexing if possible and run queries only to see how it
> > runs without indexing.
> >
> > 3) Try running a query directly to /export handler on each node. Possibly
> > one of your nodes is slow for some reason and that is causing the entire
> > query to respond slowly.
> >
> >
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Nov 9, 2017 at 2:22 PM, Lanny Ripple 
> wrote:
> >
> > > Happy to do so.  I am testing streams for the first time so we don't
> have
> > > any 5.x experience.  The collection I'm testing was loaded after going
> to
> > > 6.6.1 and fixing up the solrconfig for lucene_version and removing the
> > > /export clause.  The indexes run 57G per replica.  We are using 64G
> hosts
> > > with 48G heaps using G1GC but this isn't our only large collection.
> This
> > > morning as I'm running these our cluster is quiet.  I realize some of
> the
> > > performance we are seeing is going to be our data size so not expecting
> > any
> > > silver bullets.
> > >
> > > We are storing 858M documents that are basically
> > >
> > > id: String
> > > token: String
> > > outs: String[]
> > > outsCount: Int
> > >
> > > All stored=true, docvalues=true.
> > >
> > > The `outs` reference a select number of tokens (1.26M).  Here are
> current
> > > percentiles of our outsCount
> > >
> > > `outsCount`
> > > 50%   12
> > > 85%  127
> > > 98%  937
> > > 99.9% 16,284
> > >
> > > I'll display the /stream query but I'm setting up the same thing in
> > solrj.
> > > I'm going to name our small result set "tokenA" and our large one
> > "tokenV".
> > >
> > >   search(graph-october,
> > > q="outs:tokenA",
> > > fl="id,token",
> > > sort="id asc",
> > > qt="/export",
> > > wt=javabin)
> > >
> > > I've placed this in file /tmp/expr and invoke with
> > >
> > >   curl -sSN -m 3600 --data-urlencode expr@/tmp/expr
> > > http://host/solr/graph-october/stream
> > >
> > > The large resultset query replaces "tokenA" with "tokenV".
> > >
> > > My /select query is
> > >
> > >   curl -sSN -m 3600 -d wt=csv -d rows=1 -d 

Re: deep paging in parallel sql

2017-09-07 Thread Susmit Shukla
You could use a filter clause to create a custom cursor, since the results
are sorted. I had used that approach with a raw CloudSolrStream, though not
with Parallel SQL.
This would be useful:
https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
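For illustration, a rough sketch of that filter-clause cursor with a raw
CloudSolrStream (Solr 6.x-era API; the collection, field, and ZooKeeper
address are made up):

import org.apache.solr.client.solrj.io.SolrClientCache;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.client.solrj.io.stream.StreamContext;
import org.apache.solr.common.params.ModifiableSolrParams;

public class FilterCursorExample {
  public static void main(String[] args) throws Exception {
    String lastId = "doc-0999";   // last sort value read in the previous batch

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "*:*");
    params.set("fq", "id:{" + lastId + " TO *]");   // resume strictly after lastId
    params.set("fl", "id,field1");
    params.set("sort", "id asc");
    params.set("qt", "/export");

    SolrClientCache cache = new SolrClientCache();
    StreamContext context = new StreamContext();
    context.setSolrClientCache(cache);

    CloudSolrStream stream = new CloudSolrStream("zk1:2181", "mycollection", params);
    stream.setStreamContext(context);
    try {
      stream.open();
      for (Tuple t = stream.read(); !t.EOF; t = stream.read()) {
        lastId = t.getString("id");   // remember the cursor for the next batch
      }
    } finally {
      stream.close();
      cache.close();
    }
  }
}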

Thanks,
Susmit

On Wed, Sep 6, 2017 at 10:45 PM, Imran Rajjad  wrote:

> My only concern is the performance as the cursor moves forward in a
> result set with approximately 2 billion records.
>
> Regards,
> Imran
>
> Sent from Mail for Windows 10
>
> From: Joel Bernstein
> Sent: Wednesday, September 6, 2017 7:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: deep paging in parallel sql
>
> Parallel SQL supports unlimited SELECT statements which return the entire
> result set. The documentation discusses the differences between the limited
> and unlimited SELECT statements. Other than the LIMIT clause there is not
> yet support for paging.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Sep 6, 2017 at 9:11 AM, Imran Rajjad  wrote:
>
> > Dear list,
> >
> > Is it possible to enable deep paging when querying data through Parallel
> > SQL?
> >
> > Regards,
> > Imran
> >
> > Sent from Mail for Windows 10
> >
> >
>
>


Re: solr /export handler - behavior during close()

2017-06-27 Thread Susmit Shukla
Hi Joel,

I was on the Solr 6.3 branch. I see the deprecated HttpClient methods are all
fixed in master.
I had forgotten to mention that I used a custom SolrClientCache to get higher
limits for the maxConnectionPerHost setting; that is why I saw a difference in
behavior. SolrClientCache also looks configurable with a new constructor on the
master branch.

I guess it is all good going forward on master.

Thanks,
Susmit

On Tue, Jun 27, 2017 at 10:14 AM, Joel Bernstein <joels...@gmail.com> wrote:

> Ok, I see where it's not setting the stream context. This needs to be fixed.
>
> I'm curious about where you're seeing deprecated methods in the
> HttpClientUtil? I was reviewing the master version of HttpClientUtil and
> didn't see any deprecations in my IDE.
>
> I'm wondering if you're using an older version of HttpClientUtil than I
> used when I was testing SOLR-10698?
>
> You also mentioned that the SolrStream and the SolrClientCache were using
> the same approach to create the client. In that case changing the
> ParallelStream to set the streamContext shouldn't have any effect on the
> close() issue.
>
>
>
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sun, Jun 25, 2017 at 10:48 AM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
> > Hi Joel,
> >
> > Looked at the fix for SOLR-10698, there could be 2 potential issues
> >
> > - Parallel Stream does not set stream context on newly created
> SolrStreams
> > in open() method.
> >
> > - This results in creation of new uncached HttpSolrClient in open()
> method
> > of SolrStream. This client is created using deprecated methods of http
> > client library (HttpClientUtil.createClient) and behaves differently on
> > close() than the one created using HttpClientBuilder API. SolrClientCache
> > too uses the same deprecated API
> >
> > This test case shows the problem
> >
> > ParallelStream ps = new parallelStream(tupleStream,...)
> >
> > while(true){
> >
> > read();
> >
> > break after 2 iterations
> >
> > }
> >
> > ps.close()
> >
> > //close() reads through the end of tupleStream.
> >
> > I tried with an HttpClient created by
> > org.apache.http.impl.client.HttpClientBuilder.create() and close() is
> > working for that.
> >
> >
> > Thanks,
> >
> > Susmit
> >
> > On Wed, May 17, 2017 at 7:33 AM, Susmit Shukla <shukla.sus...@gmail.com>
> > wrote:
> >
> > > Thanks Joel, will try that.
> > > Binary response would be more performant.
> > > I observed the server sends responses in 32 kb chunks and the client
> > reads
> > > it with 8 kb buffer on inputstream. I don't know if changing that can
> > > impact anything on performance. Even if buffer size is increased on
> > > httpclient, it can't override the hardcoded 8kb buffer on
> > > sun.nio.cs.StreamDecoder
> > >
> > > Thanks,
> > > Susmit
> > >
> > > On Wed, May 17, 2017 at 5:49 AM, Joel Bernstein <joels...@gmail.com>
> > > wrote:
> > >
> > >> Susmit,
> > >>
> > >> You could wrap a LimitStream around the outside of all the relational
> > >> algebra. For example:
> > >>
> > >> parallel(limit((intersect(intersect(search, search), union(search,
> > >> search)
> > >>
> > >> In this scenario the limit would happen on the workers.
> > >>
> > >> As far as the worker/replica ratio. This will depend on how heavy the
> > >> export is. If it's a light export, small number of fields, mostly
> > numeric,
> > >> simple sort params, then I've seen a ratio of 5 (workers) to 1
> (replica)
> > >> work well. This will basically saturate the CPU on the replica. But
> > >> heavier
> > >> exports will saturate the replicas with fewer workers.
> > >>
> > >> Also I tend to use Direct DocValues to get the best performance. I'm
> not
> > >> sure how much difference this makes, but it should eliminate the
> > >> compression overhead fetching the data from the DocValues.
> > >>
> > >> Varun's suggestion of using the binary transport will provide a nice
> > >> performance increase as well. But you'll need to upgrade. You may need
> > to
> > >> do that anyway as the fix on the early stream close will be on a later
> > >> version that was refactored to support the binary transport.
> > >>
> > >> Joel Bernstein
> > >>

Re: solr /export handler - behavior during close()

2017-06-25 Thread Susmit Shukla
Hi Joel,

I looked at the fix for SOLR-10698; there could be two potential issues:

- ParallelStream does not set the stream context on the SolrStreams it newly
creates in its open() method.

- This results in the creation of a new, uncached HttpSolrClient in SolrStream's
open() method. That client is created using deprecated methods of the HTTP
client library (HttpClientUtil.createClient) and behaves differently on
close() than one created using the HttpClientBuilder API. SolrClientCache
uses the same deprecated API too.

This test case shows the problem:

ParallelStream ps = new ParallelStream(tupleStream, ...);  // constructor args elided
ps.open();
for (int i = 0; i < 2; i++) {
    ps.read();   // read only the first two tuples
}
ps.close();      // close() reads through to the end of tupleStream

I tried with an HttpClient created by
org.apache.http.impl.client.HttpClientBuilder.create() and close() is working
for that.


Thanks,

Susmit

On Wed, May 17, 2017 at 7:33 AM, Susmit Shukla <shukla.sus...@gmail.com>
wrote:

> Thanks Joel, will try that.
> Binary response would be more performant.
> I observed the server sends responses in 32 kb chunks and the client reads
> it with 8 kb buffer on inputstream. I don't know if changing that can
> impact anything on performance. Even if buffer size is increased on
> httpclient, it can't override the hardcoded 8kb buffer on
> sun.nio.cs.StreamDecoder
>
> Thanks,
> Susmit
>
> On Wed, May 17, 2017 at 5:49 AM, Joel Bernstein <joels...@gmail.com>
> wrote:
>
>> Susmit,
>>
>> You could wrap a LimitStream around the outside of all the relational
>> algebra. For example:
>>
>> parallel(limit((intersect(intersect(search, search), union(search,
>> search)
>>
>> In this scenario the limit would happen on the workers.
>>
>> As far as the worker/replica ratio. This will depend on how heavy the
>> export is. If it's a light export, small number of fields, mostly numeric,
>> simple sort params, then I've seen a ratio of 5 (workers) to 1 (replica)
>> work well. This will basically saturate the CPU on the replica. But
>> heavier
>> exports will saturate the replicas with fewer workers.
>>
>> Also I tend to use Direct DocValues to get the best performance. I'm not
>> sure how much difference this makes, but it should eliminate the
>> compression overhead fetching the data from the DocValues.
>>
>> Varun's suggestion of using the binary transport will provide a nice
>> performance increase as well. But you'll need to upgrade. You may need to
>> do that anyway as the fix on the early stream close will be on a later
>> version that was refactored to support the binary transport.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, May 16, 2017 at 8:03 PM, Joel Bernstein <joels...@gmail.com>
>> wrote:
>>
>> > Yep, saw it. I'll comment on the ticket for what I believe needs to be
>> > done.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Tue, May 16, 2017 at 8:00 PM, Varun Thacker <va...@vthacker.in>
>> wrote:
>> >
>> >> Hi Joel,Susmit
>> >>
>> >> I created https://issues.apache.org/jira/browse/SOLR-10698 to track
>> the
>> >> issue
>> >>
>> >> @Susmit looking at the stack trace I see the expression is using
>> >> JSONTupleStream
>> >> . I wonder if you tried using JavabinTupleStreamParser could it help
>> >> improve performance ?
>> >>
>> >> On Tue, May 16, 2017 at 9:39 AM, Susmit Shukla <
>> shukla.sus...@gmail.com>
>> >> wrote:
>> >>
>> >> > Hi Joel,
>> >> >
>> >> > queries can be arbitrarily nested with AND/OR/NOT joins e.g.
>> >> >
>> >> > (intersect(intersect(search, search), union(search, search))). If I
>> cut
>> >> off
>> >> > the innermost stream with a limit, the complete intersection would
>> not
>> >> > happen at upper levels. Also would the limit stream have same effect
>> as
>> >> > using /select handler with rows parameter?
>> >> > I am trying to force input stream close through reflection, just to
>> see
>> >> if
>> >> > it gives performance gains.
>> >> >
>> >> > 2) would experiment with null streams. Is workers = number of
>> replicas
>> >> in
>> >> > data collection a good thumb rule? is parallelstream performance
>> upper
>> >> > bounded by number of replicas?
>> >> >
>> >> > Thanks,
>> &g

Re: Performance Issue in Streaming Expressions

2017-06-01 Thread Susmit Shukla
Hi,

Which version of Solr are you on?
Increasing memory may not be useful, as the streaming API does not keep much in
memory (except maybe for hash joins).
Increasing replicas (not sharding) and pushing the join computation onto a
worker Solr cluster with #workers > 1 would definitely make things faster.
Are you limiting your results at some cutoff? If yes, then SOLR-10698 can be a
useful fix. Also, the binary response format for streaming would be faster
(available in 6.5, probably).
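As a rough sketch of what pushing the heavy join onto workers can look like
(the collections, field names, worker URL, and worker count here are all
illustrative, and the joined searches must share the same sort and
partitionKeys):

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ParallelJoinSketch {
  public static void main(String[] args) throws Exception {
    String expr =
        "parallel(workerCollection,"
      + "  innerJoin("
      + "    search(collection1, q=\"*:*\", fl=\"id,joinKey\", sort=\"joinKey asc\","
      + "           qt=\"/export\", partitionKeys=\"joinKey\"),"
      + "    search(collection4, q=\"*:*\", fl=\"joinKey,otherField\", sort=\"joinKey asc\","
      + "           qt=\"/export\", partitionKeys=\"joinKey\"),"
      + "    on=\"joinKey\"),"
      + "  workers=\"4\", sort=\"joinKey asc\", zkHost=\"zk1:2181\")";

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("expr", expr);
    params.set("qt", "/stream");

    SolrStream stream = new SolrStream("http://worker-host:8983/solr/workerCollection", params);
    try {
      stream.open();
      for (Tuple t = stream.read(); !t.EOF; t = stream.read()) {
        // consume joined tuples
      }
    } finally {
      stream.close();
    }
  }
}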



On Thu, Jun 1, 2017 at 3:04 PM, thiaga rajan <
ecethiagu2...@yahoo.co.in.invalid> wrote:

> We are working on a proposal and feel that the streaming API along with the
> export handler will best fit our use cases. We already have a structure in
> Solr in which we are using graph queries to produce a hierarchical structure.
> Now from that structure we need to join a couple more collections. We have 5
> different collections:
>   Collection 1 - 800k records
>   Collection 2 - 200k records
>   Collection 3 - 7k records
>   Collection 4 - 6 million records
>   Collection 5 - 150k records
> We are using the below strategy:
>   innerJoin( intersect( innerJoin(collection 1, collection 2),
>     innerJoin(Collection 3, Collection 4)), collection 5)
> We are seeing that performance is too slow once we include collection 4. Just
> with collections 1, 2, and 5 the results come back in 2 secs. The moment I
> include collection 4 in the query I see a performance impact. I believe
> exporting large results from collection 4 is causing the issue. Currently I
> am using a single-sharded collection with no replicas. I am thinking of
> increasing memory as a first option to improve performance, since processing
> docValues needs more memory. If that does not work I can try using parallel
> streams / sharding. Kindly advise if there could be anything else I am missing.
> Sent from Yahoo Mail on Android


Re: solr /export handler - behavior during close()

2017-05-17 Thread Susmit Shukla
Thanks Joel, will try that.
The binary response would be more performant.
I observed that the server sends responses in 32 KB chunks and the client reads
them with an 8 KB buffer on the input stream. I don't know if changing that can
impact performance. Even if the buffer size is increased on the HttpClient, it
can't override the hardcoded 8 KB buffer in sun.nio.cs.StreamDecoder.

Thanks,
Susmit

On Wed, May 17, 2017 at 5:49 AM, Joel Bernstein <joels...@gmail.com> wrote:

> Susmit,
>
> You could wrap a LimitStream around the outside of all the relational
> algebra. For example:
>
> parallel(limit((intersect(intersect(search, search), union(search,
> search)
>
> In this scenario the limit would happen on the workers.
>
> As far as the worker/replica ratio. This will depend on how heavy the
> export is. If it's a light export, small number of fields, mostly numeric,
> simple sort params, then I've seen a ratio of 5 (workers) to 1 (replica)
> work well. This will basically saturate the CPU on the replica. But heavier
> exports will saturate the replicas with fewer workers.
>
> Also I tend to use Direct DocValues to get the best performance. I'm not
> sure how much difference this makes, but it should eliminate the
> compression overhead fetching the data from the DocValues.
>
> Varun's suggestion of using the binary transport will provide a nice
> performance increase as well. But you'll need to upgrade. You may need to
> do that anyway as the fix on the early stream close will be on a later
> version that was refactored to support the binary transport.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, May 16, 2017 at 8:03 PM, Joel Bernstein <joels...@gmail.com>
> wrote:
>
> > Yep, saw it. I'll comment on the ticket for what I believe needs to be
> > done.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Tue, May 16, 2017 at 8:00 PM, Varun Thacker <va...@vthacker.in>
> wrote:
> >
> >> Hi Joel,Susmit
> >>
> >> I created https://issues.apache.org/jira/browse/SOLR-10698 to track the
> >> issue
> >>
> >> @Susmit looking at the stack trace I see the expression is using
> >> JSONTupleStream
> >> . I wonder if you tried using JavabinTupleStreamParser could it help
> >> improve performance ?
> >>
> >> On Tue, May 16, 2017 at 9:39 AM, Susmit Shukla <shukla.sus...@gmail.com
> >
> >> wrote:
> >>
> >> > Hi Joel,
> >> >
> >> > queries can be arbitrarily nested with AND/OR/NOT joins e.g.
> >> >
> >> > (intersect(intersect(search, search), union(search, search))). If I
> cut
> >> off
> >> > the innermost stream with a limit, the complete intersection would not
> >> > happen at upper levels. Also would the limit stream have same effect
> as
> >> > using /select handler with rows parameter?
> >> > I am trying to force input stream close through reflection, just to
> see
> >> if
> >> > it gives performance gains.
> >> >
> >> > 2) would experiment with null streams. Is workers = number of replicas
> >> in
> >> > data collection a good thumb rule? is parallelstream performance upper
> >> > bounded by number of replicas?
> >> >
> >> > Thanks,
> >> > Susmit
> >> >
> >> > On Tue, May 16, 2017 at 5:59 AM, Joel Bernstein <joels...@gmail.com>
> >> > wrote:
> >> >
> >> > > Your approach looks OK. The single sharded worker collection is only
> >> > needed
> >> > > if you were using CloudSolrStream to send the initial Streaming
> >> > Expression
> >> > > to the /stream handler. You are not doing this, so you're approach
> is
> >> > fine.
> >> > >
> >> > > Here are some thoughts on what you described:
> >> > >
> >> > > 1) If you are closing the parallel stream after the top 1000
> results,
> >> > then
> >> > > try wrapping the intersect in a LimitStream. This stream doesn't
> exist
> >> > yet
> >> > > so it will be a custom stream. The LimitStream can return the EOF
> >> tuple
> >> > > after it reads N tuples. This will cause the worker nodes to close
> the
> >> > > underlying stream and cause the Broken Pipe exception to occur at
> the
> >> > > /export handler, which will stop the /export.
> >> > >
> >> > > Here is the basic approach:
> 

Re: solr /export handler - behavior during close()

2017-05-16 Thread Susmit Shukla
Hi Joel,

Queries can be arbitrarily nested with AND/OR/NOT joins, e.g.

(intersect(intersect(search, search), union(search, search))). If I cut off
the innermost stream with a limit, the complete intersection would not
happen at the upper levels. Also, would the limit stream have the same effect
as using the /select handler with the rows parameter?
I am trying to force the input stream to close through reflection, just to see
if it gives any performance gains.

2) I would experiment with null streams. Is workers = number of replicas in the
data collection a good rule of thumb? Is ParallelStream performance
upper-bounded by the number of replicas?

Thanks,
Susmit
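For reference, a rough sketch of the custom LimitStream suggested in the quoted
reply below (written against the Solr 6.x streaming API; this is not an actual
Solr class, and the Explanation plumbing is simplified by delegating to the
wrapped stream):

import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.comp.StreamComparator;
import org.apache.solr.client.solrj.io.stream.StreamContext;
import org.apache.solr.client.solrj.io.stream.TupleStream;
import org.apache.solr.client.solrj.io.stream.expr.Explanation;
import org.apache.solr.client.solrj.io.stream.expr.StreamFactory;

public class LimitStream extends TupleStream {
  private final TupleStream stream;
  private final int limit;
  private int count;

  public LimitStream(TupleStream stream, int limit) {
    this.stream = stream;
    this.limit = limit;
  }

  public void open() throws IOException { stream.open(); }
  public void close() throws IOException { stream.close(); }
  public void setStreamContext(StreamContext context) { stream.setStreamContext(context); }
  public List<TupleStream> children() { return Collections.singletonList(stream); }
  public StreamComparator getStreamSort() { return stream.getStreamSort(); }

  public Explanation toExplanation(StreamFactory factory) throws IOException {
    return stream.toExplanation(factory);   // simplified: reuse the inner explanation
  }

  public Tuple read() throws IOException {
    if (++count > limit) {
      // Emit EOF early; on a worker this ends the stream, the /export
      // connection sees a broken pipe, and the export stops.
      Map<String, Object> eof = new HashMap<>();
      eof.put("EOF", true);
      return new Tuple(eof);
    }
    return stream.read();
  }
}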

On Tue, May 16, 2017 at 5:59 AM, Joel Bernstein <joels...@gmail.com> wrote:

> Your approach looks OK. The single sharded worker collection is only needed
> if you were using CloudSolrStream to send the initial Streaming Expression
> to the /stream handler. You are not doing this, so you're approach is fine.
>
> Here are some thoughts on what you described:
>
> 1) If you are closing the parallel stream after the top 1000 results, then
> try wrapping the intersect in a LimitStream. This stream doesn't exist yet
> so it will be a custom stream. The LimitStream can return the EOF tuple
> after it reads N tuples. This will cause the worker nodes to close the
> underlying stream and cause the Broken Pipe exception to occur at the
> /export handler, which will stop the /export.
>
> Here is the basic approach:
>
> parallel(limit(intersect(search, search)))
>
>
> 2) It can be tricky to understand where the bottleneck lies when using the
> ParallelStream for parallel relational algebra. You can use the NullStream
> to get an understanding of why performance is not increasing when you
> increase the workers. Here is the basic approach:
>
> parallel(null(intersect(search, search)))
>
> The NullStream will eat all the tuples on the workers and return a single
> tuple with the tuple count and the time taken to run the expression. So
> you'll get one tuple from each worker. This will eliminate any bottleneck
> on tuples returning through the ParallelStream and you can focus on the
> performance of the intersect and the /export handler.
>
> Then experiment with:
>
> 1) Increasing the number of parallel workers.
> 2) Increasing the number of replicas in the data collections.
>
> And watch the timing information coming back from the NullStream tuples. If
> increasing the workers is not improving performance then the bottleneck may
> be in the /export handler. So try increasing replicas and see if that
> improves performance. Different partitions of the streams will be served by
> different replicas.
>
> If performance doesn't improve with the NullStream after increasing both
> workers and replicas then we know the bottleneck is the network.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, May 15, 2017 at 10:37 PM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
> > Hi Joel,
> >
> > Regarding the implementation, I am wrapping the topmost TupleStream in a
> > ParallelStream and execute it on the worker cluster (one of the joined
> > cluster doubles up as worker cluster). ParallelStream does submit the
> query
> > to /stream handler.
> > for #2, for e.g. I am creating 2 CloudSolrStreams , wrapping them in
> > IntersectStream and wrapping that in ParallelStream and reading out the
> > tuples from parallel stream. close() is called on parallelStream. I do
> have
> > custom streams but that is similar to intersectStream.
> > I am on solr 6.3.1
> > The 2 solr clusters serving the join queries are having many shards.
> Worker
> > collection is also multi sharded and is one from the main clusters, so do
> > you imply I should be using a single sharded "worker" collection? Would
> the
> > joins execute faster?
> > On a side note, increasing the workers beyond 1 was not improving the
> > execution times but was degrading if number was 3 and above. That is
> > counter intuitive since the joins are huge and putting more workers
> should
> > have improved the performance.
> >
> > Thanks,
> > Susmit
> >
> >
> > On Mon, May 15, 2017 at 6:47 AM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > Ok please do report any issues you run into. This is quite a good bug
> > > report.
> > >
> > > I reviewed the code and I believe I see the problem. The problem seems
> to
> > > be that output code from the /stream handler is not properly accounting
> > for
> > > client disconnects and closing the underlying stream. What I see in the
> > > code is that exceptions coming from rea

Re: solr /export handler - behavior during close()

2017-05-15 Thread Susmit Shukla
Hi Joel,

Regarding the implementation, I am wrapping the topmost TupleStream in a
ParallelStream and executing it on the worker cluster (one of the joined
clusters doubles up as the worker cluster). ParallelStream does submit the
query to the /stream handler.
For #2, for example, I am creating two CloudSolrStreams, wrapping them in an
IntersectStream, wrapping that in a ParallelStream, and reading the tuples out
of the parallel stream. close() is called on the ParallelStream. I do have
custom streams, but they are similar to IntersectStream.
I am on Solr 6.3.1.
The two Solr clusters serving the join queries have many shards. The worker
collection is also multi-sharded and belongs to one of the main clusters, so do
you imply I should be using a single-sharded "worker" collection? Would the
joins execute faster?
On a side note, increasing the workers beyond 1 was not improving the execution
times, and it degraded with 3 or more workers. That is counterintuitive, since
the joins are huge and adding more workers should have improved the performance.
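A rough sketch of that kind of composition, minus the parallel wrapper (Solr
6.x SolrJ streaming API; the zkHosts, collections, and fields are illustrative).
Wrapping the intersect in a ParallelStream additionally needs partitionKeys on
the searches and, as far as I understand it, a StreamFactory set on the
StreamContext so the expression can be shipped to the workers:

import org.apache.solr.client.solrj.io.SolrClientCache;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.eq.FieldEqualitor;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.client.solrj.io.stream.IntersectStream;
import org.apache.solr.client.solrj.io.stream.StreamContext;
import org.apache.solr.common.params.ModifiableSolrParams;

public class IntersectPipelineSketch {
  public static void main(String[] args) throws Exception {
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "*:*");
    params.set("fl", "id");
    params.set("sort", "id asc");
    params.set("qt", "/export");

    CloudSolrStream left  = new CloudSolrStream("zkA:2181", "collectionA", params);
    CloudSolrStream right = new CloudSolrStream("zkB:2181", "collectionB", params);
    IntersectStream intersect = new IntersectStream(left, right, new FieldEqualitor("id"));

    SolrClientCache cache = new SolrClientCache();
    StreamContext context = new StreamContext();
    context.setSolrClientCache(cache);
    intersect.setStreamContext(context);   // context propagates to the child streams

    try {
      intersect.open();
      for (Tuple t = intersect.read(); !t.EOF; t = intersect.read()) {
        // consume intersected tuples
      }
    } finally {
      intersect.close();
      cache.close();
    }
  }
}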

Thanks,
Susmit


On Mon, May 15, 2017 at 6:47 AM, Joel Bernstein <joels...@gmail.com> wrote:

> Ok please do report any issues you run into. This is quite a good bug
> report.
>
> I reviewed the code and I believe I see the problem. The problem seems to
> be that output code from the /stream handler is not properly accounting for
> client disconnects and closing the underlying stream. What I see in the
> code is that exceptions coming from read() in the stream do automatically
> close the underlying stream. But exceptions from the writing of the stream
> do not close the stream. This needs to be fixed.
>
> A few questions about your streaming implementation:
>
> 1) Are you sending requests to the /stream handler? Or are you embedding
> CloudSolrStream in your application and bypassing the /stream handler?
>
> 2) If you're sending Streaming Expressions to the stream handler are you
> using SolrStream or CloudSolrStream to send the expression?
>
> 3) What version of Solr are you using.
>
> 4) Have you implemented any custom streams?
>
>
> #2 is an important question. If you're sending expressions to the /stream
> handler using CloudSolrStream the collection running the expression would
> have to be setup a specific way. The collection running the expression will
> have to be a* single shard collection*. You can have as many replicas as
> you want but only one shard. That's because CloudSolrStream picks one
> replica in each shard to forward the request to then merges the results
> from the shards. So if you send in an expression using CloudSolrStream that
> expression will be sent to each shard to be run and each shard will be
> duplicating the work and return duplicate results.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sat, May 13, 2017 at 7:03 PM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
> > Thanks Joel
> > Streaming is awesome, just had a huge implementation in my project. I
> found
> > out a couple more issues with streaming and did local hacks for them,
> would
> > raise them too.
> >
> > On Sat, May 13, 2017 at 2:09 PM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > Ah, then this is unexpected behavior. Can you open a ticket for this?
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Sat, May 13, 2017 at 2:51 PM, Susmit Shukla <
> shukla.sus...@gmail.com>
> > > wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > I was using CloudSolrStream for the above test. Below is the call
> > stack.
> > > >
> > > > at
> > > > org.apache.http.impl.io.ChunkedInputStream.read(
> > > > ChunkedInputStream.java:215)
> > > > at
> > > > org.apache.http.impl.io.ChunkedInputStream.close(
> > > > ChunkedInputStream.java:316)
> > > > at
> > > > org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(
> > > > ResponseEntityProxy.java:128)
> > > > at
> > > > org.apache.http.conn.EofSensorInputStream.checkClose(
> > > > EofSensorInputStream.java:228)
> > > > at
> > > > org.apache.http.conn.EofSensorInputStream.close(
> > > > EofSensorInputStream.java:174)
> > > > at sun.nio.cs.StreamDecoder.implClose(StreamDecoder.java:378)
> > > > at sun.nio.cs.StreamDecoder.close(StreamDecoder.java:193)
> > > > at java.io.InputStreamReader.close(InputStreamReader.java:199)
> > > > at
> > > > org.apache.so

Re: solr /export handler - behavior during close()

2017-05-13 Thread Susmit Shukla
Thanks Joel
Streaming is awesome; I just did a huge implementation of it in my project. I
found a couple more issues with streaming and did local hacks for them; I will
raise those too.

On Sat, May 13, 2017 at 2:09 PM, Joel Bernstein <joels...@gmail.com> wrote:

> Ah, then this is unexpected behavior. Can you open a ticket for this?
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sat, May 13, 2017 at 2:51 PM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
> > Hi Joel,
> >
> > I was using CloudSolrStream for the above test. Below is the call stack.
> >
> > at
> > org.apache.http.impl.io.ChunkedInputStream.read(
> > ChunkedInputStream.java:215)
> > at
> > org.apache.http.impl.io.ChunkedInputStream.close(
> > ChunkedInputStream.java:316)
> > at
> > org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(
> > ResponseEntityProxy.java:128)
> > at
> > org.apache.http.conn.EofSensorInputStream.checkClose(
> > EofSensorInputStream.java:228)
> > at
> > org.apache.http.conn.EofSensorInputStream.close(
> > EofSensorInputStream.java:174)
> > at sun.nio.cs.StreamDecoder.implClose(StreamDecoder.java:378)
> > at sun.nio.cs.StreamDecoder.close(StreamDecoder.java:193)
> > at java.io.InputStreamReader.close(InputStreamReader.java:199)
> > at
> > org.apache.solr.client.solrj.io.stream.JSONTupleStream.
> > close(JSONTupleStream.java:91)
> > at
> > org.apache.solr.client.solrj.io.stream.SolrStream.close(
> > SolrStream.java:186)
> >
> > Thanks,
> > Susmit
> >
> > On Sat, May 13, 2017 at 10:48 AM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > I was just reading the Java docs on the ChunkedInputStream.
> > >
> > > "Note that this class NEVER closes the underlying stream"
> > >
> > > In that scenario the /export would indeed continue to send data. I
> think
> > we
> > > can consider this an anti-pattern for the /export handler currently.
> > >
> > > I would suggest using one of the Streaming Clients to connect to the
> > export
> > > handler. Either CloudSolrStream or SolrStream will both interact with
> the
> > > /export handler in the way that it expects.
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Sat, May 13, 2017 at 12:28 PM, Susmit Shukla <
> shukla.sus...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > I did not observe that. On calling close() on stream, it cycled
> through
> > > all
> > > > the hits that /export handler calculated.
> > > > e.g. with a *:* query and export handler on a 100k document index, I
> > > could
> > > > see the 100kth record printed on the http wire debug log although
> close
> > > was
> > > > called after reading 1st tuple. The time taken for the operation with
> > > > close() call was same as that if I had read all the 100k tuples.
> > > > As I have pointed out, close() on underlying ChunkedInputStream calls
> > > > read() and solr server has probably no way to distinguish it from
> > read()
> > > > happening from regular tuple reads..
> > > > I think there should be an abort() API for solr streams that hooks
> into
> > > > httpmethod.abort() . That would enable client to disconnect early and
> > > > probably that would disconnect the underlying socket so there would
> be
> > no
> > > > leaks.
> > > >
> > > > Thanks,
> > > > Susmit
> > > >
> > > >
> > > > On Sat, May 13, 2017 at 7:42 AM, Joel Bernstein <joels...@gmail.com>
> > > > wrote:
> > > >
> > > > > If the client closes the connection to the export handler then this
> > > > > exception will occur automatically on the server.
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > > On Sat, May 13, 2017 at 1:46 AM, Susmit Shukla <
> > > shukla.sus...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Joel,
> > > > > >
> > > > > > Thanks for the insight. How can this exception be thrown/forced
> > from
> > > > > client
> > > > > > side. Cli

Re: solr /export handler - behavior during close()

2017-05-13 Thread Susmit Shukla
Hi Joel,

I was using CloudSolrStream for the above test. Below is the call stack.

at
org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:215)
at
org.apache.http.impl.io.ChunkedInputStream.close(ChunkedInputStream.java:316)
at
org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:128)
at
org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at
org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at sun.nio.cs.StreamDecoder.implClose(StreamDecoder.java:378)
at sun.nio.cs.StreamDecoder.close(StreamDecoder.java:193)
at java.io.InputStreamReader.close(InputStreamReader.java:199)
at
org.apache.solr.client.solrj.io.stream.JSONTupleStream.close(JSONTupleStream.java:91)
at
org.apache.solr.client.solrj.io.stream.SolrStream.close(SolrStream.java:186)

Thanks,
Susmit

On Sat, May 13, 2017 at 10:48 AM, Joel Bernstein <joels...@gmail.com> wrote:

> I was just reading the Java docs on the ChunkedInputStream.
>
> "Note that this class NEVER closes the underlying stream"
>
> In that scenario the /export would indeed continue to send data. I think we
> can consider this an anti-pattern for the /export handler currently.
>
> I would suggest using one of the Streaming Clients to connect to the export
> handler. Either CloudSolrStream or SolrStream will both interact with the
> /export handler in the way that it expects.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sat, May 13, 2017 at 12:28 PM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
> > Hi Joel,
> >
> > I did not observe that. On calling close() on stream, it cycled through
> all
> > the hits that /export handler calculated.
> > e.g. with a *:* query and export handler on a 100k document index, I
> could
> > see the 100kth record printed on the http wire debug log although close
> was
> > called after reading 1st tuple. The time taken for the operation with
> > close() call was same as that if I had read all the 100k tuples.
> > As I have pointed out, close() on underlying ChunkedInputStream calls
> > read() and solr server has probably no way to distinguish it from read()
> > happening from regular tuple reads..
> > I think there should be an abort() API for solr streams that hooks into
> > httpmethod.abort() . That would enable client to disconnect early and
> > probably that would disconnect the underlying socket so there would be no
> > leaks.
> >
> > Thanks,
> > Susmit
> >
> >
> > On Sat, May 13, 2017 at 7:42 AM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > If the client closes the connection to the export handler then this
> > > exception will occur automatically on the server.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Sat, May 13, 2017 at 1:46 AM, Susmit Shukla <
> shukla.sus...@gmail.com>
> > > wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > Thanks for the insight. How can this exception be thrown/forced from
> > > client
> > > > side. Client can't do a System.exit() as it is running as a webapp.
> > > >
> > > > Thanks,
> > > > Susmit
> > > >
> > > > On Fri, May 12, 2017 at 4:44 PM, Joel Bernstein <joels...@gmail.com>
> > > > wrote:
> > > >
> > > > > In this scenario the /export handler continues to export results
> > until
> > > it
> > > > > encounters a "Broken Pipe" exception. This exception is trapped and
> > > > ignored
> > > > > rather then logged as it's not considered an exception if the
> client
> > > > > disconnects early.
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > > On Fri, May 12, 2017 at 2:10 PM, Susmit Shukla <
> > > shukla.sus...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have a question regarding solr /export handler. Here is the
> > > scenario
> > > > -
> > > > > > I want to use the /export handler - I only need sorted data and
> > this
> > > is
> > > > > the
> > > > > > fastest way to get it. I am doing multiple level joins using
> > streams
> > > > > using
> > > > > > /export handler. I know the number of top level records to be

Re: solr /export handler - behavior during close()

2017-05-13 Thread Susmit Shukla
Hi Joel,

I did not observe that. On calling close() on the stream, it cycled through all
the hits that the /export handler calculated.
For example, with a *:* query and the export handler on a 100k-document index,
I could see the 100,000th record printed in the HTTP wire debug log even though
close() was called after reading the first tuple. The time taken for the
operation with the close() call was the same as if I had read all 100k tuples.
As I have pointed out, close() on the underlying ChunkedInputStream calls
read(), and the Solr server probably has no way to distinguish it from the
read() calls of regular tuple reads.
I think there should be an abort() API for Solr streams that hooks into
httpmethod.abort(). That would enable the client to disconnect early, and it
would probably disconnect the underlying socket, so there would be no leaks.
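For what it's worth, a rough sketch of that abort() idea with a raw Apache
HttpClient request against /export (the URL and parameters are illustrative
only):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class AbortExportSketch {
  public static void main(String[] args) throws Exception {
    try (CloseableHttpClient client = HttpClients.createDefault()) {
      HttpGet get = new HttpGet(
          "http://host:8983/solr/collection1/export?q=*:*&sort=id+asc&fl=id&wt=json");
      CloseableHttpResponse rsp = client.execute(get);
      // ... read as many tuples as needed from rsp.getEntity().getContent() ...
      get.abort();   // aborts the request; the socket is discarded instead of drained
      rsp.close();   // releases the (now unusable) connection without reading to EOF
    }
  }
}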

Thanks,
Susmit


On Sat, May 13, 2017 at 7:42 AM, Joel Bernstein <joels...@gmail.com> wrote:

> If the client closes the connection to the export handler then this
> exception will occur automatically on the server.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sat, May 13, 2017 at 1:46 AM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
> > Hi Joel,
> >
> > Thanks for the insight. How can this exception be thrown/forced from
> client
> > side. Client can't do a System.exit() as it is running as a webapp.
> >
> > Thanks,
> > Susmit
> >
> > On Fri, May 12, 2017 at 4:44 PM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > In this scenario the /export handler continues to export results until
> it
> > > encounters a "Broken Pipe" exception. This exception is trapped and
> > ignored
> > > rather than logged as it's not considered an exception if the client
> > > disconnects early.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, May 12, 2017 at 2:10 PM, Susmit Shukla <
> shukla.sus...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a question regarding solr /export handler. Here is the
> scenario
> > -
> > > > I want to use the /export handler - I only need sorted data and this
> is
> > > the
> > > > fastest way to get it. I am doing multiple level joins using streams
> > > using
> > > > /export handler. I know the number of top level records to be
> retrieved
> > > but
> > > > not for each individual stream rolling up to the final result.
> > > > I observed that calling close() on a /export stream is too expensive.
> > It
> > > > reads the stream to the very end of hits. Assuming there are 100
> > million
> > > > hits for each stream ,first 1k records were found after joins and we
> > call
> > > > close() after that, it would take many minutes/hours to finish it.
> > > > Currently I have put close() call in a different thread - basically
> > fire
> > > > and forget. But the cluster is very strained because of the
> > unneccessary
> > > > reads.
> > > >
> > > > Internally streaming uses ChunkedInputStream of HttpClient and it has
> > to
> > > be
> > > > drained in the close() call. But from server point of view, it should
> > > stop
> > > > sending more data once close() has been issued.
> > > > There is a read() call in close() method of ChunkedInputStream that
> is
> > > > indistinguishable from real read(). If /export handler stops sending
> > more
> > > > data after close it would be very useful.
> > > >
> > > > Another option would be to use /select handler and get into business
> of
> > > > managing a custom cursor mark that is based on the stream sort and is
> > > reset
> > > > until it fetches the required records at topmost level.
> > > >
> > > > Any thoughts.
> > > >
> > > > Thanks,
> > > > Susmit
> > > >
> > >
> >
>


Re: solr /export handler - behavior during close()

2017-05-12 Thread Susmit Shukla
Hi Joel,

Thanks for the insight. How can this exception be thrown/forced from the client
side? The client can't do a System.exit(), as it is running as a webapp.

Thanks,
Susmit

On Fri, May 12, 2017 at 4:44 PM, Joel Bernstein <joels...@gmail.com> wrote:

> In this scenario the /export handler continues to export results until it
> encounters a "Broken Pipe" exception. This exception is trapped and ignored
> rather than logged as it's not considered an exception if the client
> disconnects early.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, May 12, 2017 at 2:10 PM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I have a question regarding solr /export handler. Here is the scenario -
> > I want to use the /export handler - I only need sorted data and this is
> the
> > fastest way to get it. I am doing multiple level joins using streams
> using
> > /export handler. I know the number of top level records to be retrieved
> but
> > not for each individual stream rolling up to the final result.
> > I observed that calling close() on a /export stream is too expensive. It
> > reads the stream to the very end of hits. Assuming there are 100 million
> > hits for each stream ,first 1k records were found after joins and we call
> > close() after that, it would take many minutes/hours to finish it.
> > Currently I have put close() call in a different thread - basically fire
> > and forget. But the cluster is very strained because of the unneccessary
> > reads.
> >
> > Internally streaming uses ChunkedInputStream of HttpClient and it has to
> be
> > drained in the close() call. But from server point of view, it should
> stop
> > sending more data once close() has been issued.
> > There is a read() call in close() method of ChunkedInputStream that is
> > indistinguishable from real read(). If /export handler stops sending more
> > data after close it would be very useful.
> >
> > Another option would be to use /select handler and get into business of
> > managing a custom cursor mark that is based on the stream sort and is
> reset
> > until it fetches the required records at topmost level.
> >
> > Any thoughts.
> >
> > Thanks,
> > Susmit
> >
>


solr /export handler - behavior during close()

2017-05-12 Thread Susmit Shukla
Hi,

I have a question regarding solr /export handler. Here is the scenario -
I want to use the /export handler - I only need sorted data and this is the
fastest way to get it. I am doing multiple level joins using streams using
/export handler. I know the number of top level records to be retrieved but
not for each individual stream rolling up to the final result.
I observed that calling close() on an /export stream is too expensive. It
reads the stream to the very end of the hits. Assuming there are 100 million
hits for each stream, the first 1k records are found after the joins, and we
call close() after that; it would take many minutes or hours to finish.
Currently I have put the close() call in a different thread - basically fire
and forget. But the cluster is very strained because of the unnecessary
reads.

Internally, streaming uses HttpClient's ChunkedInputStream, and it has to be
drained in the close() call. But from the server's point of view, it should
stop sending more data once close() has been issued.
There is a read() call in the close() method of ChunkedInputStream that is
indistinguishable from a real read(). If the /export handler stopped sending
more data after close() it would be very useful.

Another option would be to use the /select handler and get into the business of
managing a custom cursor mark that is based on the stream sort and is advanced
until it fetches the required records at the topmost level.
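A rough sketch of that /select-based approach using Solr's built-in cursorMark
(the collection, ZooKeeper address, and page size are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181").build()) {
      client.setDefaultCollection("mycollection");
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(1000);
      q.setSort("id", SolrQuery.ORDER.asc);   // the sort must include the uniqueKey
      String cursor = CursorMarkParams.CURSOR_MARK_START;
      while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        QueryResponse rsp = client.query(q);
        // ... consume rsp.getResults(); stop whenever enough top-level records are read ...
        String next = rsp.getNextCursorMark();
        if (next.equals(cursor)) {
          break;   // cursor did not advance: no more results
        }
        cursor = next;
      }
    }
  }
}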

Any thoughts.

Thanks,
Susmit


Json Parse Exception in CloudSolrStream class

2016-07-03 Thread Susmit Shukla
Hi,

I'm using a string field in the sort parameters of a Solr query. The query is
used with the /export handler to stream data using CloudSolrStream. When the
data in the field contains a double quote, CloudSolrStream fails to read the
data and throws this error:

field data = "first (alias) last" 

org.noggit.JSONParser$ParseException: Expected ',' or '}':
char=F,position=43701
BEFORE='4DC93D74AEDE28292D27A2EC39F8761E1","field_dv":""F' AFTER='irst
(alias) last" 

Re: export with collapse filter runs into NPE

2016-06-10 Thread Susmit Shukla
Hi Joel,

I would need to join results from two Solr clouds before collapsing, so it
would not be an issue right now.
I ran into another issue: if the data in any of the shards is empty, export
throws the error below.
Once I have at least one document in each shard, it works fine.

org.apache.solr.common.SolrException; null:java.io.IOException:
org.apache.solr.search.SyntaxError: xport RankQuery is required for xsort:
rq={!xport}

at
org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:101)

at
org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)

at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)

at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)

...

Caused by: org.apache.solr.search.SyntaxError: xport RankQuery is required
for xsort: rq={!xport}

... 26 more

On Fri, Jun 10, 2016 at 1:09 PM, Joel Bernstein <joels...@gmail.com> wrote:

> This sounds like a bug. I'm pretty sure there are no tests that use
> collapse with the export handler.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Jun 10, 2016 at 3:59 PM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I'm running this export query, it is working fine. f1 is the uniqueKey
> and
> > running solr 5.3.1
> >
> > /export?q=f1:term1&sort=f1+desc&fl=f1,f2
> >
> > if I add collapsing filter, it is giving NullPointerException
> >
> > /export?q=f1:term1&sort=f1+desc&fl=f1,f2&fq={!collapse field=f2}
> >
> > does collapsing filter work with /export handler?
> >
> >
> > java.lang.NullPointerException
> > at
> > org.apache.lucene.util.BitSetIterator.<init>(BitSetIterator.java:58)
> > at
> >
> org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:138)
> > at
> >
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
> > at
> > org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
> > at
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> > at
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> > at
> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> > at org.eclipse.jetty.server.Server.handle(Server.java:499)
> > at
> > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
> > at
> >
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> > at
> >
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> > at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> > at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> > at java.lang.Thread.run(Thread.java:745)
> >
>


export with collapse filter runs into NPE

2016-06-10 Thread Susmit Shukla
Hi,

I'm running this export query and it is working fine; f1 is the uniqueKey, and
I am running Solr 5.3.1:

/export?q=f1:term1&sort=f1+desc&fl=f1,f2

If I add the collapsing filter, it gives a NullPointerException:

/export?q=f1:term1&sort=f1+desc&fl=f1,f2&fq={!collapse field=f2}

Does the collapsing filter work with the /export handler?


java.lang.NullPointerException
at org.apache.lucene.util.BitSetIterator.<init>(BitSetIterator.java:58)
at 
org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:138)
at 
org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
at 
org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)


Re: start parameter for CloudSolrStream

2016-06-08 Thread Susmit Shukla
Thanks Joel,

Yes, with the /select handler the start parameter is applied to the query
running on each individual shard, so it doesn't return the expected aggregate
results. I will probably need to roll out some custom join on collections
running on different Solr clouds.
Also, are multiple fq clauses supported in CloudSolrStream params?
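For reference, with the SolrParams-based constructor multiple fq values can at
least be passed by using add() rather than set() (a small sketch with made-up
field names; whether a given CloudSolrStream version forwards all of them is a
separate question):

import org.apache.solr.common.params.ModifiableSolrParams;

public class MultiFqParamsSketch {
  public static ModifiableSolrParams build() {
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "*:*");
    params.add("fq", "type:email");   // add() appends; set() would replace
    params.add("fq", "year:2016");    // second fq clause
    params.set("fl", "id");
    params.set("sort", "id asc");
    return params;
  }
}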

Thanks,
Susmit

On Wed, Jun 8, 2016 at 8:58 AM, Joel Bernstein <joels...@gmail.com> wrote:

> CloudSolrStream doesn't really understand the concept of paging. It just
> sees a stream of Tuples coming from a collection and merges them.
>
> If you're using the default /select handler it will be passed the start
> param and start from that point. But if use the /export handler the start
> parameter will be ignored.
>
> In general though paging is not a supported feature yet of the Streaming
> API. There are plans to support this in the future to add support for the
> OFFSET SQL clause.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Jun 7, 2016 at 5:08 PM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
> > *sending with correct subject*
> >
> > Does solr streaming aggregation support pagination?
> > Some documents seem to be skipped if I set "start" parameter for
> > CloudSolrStream for a sharded collection.
> >
> > Thanks,
> > Susmit
> >
>


start parameter for CloudSolrStream

2016-06-07 Thread Susmit Shukla
*sending with correct subject*

Does solr streaming aggregation support pagination?
Some documents seem to be skipped if I set "start" parameter for
CloudSolrStream for a sharded collection.

Thanks,
Susmit


Re: Field Definitions Ignored

2016-06-07 Thread Susmit Shukla
Does solr streaming aggregation support pagination?
Some documents seem to be skipped if I set "start" parameter for
CloudSolrStream for a sharded collection.

Thanks,
Susmit


Re: fq behavior...

2016-05-05 Thread Susmit Shukla
Please take a look at this blog, specifically "Leapfrog Anyone?" section-
http://yonik.com/advanced-filter-caching-in-solr/
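
For example, the cache and cost local params described in that post let you
control whether a filter is cached and when it runs relative to the main
query; filters are tried in cost order, and an uncached filter with
cost >= 100 whose query type supports post-filtering (like frange) runs last,
over only the documents that already match q and the cheaper filters. The
popularity field below is just a made-up example:

q=title:"something really specific"
&fq=bPublic:true
&fq={!frange l=0 u=10 cache=false cost=150}log(popularity)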

Thanks,
Susmit

On Thu, May 5, 2016 at 10:54 PM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:

> Hi guys,
>
> Just a quick question, that I did not find an easy answer.
>
> 1.
>
>Is the fq "executed" before or after the usual query (q)
>
>    e.g.: select?q=title:"something really specific"&fq=bPublic:true&rows=10
>
>Would it first:
>
>  * get all the "specific" results, and then apply the filter
>  * OR is it first getting all the docs matching the fq and then
>running the "q" query
>
> In other words, does it first check for "the best cardinality"?
>
> Kind regards,
> Bastien
>
>


Re: Query String Limit

2016-05-05 Thread Susmit Shukla
Hi Prasanna,

What is the exact number you set it to?
What error did you get on the Solr console and in the Solr logs?
Did you reload the core / restart Solr after bumping up the value in
solrconfig.xml?
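
For reference, the limit lives in the <query> section of solrconfig.xml and
only takes effect after a core reload / restart. It is also effectively a
global Lucene setting, so the value from whichever core loads last wins - a
common reason a raised value appears not to stick. A sketch (the number is
only an example):

<query>
  <maxBooleanClauses>10240</maxBooleanClauses>
</query>

If the id list keeps growing, the terms query parser avoids the boolean
clause limit altogether, e.g. fq={!terms f=record_id}604929,504197,500759
(comma-separated values, available since Solr 4.10), and sending the request
as a POST avoids URL length limits.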

Thanks,
Susmit

On Wed, May 4, 2016 at 9:45 PM, Prasanna S. Dhakephalkar <
prasann...@merajob.in> wrote:

> Hi
>
> We had increased the maxBooleanClauses to a large number, but it did not
> work
>
> Here is the query
>
>
> http://localhost:8983/solr/collection1/select?fq=record_id%3A(604929+504197+
>
> 500759+510957+624719+524081+544530+375687+494822+468221+553049+441998+495212
>
> +462613+623866+344379+462078+501936+189274+609976+587180+620273+479690+60601
>
> 8+487078+496314+497899+374231+486707+516582+74518+479684+1696152+1090711+396
>
> 784+377205+600603+539686+550483+436672+512228+1102968+600604+487699+612271+4
>
> 87978+433952+479846+492699+380838+412290+487086+515836+487957+525335+495426+
>
> 619724+49726+444558+67422+368749+630542+473638+613887+1679503+509367+1108299
>
> +498818+528683+530270+595087+468595+585998+487888+600612+515884+455568+60643
>
> 8+526281+497992+460147+587530+576456+526021+790508+486148+469160+365923+4846
>
> 54+510829+488792+610933+254610+632700+522376+594418+514817+439283+1676569+52
>
> 4031+431557+521628+609255+627205+1255921+57+477017+519675+548373+350309+
>
> 491176+524276+570935+549458+495765+512814+494722+382249+619036+477309+487718
>
> +470604+514622+1240902+570607+613830+519130+479708+630293+496994+623870+5706
>
> 72+390434+483496+609115+490875+443859+292168+522383+501802+606498+596773+479
>
> 881+486020+488654+490422+512636+495512+489480+626269+614618+498967+476988+47
>
> 7608+486568+270095+295480+478367+607120+583892+593474+494373+368030+484522+5
>
> 01183+432822+448109+553418+584084+614868+486206+481014+495027+501880+479113+
>
> 615208+488161+512278+597663+569409+139097+489490+584000+493619+607479+281080
>
> +518617+518803+487896+719003+584153+484341+505689+278177+539722+548001+62529
>
> 6+1676456+507566+619039+501882+530385+474125+293642+612857+568418+640839+519
>
> 893+524335+612859+618762+479460+479719+593700+573677+525991+610965+462087+52
>
> 1251+501197+443642+1684784+533972+510695+475499+490644+613829+613893+479467+
>
> 542478+1102898+499230+436921+458632+602303+488468+1684407+584373+494603+4992
>
> 45+548019+600436+606997+59+503156+440428+518759+535013+548023+494273+649
>
> 062+528704+469282+582249+511250+496466+497675+505937+489504+600444+614240+19
>
> 35577+464232+522398+613809+1206232+607149+607644+498059+506810+487115+550976
>
> +638174+600849+525655+625011+500082+606336+507156+487887+333601+457209+60111
>
> 0+494927+1712081+601280+486061+501558+600451+263864+527378+571918+472415+608
>
> 130+212386+380460+590400+478850+631886+486782+608013+613824+581767+527023+62
>
> 3207+607013+505819+485418+486786+537626+507047+92+527473+495520+553141+5
>
> 17837+497295+563266+495506+532725+267057+497321+453249+524341+429654+720001+
>
> 539946+490813+479491+479628+479630+1125985+351147+524296+565077+439949+61241
>
> 3+495854+479493+1647796+600259+229346+492571+485638+596394+512112+477237+600
>
> 459+263780+704068+485934+450060+475944+582280+488031+1094010+1687904+539515+
>
> 525820+539516+505985+600461+488991+387733+520928+362967+351847+531586+616101
>
> +479925+494156+511292+515729+601903+282655+491244+610859+486081+325500+43639
>
> 7+600708+523445+480737+486083+614767+486278+1267655+484845+495145+562624+493
>
> 381+8060+638731+501347+565979+325132+501363+268866+614113+479646+1964487+631
>
> 934+25717+461612+376451+513712+527557+459209+610194+1938903+488861+426305+47
>
> 7676+1222682+1246647+567986+501908+791653+325802+498354+435156+484862+533068
>
> +339875+395827+475148+331094+528741+540715+623480+416601+516419+600473+62563
>
> 2+480570+447412+449778+503316+492365+563298+486361+500907+514521+138405+6123
>
> 27+495344+596879+524918+474563+47273+514739+553189+548418+448943+450612+6006
>
> 78+484753+485302+271844+474199+487922+473784+431524+535371+513583+514746+612
>
> 534+327470+485855+517878+384102+485856+612768+494791+504840+601330+493551+55
>
> 8620+540131+479809+394179+487866+559955+578444+576571+485861+488879+573089+4
>
> 97552+487898+490369+535756+614155+633027+487473+517912+523364+527419+600487+
>
> 486128+278040+598478+487395+600579+585691+498970+488151+608187+445943+631971
>
> +230291+504552+534443+501924+489148+292672+528874+434783+479533+485301+61908
>
> 9+629083+479383+600981+534717+645420+604921+618714+522329+597822+507413+5706
>
> 05+491732+464741+511564+613929+526049+614817+589065+603307+491990+467339+264
>
> 426+487907+492982+589067+487674+487820+492983+486708+504140+1216198+625736+4
>
> 92984+530116+615663+503248+1896822+600588+518139+494994+621846+599669+488207
>
> +640923+487580+539856+603968+444717+492991+614824+491735+492992+495149+52117
>
> 2+365778+261681+600502+479682+597464+492997+587172+624381+482355+1246338+593
>
> 642+492000+494707+620137+493000+20617+585199+587176+587177+1877064+587179+53
>
> 

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread Susmit Shukla
Which SolrJ version are you using? Could you try with SolrJ 6.0?
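
On the /stream question quoted below: in Solr 5.x the /stream handler (and,
depending on the exact version, /export) has to be registered explicitly in
solrconfig.xml, whereas 6.x defines them implicitly. A sketch of the
5.x-style entries (check the stock solrconfig.xml for your version):

<requestHandler name="/export" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="rq">{!xport}</str>
    <str name="wt">xsort</str>
    <str name="distrib">false</str>
  </lst>
  <arr name="components">
    <str>query</str>
  </arr>
</requestHandler>

<requestHandler name="/stream" class="solr.StreamHandler">
  <lst name="invariants">
    <str name="wt">json</str>
    <str name="distrib">false</str>
  </lst>
</requestHandler>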

On Tue, Apr 26, 2016 at 10:36 AM, sudsport s  wrote:

> @Joel
> >Can you describe how you're planning on using Streaming?
>
> I am mostly using it for distirbuted join case. We were planning to use
> similar logic (hash id and join) in Spark for our usecase. but since data
> is stored in solr , I will be using solr stream to perform same operation.
>
> I have similar user cases to build probabilistic data-structures while
> streaming results. I might have to spend some time in exploring query
> optimization (while doing join decide sort order etc)
>
> Please let me know if you have any feedback.
>
> On Tue, Apr 26, 2016 at 10:30 AM, sudsport s  wrote:
>
> > Thanks @Reth yes that was my one of the concern. I will look at JIRA you
> > mentioned.
> >
> > Thanks Joel
> > I used some of examples for streaming client from your blog. I got basic
> > tuple stream working but I get following exception while running parallel
> > string.
> >
> >
> > java.io.IOException: java.util.concurrent.ExecutionException:
> > org.noggit.JSONParser$ParseException: JSON Parse Error: char=<,position=0
> > BEFORE='<' AFTER='html>   >
> >
> > looks like Parallel stream is trying to access /stream on shard. can
> > someone tell me how to enable stream handler? I have export handler
> > enabled. I will look at latest solrconfig to see if I can turn that on.
> >
> >
> >
> > @Joel I am running sizing exercises already , I will run new one with
> > solr5.5+ and docValues on id enabled.
> >
> > BTW Solr streaming has amazing response times thanks for making it so
> > FAST!!!
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Apr 25, 2016 at 10:54 AM, Joel Bernstein 
> > wrote:
> >
> >> Can you describe how you're planning on using Streaming? I can provide
> >> some
> >> feedback on how it will perform for your use use.
> >>
> >> When scaling out Streaming you'll get large performance boosts when you
> >> increase the number of shards, replicas and workers. This is
> particularly
> >> true if you're doing parallel relational algebra or map/reduce
> operations.
> >>
> >> As far a DocValues being expensive with unique fields, you'll want to
> do a
> >> sizing exercise to see how many documents per-shard work best for your
> use
> >> case. There are different docValues implementations that will allow you
> to
> >> trade off memory for performance.
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Mon, Apr 25, 2016 at 3:30 AM, Reth RM  wrote:
> >>
> >> > Hi,
> >> >
> >> > So, is the concern related to same field value being stored twice:
> with
> >> > stored=true and docValues=true? If that is the case, there is a jira
> >> > relevant to this, fixed[1]. If you upgrade to 5.5/6.0 version, it is
> >> > possible to read non-stored fields from docValues index., check out.
> >> >
> >> >
> >> > [1] https://issues.apache.org/jira/browse/SOLR-8220
> >> >
> >> > On Mon, Apr 25, 2016 at 9:44 AM, sudsport s 
> >> wrote:
> >> >
> >> > > Thanks Erik for reply,
> >> > >
> >> > > Since I was storing Id (its stored field) and after enabling
> >> docValues my
> >> > > guess is it will be stored in 2 places. also as per my understanding
> >> > > docValues are great when you have values which repeat. I am not sure
> >> how
> >> > > beneficial it would be for uniqueId field.
> >> > > I am looking at collection of few hundred billion documents , that
> is
> >> > > reason I really want to care about expense from design phase.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Sun, Apr 24, 2016 at 7:24 PM, Erick Erickson <
> >> erickerick...@gmail.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > In a word, "yes".
> >> > > >
> >> > > > DocValues aren't particularly expensive, or expensive at all. The
> >> idea
> >> > > > is that when you sort by a field or facet, the field has to be
> >> > > > "uninverted" which builds the entire structure in Java's JVM (this
> >> is
> >> > > > when the field is _not_ DocValues).
> >> > > >
> >> > > > DocValues essentially serialize this structure to disk. So your
> >> > > > on-disk index size is larger, but that size is MMaped rather than
> >> > > > stored on Java's heap.
> >> > > >
> >> > > > Really, the question I'd have to ask though is "why do you care
> >> about
> >> > > > the expense?". If you have a functional requirement that has to be
> >> > > > served by returning the id via the /export handler, you really
> have
> >> no
> >> > > > choice.
> >> > > >
> >> > > > Best,
> >> > > > Erick
> >> > > >
> >> > > >
> >> > > > On Sun, Apr 24, 2016 at 9:55 AM, sudsport s  >
> >> > > wrote:
> >> > > > > I was trying to use Streaming for reading basic tuple stream. I
> am
> >> > > using
> >> > > > > sort by id asc ,
> >> > > > > I am getting following exception
> >> > > > >
> >> > > > > I am using export search handler as per
> >> > > 

Re: Cross collection join in Solr 5.x

2016-04-21 Thread Susmit Shukla
I have done it by extending the Solr join plugin. I needed to override two
methods from the join plugin and it works out.
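
A rough skeleton of the approach (the class name and parser name are
placeholders, and which methods end up being overridden depends on the Solr
version):

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.JoinQParserPlugin;
import org.apache.solr.search.QParser;

public class CrossCollectionJoinQParserPlugin extends JoinQParserPlugin {

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    // Start from the stock join parser and customize how the "from" side
    // (core/collection lookup) is resolved so it can span collections.
    return super.createParser(qstr, localParams, params, req);
  }
}

It is then registered in solrconfig.xml with something like
<queryParser name="xcjoin" class="com.example.CrossCollectionJoinQParserPlugin"/>
and used as {!xcjoin fromIndex=otherCollection from=child_id to=id}field:value
(field and collection names here are placeholders).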

Thanks,
Susmit

On Thu, Apr 21, 2016 at 12:01 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello,
>
> There is no much progress on
> https://issues.apache.org/jira/browse/SOLR-8297
> Although it's really achievable.
>
> On Thu, Apr 21, 2016 at 7:52 PM, Shikha Somani 
> wrote:
>
> > Greetings,
> >
> >
> > Background: Our application is using Solr 4.10 and has multiple
> > collections all of them sharded equally on Solr. These collections were
> > joined to support complex queries.
> >
> >
> > Problem: We are trying to upgrade to Solr 5.x. However from Solr 5.2
> > onward to join two collections it is a requirement that the secondary
> > collection must be singly sharded and replicated where primary collection
> > is. But collections are very large and need to be sharded for
> performance.
> >
> >
> > Query: Is there any way in Solr 5.x to join two collections both of which
> > are equally sharded i.e. the secondary collection is also sharded as the
> > primary.
> >
> >
> > Thanks,
> > Shikha
> >
> > 
> >
> >
> >
> >
> >
> >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


Re: Return only parent on child query match (w/o block-join)

2016-04-19 Thread Susmit Shukla
Hi Shamik,

You could try Solr grouping using the group.query construct. You could
discard the child matches from the result (i.e. any doc that has a
parent_doc_id field) and use a join to fetch the parent record:

q=*:*&group=true&group.query=title:title2&group.query={!join
from=parent_doc_id to=doc_id}parent_doc_id:*&rows=10
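
From SolrJ the same request would look roughly like this (a sketch; the
zkHost and collection name are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupQueryJoinExample {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zk1:2181");
    client.setDefaultCollection("collection1");

    SolrQuery query = new SolrQuery("*:*");
    query.set("group", true);
    // one result group per group.query; the second one joins back to parents
    query.add("group.query", "title:title2");
    query.add("group.query",
        "{!join from=parent_doc_id to=doc_id}parent_doc_id:*");
    query.setRows(10);

    QueryResponse rsp = client.query(query);
    // inspect the groups and drop the direct child matches client-side
    System.out.println(rsp.getGroupResponse().getValues());
    client.close();
  }
}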

Thanks,
Susmit


On Tue, Apr 19, 2016 at 10:29 AM, Shamik Bandopadhyay 
wrote:

> Hi,
>
>I have a set of documents indexed which has a pseudo parent-child
> relationship. Each child document had a reference to the parent document.
> Due to document availability complexity (and the condition of updating both
> parent-child documents at the time of indexing), I'm not able to use
> explicit block-join.Instead of a nested structure, they are all flat.
> Here's an example:
>
> 
>   1
>   Parent title
>   123
> 
> 
>   2
>   Child title1
>   123
> 
> 
>   3
>   Child title2
>   123
> 
> 
>   4
>   Misc title2
> 
>
> What I'm looking is if I search "title2", the result should bring back the
> following two docs, 1 matching the parent and one based on a regular match.
>
> 
>   1
>   Parent title
>   123
> 
> 
>   4
>   Misc title2
> 
>
> With block-join support, I could have used Block Join Parent Query Parser,
> q={!parent which="content_type:parentDocument"}title:title2
>
> Transforming result documents is an alternate but it has the reverse
> support through ChildDocTransformerFactory
>
> Just wondering if there's a way to address query in a different way. Any
> pointers will be appreciated.
>
> -Thanks,
> Shamik
>


Re: UUID processor handling of empty string

2016-04-17 Thread Susmit Shukla
Hi Erick/Jack,

I agree that "Your code is violating the contract for the UUID update
processor", so the index could be in a bad state. I have already put in the
fix and no further action is needed. I was just curious about the resulting
behavior.

For completeness, here were my results -
indexed 2 docs with these fields using the Solr console's Documents tab:


doc1: {"id":""}
doc2: {"id":""}

matchAllDocs query
q=*:*&sort=id+desc
"numFound": 2, "start": 0, "docs": [ {"id":
"9542901e-ede3-46dc-af6c-c30025c7b417"}, {"id":
"f29fcb97-ef5e-4c3e-b4fe-f50a963f894d"} ]

q=*:*&sort=id+asc - no change in order
"numFound": 2, "start": 0, "docs": [ {"id":
"9542901e-ede3-46dc-af6c-c30025c7b417"}, {"id":
"f29fcb97-ef5e-4c3e-b4fe-f50a963f894d"} ]

doc1: {"id":"whatever"}
got error:
"error": { "msg": "Invalid UUID String: 'whatever'",

doc1: {"_version_":-1} - the id field is omitted, but at least one field is
needed to index
doc2: {"_version_":-1}

matchAllDocs query
q=*:*&sort=id+desc
"numFound": 2, "start": 0, "docs": [ {"id": "
c4e19489-fad1-42f4-b216-88ba550f3d16"}, {"id": "
99d652b8-3eb6-4a9f-a722-33246e8553d4"} ]

q=*:*&sort=id+asc - works
"numFound": 2, "start": 0, "docs": [ {"id": "
99d652b8-3eb6-4a9f-a722-33246e8553d4"}, {"id": "
c4e19489-fad1-42f4-b216-88ba550f3d16"} ]
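
For anyone hitting the same thing, the chain Hoss suggested would look
roughly like this (chain and field names assumed), so blank ids are trimmed
and removed before the UUID processor runs:

<updateRequestProcessorChain name="uuid">
  <processor class="solr.TrimFieldUpdateProcessorFactory"/>
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>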

On Sat, Apr 16, 2016 at 8:01 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> I did a quick experiment (admittedly with 5.x, but even if this is a
> bug it won't be back-ported to 5.3) and this works exactly as I
> expect. I have three docs with IDs as follows
> doc1: . This is equivalent to your ""
> doc2: whatever
> doc3:
>
> As expected, when the output comes back doc1 has an empty field, doc2
> has "whatever" and doc3 has a newly-generated uuid that happens to
> start with "f".
>
> Adding =id asc returns:
> doc1: (empty string)
> doc3: fblahblah
> doc2: whatever
>
> Adding =id desc returns
> doc2: whatever
> doc3: fblahblah
> doc1:(empty string)
>
> So for about the third time, "what do you mean by 'doesn't work'?"
> Provide simple example date (just how you specify the "id" field is
> sufficient). Provide the requests you're using. Point out what's not
> as you expect.
>
> You might want to review:
> http://wiki.apache.org/solr/UsingMailingLists
>
> Best,
> Erick
>
> On Sat, Apr 16, 2016 at 9:54 AM, Jack Krupansky
> <jack.krupan...@gmail.com> wrote:
> > Remove that line of code from your client, or... add the remove blank
> field
> > update processor as Hoss suggested. Your code is violating the contract
> for
> > the UUID update processor. An empty string is still a value, and the
> > presence of a value is an explicit trigger to suppress the UUID update
> > processor.
> >
> > -- Jack Krupansky
> >
> > On Sat, Apr 16, 2016 at 12:41 PM, Susmit Shukla <shukla.sus...@gmail.com
> >
> > wrote:
> >
> >> I am seeing the UUID getting generated when I set the field as empty
> string
> >> like this - solrDoc.addField("id", ""); with solr 5.3.1 and based on the
> >> above schema.
> >> The resulting documents in the index are searchable but not sortable.
> >> Someone could verify if this bug exists and file a jira.
> >>
> >> Thanks,
> >> Susmit
> >>
> >>
> >>
> >> On Sat, Apr 16, 2016 at 8:56 AM, Jack Krupansky <
> jack.krupan...@gmail.com>
> >> wrote:
> >>
> >> > "UUID processor factory is generating uuid even if it is empty."
> >> >
> >> > The processor will generate the UUID only if the id field is not
> >> specified
> >> > in the input document. Empty value and value not present are not the
> same
> >> > thing.
> >> >
> >> > So, please clarify your specific situation.
> >> >
> >> >
> >> > -- Jack Krupansky
> >> >
> >> > On Thu, Apr 14, 2016 at 7:20 PM, Susmit Shukla <
> shukla.sus...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Chris/Erick,
> >> > >
> >> > > Does not work in the sense the order of documents does not change on
> >> > > changing sort from asc to desc.
> >> > > This could be just a trivial bug where UUID processo

Re: UUID processor handling of empty string

2016-04-16 Thread Susmit Shukla
I am seeing the UUID getting generated when I set the field to an empty
string like this - solrDoc.addField("id", ""); - with Solr 5.3.1 and the
schema above.
The resulting documents in the index are searchable but not sortable.
Could someone verify whether this bug exists and file a JIRA?

Thanks,
Susmit



On Sat, Apr 16, 2016 at 8:56 AM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> "UUID processor factory is generating uuid even if it is empty."
>
> The processor will generate the UUID only if the id field is not specified
> in the input document. Empty value and value not present are not the same
> thing.
>
> So, please clarify your specific situation.
>
>
> -- Jack Krupansky
>
> On Thu, Apr 14, 2016 at 7:20 PM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
>
> > Hi Chris/Erick,
> >
> > Does not work in the sense the order of documents does not change on
> > changing sort from asc to desc.
> > This could be just a trivial bug where UUID processor factory is
> generating
> > uuid even if it is empty.
> > This is on solr 5.3.0
> >
> > Thanks,
> > Susmit
> >
> >
> >
> >
> >
> > On Thu, Apr 14, 2016 at 2:30 PM, Chris Hostetter <
> hossman_luc...@fucit.org
> > >
> > wrote:
> >
> > >
> > > I'm also confused by what exactly you mean by "doesn't work" but a
> > general
> > > suggestion you can try is putting the
> > > RemoveBlankFieldUpdateProcessorFactory before your UUID Processor...
> > >
> > >
> > >
> >
> https://lucene.apache.org/solr/6_0_0/solr-core/org/apache/solr/update/processor/RemoveBlankFieldUpdateProcessorFactory.html
> > >
> > > If you are also worried about strings that aren't exactly empty, but
> > > consist only of whitespace, you can put TrimFieldUpdateProcessorFactory
> > > before RemoveBlankFieldUpdateProcessorFactory ...
> > >
> > >
> > >
> >
> https://lucene.apache.org/solr/6_0_0/solr-core/org/apache/solr/update/processor/TrimFieldUpdateProcessorFactory.html
> > >
> > >
> > > : Date: Thu, 14 Apr 2016 12:30:24 -0700
> > > : From: Erick Erickson <erickerick...@gmail.com>
> > > : Reply-To: solr-user@lucene.apache.org
> > > : To: solr-user <solr-user@lucene.apache.org>
> > > : Subject: Re: UUID processor handling of empty string
> > > :
> > > : What do you mean "doesn't work"? An empty string is
> > > : different than not being present. Thee UUID update
> > > : processor (I'm pretty sure) only adds a field if it
> > > : is _absent_. Specifying it as an empty string
> > > : fails that test so no value is added.
> > > :
> > > : At that point, if this uuid field is also the ,
> > > : then each doc that comes in with an empty field will replace
> > > : the others.
> > > :
> > > : If it's _not_ the , the sorting will be confusing.
> > > : All the empty string fields are equal, so the tiebreaker is
> > > : the internal Lucene doc ID, which may change as merges
> > > : happen. You can specify secondary sort fields to make the
> > > : sort predictable (the  field is popular for this).
> > > :
> > > : Best,
> > > : Erick
> > > :
> > > : On Thu, Apr 14, 2016 at 12:18 PM, Susmit Shukla <
> > shukla.sus...@gmail.com>
> > > wrote:
> > > : > Hi,
> > > : >
> > > : > I have configured solr schema to generate unique id for a
> collection
> > > using
> > > : > UUIDUpdateProcessorFactory
> > > : >
> > > : > I am seeing a peculiar behavior - if the unique 'id' field is
> > > explicitly
> > > : > set as empty string in the SolrInputDocument, the document gets
> > indexed
> > > : > with UUID update processor generating the id.
> > > : > However, sorting does not work if uuid was generated in this way.
> > Also
> > > : > cursor functionality that depends on unique id sort also does not
> > work.
> > > : > I guess the correct behavior would be to fail the indexing if user
> > > provides
> > > : > an empty string for a uuid field.
> > > : >
> > > : > The issues do not happen if I omit the id field from the
> > > SolrInputDocument .
> > > : >
> > > : > SolrInputDocument
> > > : >
> > > : > solrDoc.addField("id", "");
> > > : >
> > > : > ...
> > > : >
> > > : > I am using schema similar to below-
> > > : >
> > > : > 
> > > : >
> > > : > 
> > > : >
> > > : >  > > required="true" />
> > > : >
> > > : > id
> > > : >
> > > : > 
> > > : > 
> > > : > 
> > > : >   id
> > > : > 
> > > : > 
> > > : > 
> > > : >
> > > : >
> > > : >  
> > > : >
> > > : >  uuid
> > > : >
> > > : > 
> > > : >
> > > : >
> > > : > Thanks,
> > > : > Susmit
> > > :
> > >
> > > -Hoss
> > > http://www.lucidworks.com/
> > >
> >
>


Re: UUID processor handling of empty string

2016-04-14 Thread Susmit Shukla
Hi Chris/Erick,

"Does not work" in the sense that the order of documents does not change
when changing the sort from asc to desc.
This could just be a trivial bug where the UUID processor factory generates
a uuid even if the field is empty.
This is on Solr 5.3.0.

Thanks,
Susmit





On Thu, Apr 14, 2016 at 2:30 PM, Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> I'm also confused by what exactly you mean by "doesn't work" but a general
> suggestion you can try is putting the
> RemoveBlankFieldUpdateProcessorFactory before your UUID Processor...
>
>
> https://lucene.apache.org/solr/6_0_0/solr-core/org/apache/solr/update/processor/RemoveBlankFieldUpdateProcessorFactory.html
>
> If you are also worried about strings that aren't exactly empty, but
> consist only of whitespace, you can put TrimFieldUpdateProcessorFactory
> before RemoveBlankFieldUpdateProcessorFactory ...
>
>
> https://lucene.apache.org/solr/6_0_0/solr-core/org/apache/solr/update/processor/TrimFieldUpdateProcessorFactory.html
>
>
> : Date: Thu, 14 Apr 2016 12:30:24 -0700
> : From: Erick Erickson <erickerick...@gmail.com>
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user <solr-user@lucene.apache.org>
> : Subject: Re: UUID processor handling of empty string
> :
> : What do you mean "doesn't work"? An empty string is
> : different than not being present. Thee UUID update
> : processor (I'm pretty sure) only adds a field if it
> : is _absent_. Specifying it as an empty string
> : fails that test so no value is added.
> :
> : At that point, if this uuid field is also the ,
> : then each doc that comes in with an empty field will replace
> : the others.
> :
> : If it's _not_ the , the sorting will be confusing.
> : All the empty string fields are equal, so the tiebreaker is
> : the internal Lucene doc ID, which may change as merges
> : happen. You can specify secondary sort fields to make the
> : sort predictable (the  field is popular for this).
> :
> : Best,
> : Erick
> :
> : On Thu, Apr 14, 2016 at 12:18 PM, Susmit Shukla <shukla.sus...@gmail.com>
> wrote:
> : > Hi,
> : >
> : > I have configured solr schema to generate unique id for a collection
> using
> : > UUIDUpdateProcessorFactory
> : >
> : > I am seeing a peculiar behavior - if the unique 'id' field is
> explicitly
> : > set as empty string in the SolrInputDocument, the document gets indexed
> : > with UUID update processor generating the id.
> : > However, sorting does not work if uuid was generated in this way. Also
> : > cursor functionality that depends on unique id sort also does not work.
> : > I guess the correct behavior would be to fail the indexing if user
> provides
> : > an empty string for a uuid field.
> : >
> : > The issues do not happen if I omit the id field from the
> SolrInputDocument .
> : >
> : > SolrInputDocument
> : >
> : > solrDoc.addField("id", "");
> : >
> : > ...
> : >
> : > I am using schema similar to below-
> : >
> : > 
> : >
> : > 
> : >
> : >  required="true" />
> : >
> : > id
> : >
> : > 
> : > 
> : > 
> : >   id
> : > 
> : > 
> : > 
> : >
> : >
> : >  
> : >
> : >  uuid
> : >
> : > 
> : >
> : >
> : > Thanks,
> : > Susmit
> :
>
> -Hoss
> http://www.lucidworks.com/
>


UUID processor handling of empty string

2016-04-14 Thread Susmit Shukla
Hi,

I have configured the Solr schema to generate a unique id for a collection
using UUIDUpdateProcessorFactory.

I am seeing a peculiar behavior - if the unique 'id' field is explicitly set
to an empty string in the SolrInputDocument, the document gets indexed, with
the UUID update processor generating the id.
However, sorting does not work if the uuid was generated this way, and the
cursor functionality that depends on a unique-id sort also does not work.
I guess the correct behavior would be to fail the indexing if the user
provides an empty string for a uuid field.

The issues do not happen if I omit the id field from the SolrInputDocument.

SolrInputDocument

solrDoc.addField("id", "");

...
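
The client-side workaround is simply to not add the field at all when there
is no real value - a sketch (the "title" field is just an example):

import org.apache.solr.common.SolrInputDocument;

public class UuidSafeDoc {
  // Only set "id" when there is a non-blank value; otherwise leave it out
  // so UUIDUpdateProcessorFactory assigns one.
  public static SolrInputDocument build(String maybeId, String title) {
    SolrInputDocument doc = new SolrInputDocument();
    if (maybeId != null && !maybeId.trim().isEmpty()) {
      doc.addField("id", maybeId);
    }
    doc.addField("title", title);
    return doc;
  }
}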

I am using schema similar to below-

<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
<field name="id" type="uuid" indexed="true" stored="true" required="true" />

<uniqueKey>id</uniqueKey>

<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>



Thanks,
Susmit


Question regarding empty UUID field

2016-04-12 Thread Susmit Shukla
Hi,

I have configured the Solr schema to generate a unique id for a collection
using UUIDUpdateProcessorFactory.

I am seeing a peculiar behavior - if the unique 'id' field is explicitly set
to an empty string in the SolrInputDocument, the document gets indexed; I
can see in the Solr query console that a good uuid value was generated by
Solr and assigned to id.
However, sorting does not work if the uuid was generated this way, and the
cursor functionality that depends on a unique-id sort also does not work.
I guess the correct behavior would be to fail the indexing if the user
provides an empty string for a uuid field.

The issues do not happen if I omit the id field from the SolrInputDocument.

SolrInputDocument

solrDoc.addField("id", "");

...

I am using schema similar to below-

<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
<field name="id" type="uuid" indexed="true" stored="true" required="true" />

<uniqueKey>id</uniqueKey>

<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>



Thanks,
Susmit


Solr Cloud Default Document Routing

2014-09-24 Thread Susmit Shukla
Hi,

I'm building out a multi-shard Solr collection as the index size is likely
to grow fast.
I was testing out the setup with 2 shards on 2 nodes with test data, and
indexed a few documents with id as the unique key.
Collection create command -
/solr/admin/collections?action=CREATE&name=multishard&numShards=2

used this command to upload - curl
http://server/solr/multishard/update/json?commitWithin=2000 --data-binary
@data.json -H 'Content-type:application/json'

data.json -
[
  {
    "id": 100161200
  },
  {
    "id": 100161384
  }
]

When I query one of the nodes with an id constraint, I see the query
executed on both shards, which looks inefficient - QTime increased to double
digits. I would guess Solr knows, based on the id, which shard the data went
to.

I have a few questions around this, as I could not find pertinent
information on the user lists or in the documentation.
- The query is hitting all shards and replicas - if I have 3 shards and 5
replicas, how would performance be impacted, given that for this very simple
case QTime already increased to double digits?
- Could id lookup queries just go to one shard automatically?

/solr/multishard/select?q=id%3A100161200&wt=json&indent=true&debugQuery=true

"QTime": 13,

  "debug": {
    "track": {
      "rid": "-multishard_shard1_replica1-1411605234897-171",
      "EXECUTE_QUERY": [
        "http://server1/solr/multishard_shard1_replica1/", [
          "QTime", "1",
          "ElapsedTime", "4",
          "RequestPurpose", "GET_TOP_IDS",
          "NumFound", "1",
          "Response", "some resp"],
        "http://server2/solr/multishard_shard2_replica1/", [
          "QTime", "1",
          "ElapsedTime", "6",
          "RequestPurpose", "GET_TOP_IDS",
          "NumFound", "0",
          "Response", "some"]],
      "GET_FIELDS": [
        "http://server1/solr/multishard_shard1_replica1/", [
          "QTime", "0",
          "ElapsedTime", "4",
          "RequestPurpose", "GET_FIELDS,GET_DEBUG",
          "NumFound", "1",
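
One way to make the id lookup hit only the shard that owns the document
(assuming the default compositeId router and that the docs were indexed with
these plain ids) is to pass the same value in the _route_ parameter, e.g.:

/solr/multishard/select?q=id%3A100161200&_route_=100161200&wt=json

The real-time get handler is another option for lookups by unique key and is
routed to a single shard:

/solr/multishard/get?id=100161200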


Thanks,
Susmit


fix wiki error

2014-07-08 Thread Susmit Shukla
The URL in the Solr atomic update documentation example should end with
"json". Here is the page -
https://wiki.apache.org/solr/UpdateJSON#Solr_4.0_Example

curl http://localhost:8983/solr/update/*json* -H 'Content-type:application/json'
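
For context, a minimal atomic update against that corrected endpoint would
look something like this (the id and field values are just examples):

curl 'http://localhost:8983/solr/update/json?commit=true' \
  -H 'Content-type:application/json' \
  -d '[{"id":"book1","title":{"set":"A new title"}}]'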