Re: Indexing (posting document) taking a lot of time

2016-08-16 Thread kshitij tyagi
I am posting json using curl.

On Wed, Aug 17, 2016 at 4:41 AM, Alexandre Rafalovitch 
wrote:

> What format are those documents? Solr XML? Custom JSON?
>
> Or are you sending PDF/binary documents to Solr's extract handler and
> asking it to do the extraction of the useful stuff? If the latter, you
> could take that step out of Solr with a custom client using Tika (what
> Solr has under the hood) and only send to Solr the processed output.
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 16 August 2016 at 22:49, kshitij tyagi 
> wrote:
> > 400kb is the size of a single document and I am sending 100 documents per
> > request.
> > Solr heap size is 16gb and it is running multithreaded.
> >
> > On Tue, Aug 16, 2016 at 5:10 PM, Emir Arnautovic <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi,
> >>
> >> 400KB/doc * 100doc = 40MB. If you are running it single threaded, Solr
> >> will be idle while accepting a relatively large request. Or is 400KB the
> >> size of the 100-doc bulk that you are sending?
> >>
> >> What is Solr's heap size? I would try increasing number of threads and
> >> monitor Solr's heap/CPU/IO to see where is the bottleneck.
> >>
> >> How complex is fields' analysis?
> >>
> >> Regards,
> >> Emir
> >>
> >>
> >> On 16.08.2016 13:25, kshitij tyagi wrote:
> >>
> >>> hi,
> >>>
> >>> we are sending about 100 documents per request for indexing. We have
> >>> autocommit set to false and commit only when 1 documents are
> >>> present. Solr and the machine sending requests are in the same pool.
> >>>
> >>>
> >>>
> >>> On Tue, Aug 16, 2016 at 4:51 PM, Emir Arnautovic <
> >>> emir.arnauto...@sematext.com> wrote:
> >>>
> >>> Hi,
> 
>  Do you send one doc per request? How frequently do you commit? Where
> is
>  Solr running? What is network connection between your machine and
> Solr?
>  What are JVM settings? Is 10-30s for entire indexing or single doc?
> 
>  Regards,
>  Emir
> 
> 
>  On 16.08.2016 11:34, kshitij tyagi wrote:
> 
>  Hi alexandre,
> >
> > 1 document of 400kb size is taking approx. 10-30 sec and this is
> > varying. I am posting the document using curl
> >
> > On Tue, Aug 16, 2016 at 2:11 PM, Alexandre Rafalovitch <
> > arafa...@gmail.com>
> > wrote:
> >
> > How many records is that and what is 'slow'? Also is this standalone
> or
> >
> >> cluster setup?
> >>
> >> On 16 Aug 2016 6:33 PM, "kshitij tyagi" <
> kshitij.shopcl...@gmail.com>
> >> wrote:
> >>
> >> Hi,
> >>
> >>> I am indexing a lot of data about 8GB, but it is taking a lot of
> >>> time. I
> >>> have read about maxBufferedDocs, ramBufferSizeMB, merge policy
> ,etc in
> >>> solrconfig file.
> >>>
> >>> It would be helpful if someone could help me tune the settings for
> >>> faster indexing speeds.
> >>>
> >>> *I have read the docs but am not able to understand what exactly
> >>> changing these configs means.*
> >>>
> >>>
> >>> *Regards,*
> >>> *Kshitij*
> >>>
> >>>
> >>> --
>  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>  Solr & Elasticsearch Support * http://sematext.com/
> 
> 
> 
> >> --
> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> Solr & Elasticsearch Support * http://sematext.com/
> >>
> >>
>


Re: Multiple rollups/facets in one streaming aggregation?

2016-08-16 Thread Radu Gheorghe
Thanks a lot, Joel, for your very fast and informative reply!

We'll chew on this and add a Jira if we're going on this route.
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Aug 16, 2016 at 8:29 PM, Joel Bernstein  wrote:
> For the initial implementation we could skip the merge piece if that helps
> get things done faster. In this scenario the metrics could be gathered
> after some parallel operation, then there would be no need for a merge.
> Sample syntax:
>
> metrics(parallel(join()))
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 16, 2016 at 1:25 PM, Joel Bernstein  wrote:
>
>> The concept of a MetricStream was in the early designs but hasn't yet been
>> implemented. Now might be a good time to work on the implementation.
>>
>> The MetricStream wraps a stream and gathers metrics in memory, continuing
>> to emit the tuples from the underlying stream. This allows multiple
>> MetricStreams to operate over the same stream without transforming the
> >> stream. Pseudo code for a metric expression syntax is below:
> >>
> >> metrics(metrics(search()))
>>
> >> The MetricStream delivers its metrics through the EOF Tuple. So the
>> MetricStream simply adds the finished aggregations to the EOF Tuple and
>> returns it. If we're going to support parallel metric gathering then we'll
>> also need to support the merging of the metrics. Something like this:
>>
> >> metrics(parallel(metrics(join())))
>>
> >> Where the metrics wrapping the parallel function would need to collect the
> >> EOF tuples from each worker, merge the metrics, and then emit the
> >> merged metrics in an EOF Tuple.
>>
> >> If you think this meets your needs, feel free to create a Jira and
> >> begin a patch and I can help get it committed.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, Aug 16, 2016 at 11:52 AM, Radu Gheorghe <
>> radu.gheor...@sematext.com> wrote:
>>
>>> Hello Solr users :)
>>>
>>> Right now it seems that if I want to rollup on two different fields
>>> with streaming expressions, I would need to do two separate requests.
>>> This is too slow for our use-case, when we need to do joins before
>>> sorting and rolling up (because we'd have to re-do the joins).
>>>
>>> Since in our case we are actually looking for some not-necessarily
>>> accurate facets (top N), the best solution we could come up with was
>>> to implement a new stream decorator that implements an algorithm like
>>> Count-min sketch[1] which would run on the tuples provided by the
>>> stream function it wraps. This would have two big wins for us:
>>> 1) it would do the facet without needing to sort on the facet field,
>>> so we'll potentially save lots of memory
>>> 2) because sorting isn't needed, we could do multiple facets in one go
>>>
>>> That said, I have two (broad) questions:
>>> A) is there a better way of doing this? Let's reduce the problem to
>>> streaming aggregations, where the assumption is that we have multiple
>>> collections where data needs to be joined, and then facet on fields
>>> from all collections. But maybe there's a better algorithm, something
>>> out of the box or closer to what is offered out of the box?
>>> B) whatever the best way is, could we do it in a way that can be
>>> contributed back to Solr? Any hints on how to do that? Just another
>>> decorator?
>>>
>>> Thanks and best regards,
>>> Radu
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>> [1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
>>>
>>
>>


Re: Increasing filterCache size and Java Heap size

2016-08-16 Thread Erick Erickson
Yes. Each entry is roughly 1K + maxdoc/8 bytes. The maxdoc/8 is the
bitmap that holds the result set and the 1K is just overhead for the
text of the query itself and cache overhead. It's usually safe to
ignore the 1K since the maxdoc/8 term dominates by a wide margin.
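
As a rough worked example (the maxdoc figure below is made up, just to
show the arithmetic):

  maxdoc = 20,000,000 docs
  per entry: 20,000,000 / 8 + 1024 bytes ~= 2.5 MB
  filterCache size = 512 entries
  worst case: 512 * 2.5 MB ~= 1.25 GB of heap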

Best,
Erick

On Tue, Aug 16, 2016 at 8:02 PM, Zheng Lin Edwin Yeo
 wrote:
> Hi,
>
> Would like to check, do I need to increase my Java Heap size for Solr, if I
> plan to increase my filterCache size in solrconfig.xml?
>
> I'm using Solr 6.1.0
>
> Regards,
> Edwin


Increasing filterCache size and Java Heap size

2016-08-16 Thread Zheng Lin Edwin Yeo
Hi,

Would like to check, do I need to increase my Java Heap size for Solr, if I
plan to increase my filterCache size in solrconfig.xml?

I'm using Solr 6.1.0

Regards,
Edwin


Error During Indexing - org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: early EOF

2016-08-16 Thread Jaspal Sawhney
Hello
We are running solr 4.6 in a master-slave configuration wherein our master is 
used entirely for indexing. No search traffic ever comes to the master.
Of late we have started to get the early EOF error on the solr Master, which 
results in a Broken Pipe error on the commerce application from which indexing 
was kicked off.

Things to mention

  1.  We have a couple of sites – each of which has the same document size but 
a different document count.
  2.  This error is being observed on the site which has the highest document 
count, i.e. 2204743.
  3.  The way I have understood solr to work is that irrespective of the number 
of documents – the throughput is controlled by the ‘Number of Threads’ and ‘Batch 
size’ - am I correct?
 *   In our case we had not touched the batch size and number of threads 
when this error started coming.
 *   However, when I do touch these parameters (specifically reduce them) 
the error does not come – but indexing time increases a lot.
  4.  We have to index overnight daily because we put product prices in the 
Index which get updated nightly
  5.  Solr master is running with a 20 GB Heap

What we have tried

  1.  I disabled autoCommit (i.e. hard commit) and put the autoSoftCommit at 5 
mins.
 *   I realized afterwards that this was a wrong test because my 
understanding of soft commit was incorrect. My understanding now is that a hard 
commit just truncates the tlog, so hard commits should give better indexing 
performance.
 *   This test failed for lack of space; however, because disabling 
autoCommit did not make sense, I have not retried this test yet.
  2.  Increased the RAMBufferSizeMB from 100MB to 1000MB
 *   This test did not yield anything favorable – the master gave the early 
EOF exception
  3.  Increased the merge factor from 20 —> 100
 *   This test did not yield anything favorable – the master gave the early 
EOF exception
  4.  Flipped the autoCommit to 15 secs and disabled autoSoftCommit
 *   This test did not yield anything favorable – the master gave the early 
EOF exception
 *   I got the input for this from 
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
 - Heavy (Bulk) Indexing section
  5.  Tried to bypass transaction log all together – This test is underway 
currently
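
For reference, the solrconfig.xml knobs we have been playing with look 
roughly like this (the values reflect the tests above, not a recommendation):

<!-- under <updateHandler> -->
<autoCommit>
  <maxTime>15000</maxTime>            <!-- hard commit: flushes segments, truncates the tlog -->
  <openSearcher>false</openSearcher>  <!-- don't open a new searcher on every hard commit -->
</autoCommit>
<autoSoftCommit>
  <maxTime>-1</maxTime>               <!-- -1 disables soft commits -->
</autoSoftCommit>

<!-- under <indexConfig> -->
<ramBufferSizeMB>1000</ramBufferSizeMB>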

Questions

  1.  Since we are not using solrCloud – I want to understand the impact of 
bypassing the transaction log.
  2.  How does solr get documents which are sent to it into storage, i.e. what 
is the journey of a document from segment to tlog to storage?

It would be great If there are any pointers which you can share.

Thanks
J./

The actual Error Log
ERROR - 2016-08-16 22:59:55.988; org.apache.solr.common.SolrException; 
org.apache.solr.common.SolrException: early EOF
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:176)
at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:721)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at 
org.eclipse.jetty.server.BlockingHttpConnect

Re: The most efficient way to get un-inverted view of the index?

2016-08-16 Thread Joel Bernstein
You'll want to use org.apache.lucene.index.DocValues. The DocValues api has
replaced the field cache.
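
A minimal sketch against the Lucene 6.x API (the field names are made up,
and exception handling is elided):

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

// Walk each segment once, e.g. while warming a custom cache.
void warmCache(IndexReader reader) throws java.io.IOException {
  for (LeafReaderContext ctx : reader.leaves()) {
    NumericDocValues prices = DocValues.getNumeric(ctx.reader(), "price");
    SortedSetDocValues tags = DocValues.getSortedSet(ctx.reader(), "tags");
    for (int doc = 0; doc < ctx.reader().maxDoc(); doc++) {
      long price = prices.get(doc); // 0 for docs with no value
      tags.setDocument(doc);
      for (long ord = tags.nextOrd();
           ord != SortedSetDocValues.NO_MORE_ORDS;
           ord = tags.nextOrd()) {
        BytesRef term = tags.lookupOrd(ord);
        // ... accumulate price/term into the cache ...
      }
    }
  }
}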





Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Aug 16, 2016 at 8:18 PM, Roman Chyla  wrote:

> I need to read data from the index in order to build a special cache.
> Previously, in SOLR4, this was accomplished with FieldCache or
> DocTermOrds
>
> Now, I'm struggling to see what API to use; there are many of them:
>
> on lucene level:
>
> UninvertingReader.getNumericDocValues (and others)
> .getNumericValues()
> MultiDocValues.getNumericValues()
> MultiFields.getTerms()
>
> on solr level:
>
> reader.getNumericValues()
> UninvertingReader.getNumericDocValues()
> and extensions to FilterLeafReader - e.g. very interesting, but
> undocumented facet accumulators (ex: NumericAcc)
>
>
> I need this for solr, and ideally re-use the existing cache [i.e. the
> special cache uses other fields so those get loaded only once
> and reused in the old solr; which is a win-win situation]
>
> If I use reader.getValues() or FilterLeafReader will I be reading data
> every time the object is created? What would be the best way to read
> data only once?
>
> Thanks,
>
> --roman
>


The most efficient way to get un-inverted view of the index?

2016-08-16 Thread Roman Chyla
I need to read data from the index in order to build a special cache.
Previously, in SOLR4, this was accomplished with FieldCache or
DocTermOrds

Now, I'm struggling to see what API to use; there are many of them:

on lucene level:

UninvertingReader.getNumericDocValues (and others)
.getNumericValues()
MultiDocValues.getNumericValues()
MultiFields.getTerms()

on solr level:

reader.getNumericValues()
UninvertingReader.getNumericDocValues()
and extensions to FilterLeafReader - e.g. very interesting, but
undocumented facet accumulators (ex: NumericAcc)


I need this for solr, and ideally re-use the existing cache [i.e. the
special cache uses other fields so those get loaded only once
and reused in the old solr; which is a win-win situation]

If I use reader.getValues() or FilterLeafReader will I be reading data
every time the object is created? What would be the best way to read
data only once?

Thanks,

--roman


Re: Creating a SolrJ Data Service to send JSON to Solr

2016-08-16 Thread Anshum Gupta
I would also suggest sending the JSON directly to the JSON endpoint, with
the mapping:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-JSONUpdateConveniencePaths
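
For example (the collection name and fields here are placeholders):

curl 'http://localhost:8983/solr/mycollection/update/json/docs?commit=true' \
  -H 'Content-type:application/json' \
  -d '{"id":"1","title":"hello world"}'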

On Tue, Aug 16, 2016 at 4:43 PM Alexandre Rafalovitch 
wrote:

> Why do you need a POJO? For Solr purposes, you could just get the
> field names from schema and use those to map directly from JSON to the
> 'addField' calls in SolrDocument.
>
> Do you need it for non-Solr purposes? Then you can search for generic
> Java dynamic POJO generation solution.
>
> Also, you could look at creating a superset rather than common-subset
> POJO and then ignore all unknown fields on Solr side by adding a
> dynamicField that matches '*' with everything (index, store,
> docValues) set to false.
>
> Regards,
>Alex.
>
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 17 August 2016 at 02:49, Jennifer Coston
>  wrote:
> >
> > Hello,
> > I am trying to write a data service using SolrJ that will allow me to
> > accept JSON through a REST API, create a Solr document, and write it to
> > multiple different Solr cores (depending on the core name specified). The
> > problem I am running into is that each core is going to have a different
> > schema. My current code has the common fields between all the schemas in
> a
> > data POJO which I then walk and set the values specified in the JSON to
> the
> > Solr Document. However, I don’t want to create a different class for each
> > schema to process the JSON and convert it to a Solr Document. Is there a
> > way to process the extra JSON fields that are not common between the
> > schemas and add them to the Solr Document, without knowing what they are
> > ahead of time? Is there a way to convert JSON to a Solr Document without
> > having to use a POJO?  An alternative I was looking into is to use the
> > SolrClient to get the schema fields, create a POJO, walk that POJO to
> > create a Solr Document and then add it to Solr but, it doesn’t seem to be
> > possible to obtain the fields this way.
> >
> > I know that the easiest way to add JSON to Solr would be to use a curl
> > command and send the JSON directly to Solr but this doesn’t match our
> > requirements, so I need to figure out a way to perform the same operation
> > using SolrJ. Any other ideas or suggestions would be greatly appreciated!
> >
> > Thank you,
> >
> > -Jennifer
>


Re: Creating a SolrJ Data Service to send JSON to Solr

2016-08-16 Thread Alexandre Rafalovitch
Why do you need a POJO? For Solr purposes, you could just get the
field names from schema and use those to map directly from JSON to the
'addField' calls in SolrDocument.

Do you need it for non-Solr purposes? Then you can search for generic
Java dynamic POJO generation solution.

Also, you could look at creating a superset rather than common-subset
POJO and then ignore all unknown fields on Solr side by adding a
dynamicField that matches '*' with everything (index, store,
docValues) set to false.
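
A rough SolrJ sketch of the first idea (untested; it assumes Jackson for
the JSON parsing, and exception handling is elided):

import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;
import org.apache.solr.client.solrj.response.schema.SchemaResponse;
import org.apache.solr.common.SolrInputDocument;
import com.fasterxml.jackson.databind.ObjectMapper;

SolrInputDocument toDoc(SolrClient client, String json) throws Exception {
  // Ask the core for its field definitions via the Schema API;
  // fields.getFields() is a List<Map<String,Object>>, one map per field,
  // which you could use to whitelist the incoming JSON keys.
  SchemaResponse.FieldsResponse fields =
      new SchemaRequest.Fields().process(client);

  Map<String, Object> parsed =
      new ObjectMapper().readValue(json, Map.class);
  SolrInputDocument doc = new SolrInputDocument();
  for (Map.Entry<String, Object> e : parsed.entrySet()) {
    doc.addField(e.getKey(), e.getValue()); // no POJO in sight
  }
  return doc;
}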

Regards,
   Alex.


Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 17 August 2016 at 02:49, Jennifer Coston
 wrote:
>
> Hello,
> I am trying to write a data service using SolrJ that will allow me to
> accept JSON through a REST API, create a Solr document, and write it to
> multiple different Solr cores (depending on the core name specified). The
> problem I am running into is that each core is going to have a different
> schema. My current code has the common fields between all the schemas in a
> data POJO which I then walk and set the values specified in the JSON to the
> Solr Document. However, I don’t want to create a different class for each
> schema to process the JSON and convert it to a Solr Document. Is there a
> way to process the extra JSON fields that are not common between the
> schemas and add them to the Solr Document, without knowing what they are
> ahead of time? Is there a way to convert JSON to a Solr Document without
> having to use a POJO?  An alternative I was looking into is to use the
> SolrClient to get the schema fields, create a POJO, walk that POJO to
> create a Solr Document and then add it to Solr but, it doesn’t seem to be
> possible to obtain the fields this way.
>
> I know that the easiest way to add JSON to Solr would be to use a curl
> command and send the JSON directly to Solr but this doesn’t match our
> requirements, so I need to figure out a way to perform the same operation
> using SolrJ. Any other ideas or suggestions would be greatly appreciated!
>
> Thank you,
>
> -Jennifer


Re: solr date range query

2016-08-16 Thread Alexandre Rafalovitch
Solr does support a Date Range field, though it is not super documented:
https://cwiki.apache.org/confluence/display/solr/Working+with+Dates
http://wiki.apache.org/solr/DateRangeField
https://issues.apache.org/jira/browse/SOLR-6103

There is also an older trick of using Spatial to index date ranges. It
takes a bit to wrap the head around, but is quite interesting.
https://wiki.apache.org/solr/SpatialForTimeDurations
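
A small sketch of the DateRangeField approach (the type and field names
are placeholders):

<!-- schema.xml -->
<fieldType name="dateRange" class="solr.DateRangeField"/>
<field name="event_dates" type="dateRange" indexed="true" stored="true"/>

Index each event as a range, e.g. event3 as "[2016-08-01 TO 2016-08-07]",
and then ongoing events that overlap the user's range come back with:

fq={!field f=event_dates op=Intersects}[2016-08-02 TO 2016-08-05]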

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 16 August 2016 at 21:51, solr2020  wrote:
> Hi,
>
> We have a list of events with event start date and end date. For eg:
> event1 starts @ 2nd Aug 2016 ends @ 3rd Aug 2016
> event2 starts @ 4th Aug 2016 ends @ 5th Aug 2016
> event3 starts @ 1st Aug 2016 ends @ 7th Aug 2016
> event4 starts @ 15th july 2016 ends @ 15th Aug 2016
>
> when the user selects a date range of Aug 2nd to Aug 5th 2016 we are able to fetch
> event1 and event2 with a start and end date range query (Aug 2nd TO Aug 5th).
> But as event3 and event4 are also ongoing events we need to fetch those as
> well. How can this be achieved?
>
> Thanks.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-date-range-query-tp4291918.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Modified stat of index

2016-08-16 Thread Alexandre Rafalovitch
I believe you can get that via Luke REST API:
http://localhost:8983/solr//admin/luke
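
For example (the core name is a placeholder; the timestamp is in the
"index" section of the response):

curl 'http://localhost:8983/solr/mycore/admin/luke?numTerms=0&wt=json'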

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 17 August 2016 at 07:18, Scott Derrick  wrote:
> I need to retrieve the last modified timestamp of my search index.
>
> Is there a query I can use or is it stored in a particular file?
>
> thanks,
>
> Scott
>
> --
> One man's "magic" is another man's engineering. "Supernatural" is a null
> word."
> Robert A. Heinlein
>


Re: Indexing (posting document) taking a lot of time

2016-08-16 Thread Alexandre Rafalovitch
What format are those documents? Solr XML? Custom JSON?

Or are you sending PDF/binary documents to Solr's extract handler and
asking it to do the extraction of the useful stuff? If the latter, you
could take that step out of Solr with a custom client using Tika (what
Solr has under the hood) and only send to Solr the processed output.
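
A minimal sketch of that approach (the file path, field names and core
URL are placeholders, and exception handling is elided):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

void indexOneFile() throws Exception {
  // Extract the text client-side with Tika...
  AutoDetectParser parser = new AutoDetectParser();
  BodyContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
  Metadata meta = new Metadata();
  try (InputStream in = Files.newInputStream(Paths.get("doc.pdf"))) {
    parser.parse(in, text, meta);
  }

  // ...and send Solr only the processed output.
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "doc-1");
  doc.addField("content", text.toString());
  try (SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
    solr.add(doc);
    solr.commit();
  }
}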

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 16 August 2016 at 22:49, kshitij tyagi  wrote:
> 400kb is the size of a single document and I am sending 100 documents per request.
> Solr heap size is 16gb and it is running multithreaded.
>
> On Tue, Aug 16, 2016 at 5:10 PM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:
>
>> Hi,
>>
>> 400KB/doc * 100doc = 40MB. If you are running it single threaded, Solr
>> will be idle while accepting a relatively large request. Or is 400KB the
>> size of the 100-doc bulk that you are sending?
>>
>> What is Solr's heap size? I would try increasing number of threads and
>> monitor Solr's heap/CPU/IO to see where is the bottleneck.
>>
>> How complex is fields' analysis?
>>
>> Regards,
>> Emir
>>
>>
>> On 16.08.2016 13:25, kshitij tyagi wrote:
>>
>>> hi,
>>>
>>> we are sending about 100 documents per request for indexing. We have
>>> autocommit set to false and commit only when 1 documents are
>>> present. Solr and the machine sending requests are in the same pool.
>>>
>>>
>>>
>>> On Tue, Aug 16, 2016 at 4:51 PM, Emir Arnautovic <
>>> emir.arnauto...@sematext.com> wrote:
>>>
>>> Hi,

 Do you send one doc per request? How frequently do you commit? Where is
 Solr running? What is network connection between your machine and Solr?
 What are JVM settings? Is 10-30s for entire indexing or single doc?

 Regards,
 Emir


 On 16.08.2016 11:34, kshitij tyagi wrote:

 Hi alexandre,
>
> 1 document of 400kb size is taking approx. 10-30 sec and this is
> varying. I am posting the document using curl
>
> On Tue, Aug 16, 2016 at 2:11 PM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> wrote:
>
> How many records is that and what is 'slow'? Also is this standalone or
>
>> cluster setup?
>>
>> On 16 Aug 2016 6:33 PM, "kshitij tyagi" 
>> wrote:
>>
>> Hi,
>>
>>> I am indexing a lot of data about 8GB, but it is taking a lot of
>>> time. I
>>> have read about maxBufferedDocs, ramBufferSizeMB, merge policy ,etc in
>>> solrconfig file.
>>>
>>> It would be helpful if someone could help me tune the settings for
>>> faster indexing speeds.
>>>
>>> *I have read the docs but am not able to understand what exactly
>>> changing these configs means.*
>>>
>>>
>>> *Regards,*
>>> *Kshitij*
>>>
>>>
>>> --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/



>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>


Re: Request to add probabilistic Query Parser Request Handler

2016-08-16 Thread Walter Underwood
In a search engine, “probabilistic” usually refers to a ranking model, as 
opposed to a vector space model.

This name will almost certainly confuse people.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 16, 2016, at 3:16 PM, Akash Mehta  wrote:
> 
> UserName is mehtakash93.
> 
> On 16 August 2016 at 15:11, Akash Mehta  wrote:
> 
>> The main aim of this request Handler is to get the best parsing for a
>> given query. This basically means recognizing different phrases within the
>> query. We need some kind of training data to generate these phrases.
>> 



Re: Request to add probabilistic Query Parser Request Handler

2016-08-16 Thread Akash Mehta
UserName is mehtakash93.

On 16 August 2016 at 15:11, Akash Mehta  wrote:

> The main aim of this request Handler is to get the best parsing for a
> given query. This basically means recognizing different phrases within the
> query. We need some kind of training data to generate these phrases.
>


Request to add probabilistic Query Parser Request Handler

2016-08-16 Thread Akash Mehta
The main aim of this request Handler is to get the best parsing for a given
query. This basically means recognizing different phrases within the query.
We need some kind of training data to generate these phrases.


Modified stat of index

2016-08-16 Thread Scott Derrick

I need to retrieve the last modified timestamp of my search index.

Is there a query I can use or is it stored in a particular file?

thanks,

Scott

--
One man's "magic" is another man's engineering. "Supernatural" is a null word."
Robert A. Heinlein



Re: What's the best practices for indexing XML Content with dynamic XML Elements (SOLR 6.1) ?

2016-08-16 Thread Stan Lee
Sorry for not being specific. I believe this SOLR plugin (LUX) may fit my
scenario (query without knowing the tag in advance).
http://luxdb.org/README.html

On Tue, Aug 16, 2016 at 12:18 PM, Erick Erickson 
wrote:

> You haven't really described the scenario you want
> to implement. I get that you have raw XML of an
> unknown structure. What do you want to _do_ with that?
>
> 1> if all you want to do is index the data (i.e. strip the tags)
> try HTMLStripCharFilterFactory.
> 2> If you want to intelligently take content of the XML
> and ingest it into specific Solr fields, I don't think you'll be
> able to do that without writing some specific code to
> parse the XML, explore it and "do the right thing" with it
> which will probably involve SolrJ, an XML parser and
> some programming.
>
> Best,
> Erick
>
> On Tue, Aug 16, 2016 at 6:15 AM, Stan Lee  wrote:
> > We currently have a Microsoft SQL table with a XML datatype. We use DIH
> to
> > import the XML Content as is, that is not using the XPathEntityProcessor.
> > If the elements of the XML content are known, XPathEntityProcessor makes sense.
> Could
> > someone kindly suggest the right way of handling such scenario, without
> > impacting search performance?
> > Which tokenizer should we be using?
> >
> >
> > Thanks.
>


Re: Need to understand solr merging and commit relationship

2016-08-16 Thread kshitij tyagi
I have 2 solr cores on a machine with the same configs.

The problem is that I am getting faster indexing speed on core1 and slower on core2.

Both cores have same index size and configuration.

On Tue, Aug 16, 2016 at 11:34 PM, Erick Erickson 
wrote:

> Why? What is the problem you're facing that you hope
> understanding more about these will help?
>
> Here are two places to start:
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> In general every time you do a hard commit the Lucene index is checked
> to see if there are segments that should be merged. If so, then a
> background
> thread is kicked off to start merging selected segments. Which segments
> is decided by the MergePolicy in effect (TieredMergePolicy is the default).
>
> Best,
> Erick
>
> On Tue, Aug 16, 2016 at 10:47 AM, kshitij tyagi
>  wrote:
> > I need to understand clearly whether there is any relationship between solr
> > merging and solr commit.
> >
> > If there is then what is it?
> >
> > Also, I need to understand how both of these affect indexing speed on the
> > core.
>


Re: Need to understand solr merging and commit relationship

2016-08-16 Thread Erick Erickson
Why? What is the problem you're facing that you hope
understanding more about these will help?

Here are two places to start:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

In general every time you do a hard commit the Lucene index is checked
to see if there are segments that should be merged. If so, then a background
thread is kicked off to start merging selected segments. Which segments
is decided by the MergePolicy in effect (TieredMergePolicy is the default).
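
For reference, tuning the default policy looks like this in
solrconfig.xml (Solr 5.5+ syntax; the values shown are the defaults):

<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>   <!-- max segments merged at once -->
    <int name="segmentsPerTier">10</int>  <!-- allowed segments per tier -->
  </mergePolicyFactory>
</indexConfig>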

Best,
Erick

On Tue, Aug 16, 2016 at 10:47 AM, kshitij tyagi
 wrote:
> I need to understand clearly whether there is any relationship between solr
> merging and solr commit.
>
> If there is then what is it?
>
> Also, I need to understand how both of these affect indexing speed on the
> core.


Need to understand solr merging and commit relationship

2016-08-16 Thread kshitij tyagi
I need to understand clearly whether there is any relationship between solr
merging and solr commit.

If there is then what is it?

Also, I need to understand how both of these affect indexing speed on the
core.


Re: Multiple rollups/facets in one streaming aggregation?

2016-08-16 Thread Joel Bernstein
For the initial implementation we could skip the merge piece if that helps
get things done faster. In this scenario the metrics could be gathered
after some parallel operation, then there would be no need for a merge.
Sample syntax:

metrics(parallel(join()))


Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Aug 16, 2016 at 1:25 PM, Joel Bernstein  wrote:

> The concept of a MetricStream was in the early designs but hasn't yet been
> implemented. Now might be a good time to work on the implementation.
>
> The MetricStream wraps a stream and gathers metrics in memory, continuing
> to emit the tuples from the underlying stream. This allows multiple
> MetricStreams to operate over the same stream without transforming the
> stream. Pseudo code for a metric expression syntax is below:
>
> metrics(metrics(search()))
>
> The MetricStream delivers its metrics through the EOF Tuple. So the
> MetricStream simply adds the finished aggregations to the EOF Tuple and
> returns it. If we're going to support parallel metric gathering then we'll
> also need to support the merging of the metrics. Something like this:
>
> metrics(parallel(metrics(join())))
>
> Where the metrics wrapping the parallel function would need to collect the
> EOF tuples from each worker, merge the metrics, and then emit the
> merged metrics in an EOF Tuple.
>
> If you think this meets your needs, feel free to create a Jira and
> begin a patch and I can help get it committed.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 16, 2016 at 11:52 AM, Radu Gheorghe <
> radu.gheor...@sematext.com> wrote:
>
>> Hello Solr users :)
>>
>> Right now it seems that if I want to rollup on two different fields
>> with streaming expressions, I would need to do two separate requests.
>> This is too slow for our use-case, when we need to do joins before
>> sorting and rolling up (because we'd have to re-do the joins).
>>
>> Since in our case we are actually looking for some not-necessarily
>> accurate facets (top N), the best solution we could come up with was
>> to implement a new stream decorator that implements an algorithm like
>> Count-min sketch[1] which would run on the tuples provided by the
>> stream function it wraps. This would have two big wins for us:
>> 1) it would do the facet without needing to sort on the facet field,
>> so we'll potentially save lots of memory
>> 2) because sorting isn't needed, we could do multiple facets in one go
>>
>> That said, I have two (broad) questions:
>> A) is there a better way of doing this? Let's reduce the problem to
>> streaming aggregations, where the assumption is that we have multiple
>> collections where data needs to be joined, and then facet on fields
>> from all collections. But maybe there's a better algorithm, something
>> out of the box or closer to what is offered out of the box?
>> B) whatever the best way is, could we do it in a way that can be
>> contributed back to Solr? Any hints on how to do that? Just another
>> decorator?
>>
>> Thanks and best regards,
>> Radu
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>> [1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
>>
>
>


Re: Multiple rollups/facets in one streaming aggregation?

2016-08-16 Thread Joel Bernstein
The concept of a MetricStream was in the early designs but hasn't yet been
implemented. Now might be a good time to work on the implementation.

The MetricStream wraps a stream and gathers metrics in memory, continuing
to emit the tuples from the underlying stream. This allows multiple
MetricStreams to operate over the same stream without transforming the
stream. Pseudo code for a metric expression syntax is below:

metrics(metrics(search()))

The MetricStream delivers its metrics through the EOF Tuple. So the
MetricStream simply adds the finished aggregations to the EOF Tuple and
returns it. If we're going to support parallel metric gathering then we'll
also need to support the merging of the metrics. Something like this:

metrics(parallel(metrics(join())))

Where the metrics wrapping the parallel function would need to collect the
EOF tuples from each worker, merge the metrics, and then emit the
merged metrics in an EOF Tuple.
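
Purely to make that concrete (this syntax is hypothetical, since none of
it is implemented yet), a single pass gathering two sets of metrics over
a join might look something like:

metrics(
  metrics(
    innerJoin(search(people, ...), search(pets, ...), on="personId"),
    sum(petAge), by="state"),
  count(*), by="petType")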

If you think this meets your needs, feel free to create a Jira and
begin a patch and I can help get it committed.


Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Aug 16, 2016 at 11:52 AM, Radu Gheorghe 
wrote:

> Hello Solr users :)
>
> Right now it seems that if I want to rollup on two different fields
> with streaming expressions, I would need to do two separate requests.
> This is too slow for our use-case, when we need to do joins before
> sorting and rolling up (because we'd have to re-do the joins).
>
> Since in our case we are actually looking for some not-necessarily
> accurate facets (top N), the best solution we could come up with was
> to implement a new stream decorator that implements an algorithm like
> Count-min sketch[1] which would run on the tuples provided by the
> stream function it wraps. This would have two big wins for us:
> 1) it would do the facet without needing to sort on the facet field,
> so we'll potentially save lots of memory
> 2) because sorting isn't needed, we could do multiple facets in one go
>
> That said, I have two (broad) questions:
> A) is there a better way of doing this? Let's reduce the problem to
> streaming aggregations, where the assumption is that we have multiple
> collections where data needs to be joined, and then facet on fields
> from all collections. But maybe there's a better algorithm, something
> out of the box or closer to what is offered out of the box?
> B) whatever the best way is, could we do it in a way that can be
> contributed back to Solr? Any hints on how to do that? Just another
> decorator?
>
> Thanks and best regards,
> Radu
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
> [1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
>


Creating a SolrJ Data Service to send JSON to Solr

2016-08-16 Thread Jennifer Coston

Hello,
I am trying to write a data service using SolrJ that will allow me to
accept JSON through a REST API, create a Solr document, and write it to
multiple different Solr cores (depending on the core name specified). The
problem I am running into is that each core is going to have a different
schema. My current code has the common fields between all the schemas in a
data POJO which I then walk and set the values specified in the JSON to the
Solr Document. However, I don’t want to create a different class for each
schema to process the JSON and convert it to a Solr Document. Is there a
way to process the extra JSON fields that are not common between the
schemas and add them to the Solr Document, without knowing what they are
ahead of time? Is there a way to convert JSON to a Solr Document without
having to use a POJO?  An alternative I was looking into is to use the
SolrClient to get the schema fields, create a POJO, walk that POJO to
create a Solr Document and then add it to Solr but, it doesn’t seem to be
possible to obtain the fields this way.

I know that the easiest way to add JSON to Solr would be to use a curl
command and send the JSON directly to Solr but this doesn’t match our
requirements, so I need to figure out a way to perform the same operation
using SolrJ. Any other ideas or suggestions would be greatly appreciated!

Thank you,

-Jennifer


Re: SolrJ for .NET / C#

2016-08-16 Thread Joe Lawson
On Tue, Aug 16, 2016 at 12:24 PM, GW  wrote:

> Interesting, I managed to do Solr SQL
>
It is true that pretty much all operations still work by calling a
collection API directly. The benefits I'm referring to are dynamic cluster
state discovery, routing of requests automatically based on the state,
proper POST and query operations that interact without depending on
inter-cluster routing. Basically removing/abstracting away operational
concerns from the application itself.
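
For contrast, a minimal SolrJ illustration of what that buys you (the ZK
addresses and collection name are placeholders; exceptions elided):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

// The client watches ZooKeeper: it discovers live nodes, routes each
// update to the right shard leader, and fails over when nodes drop.
void smokeTest() throws Exception {
  try (CloudSolrClient solr =
           new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
    solr.setDefaultCollection("mycollection");
    long hits = solr.query(new SolrQuery("*:*")).getResults().getNumFound();
    System.out.println(hits);
  }
}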


Re: Delete replica on down node, after start down node, the deleted replica comes back.

2016-08-16 Thread Erick Erickson
Right, when you restart the downed node, all the
structure is still on disk, i.e. the index is there, the
core.properties file is there etc. I'm assuming you
use the collections DELETEREPLICA command.

Now when Solr starts up on that node, it uses
"core discovery" to find all the "core.properties" files
and reads the information there which includes
the collection and replica that that core belongs to
and registers itself in Zookeeper, thus the node
"comes back".

To get it to be gone permanently either
1> use DELETEREPLICA when the node is running
or
2> nuke the entire directory on the downed machine.
Actually, just renaming core.properties to something
else would do.
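
For option 1>, the call looks like this (collection/shard/replica names
are placeholders):

curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node3'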

Here's a bit about core.properties
https://cwiki.apache.org/confluence/display/solr/Defining+core.properties

Best,
Erick

On Tue, Aug 16, 2016 at 12:53 AM, Jerome Yang  wrote:
> Hi all,
>
> I ran into some strange behavior.
> Both on solr6.1 and solr5.3.
>
> For example, there are 4 nodes in cloud mode, one of them is stopped.
> Then I delete a replica on the down node.
> After that I start the down node.
> The deleted replica comes back.
>
> Is this a normal behavior?
>
> Same situation.
> 4 nodes, 1 node is down.
> And I delete a collection.
> After start the down node.
> Replicas in the down node of that collection come back again.
> And I cannot use the collection api DELETE to delete it.
> It says that the collection does not exist.
> But if I use the CREATE action to create a collection with the same name, it
> says the collection already exists.
> The only way to make things right is to clean it up manually from zookeeper
> and the data directory.
>
> How to prevent this happen?
>
> Regards,
> Jerome


Re: SolrJ for .NET / C#

2016-08-16 Thread GW
Interesting, I managed to do Solr SQL

On 16 August 2016 at 12:22, Joe Lawson 
wrote:

> The sad part of doing plain old REST requests is you basically miss out on
> all the SolrCloud features that are inherent in client call optimization
> and collection discovery. It would be nice if some companies made /contrib
> offerings for different languages that could be better maintained.
>
> Most REST clients are stuck in a pre-SolrCloud world or master/slave
> configuration and that paradigm is going away.
>
> On Tue, Aug 16, 2016 at 10:43 AM, GW  wrote:
>
> > The client that comes with PHP is lame. If installed you should
> un-install
> > php5-solr and install the Pecl/Pear libs which are good to the end of 5.x
> > and 6.01. It tanks with 6.1.
> >
> > I defer to my own effort of changing everything to plain old REST
> requests.
> >
> > On 16 August 2016 at 10:39, GW  wrote:
> >
> > > As long as you are .NET you will be last in line. You try using the
> REST
> > > API. All you get with a .NET/C# lib is a wrapper for the REST API.
> > >
> > >
> > >
> > > On 16 August 2016 at 09:08, Joe Lawson  > opensourceconnections.com>
> > > wrote:
> > >
> > >> All I have seen is SolrNET, forks of SolrNET and people using
> RestSharp.
> > >>
> > >> On Tue, Aug 16, 2016 at 9:01 AM, Eirik Hungnes 
> > >> wrote:
> > >>
> > >> > Hi
> > >> >
> > >> > I have been looking around for a library for .NET / C#. We are
> > currently
> > >> > using SolrNet, but that is ofc not as well equipped as SolrJ, and
> have
> > >> > heard rumors occasionally about someone, also Lucene, has been
> working
> > >> on a
> > >> > port to other languages?
> > >> >
> > >> > --
> > >> > Best regards,
> > >> >
> > >> > Eirik
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> -Joe
> > >>
> > >
> > >
> >
>
>
>
> --
> -Joe
>


Re: SolrJ for .NET / C#

2016-08-16 Thread Joe Lawson
The sad part of doing plain old REST requests is you basically miss out on
all the SolrCloud features that are inherent in client call optimization
and collection discovery. It would be nice if some companies made /contrib
offerings for different languages that could be better maintained.

Most REST clients are stuck in a pre-SolrCloud world or master/slave
configuration and that paradigm is going away.

On Tue, Aug 16, 2016 at 10:43 AM, GW  wrote:

> The client that comes with PHP is lame. If installed you should un-install
> php5-solr and install the Pecl/Pear libs which are good to the end of 5.x
> and 6.01. It tanks with 6.1.
>
> I defer to my own effort of changing everything to plain old REST requests.
>
> On 16 August 2016 at 10:39, GW  wrote:
>
> > As long as you are .NET you will be last in line. You try using the REST
> > API. All you get with a .NET/C# lib is a wrapper for the REST API.
> >
> >
> >
> > On 16 August 2016 at 09:08, Joe Lawson  opensourceconnections.com>
> > wrote:
> >
> >> All I have seen is SolrNET, forks of SolrNET and people using RestSharp.
> >>
> >> On Tue, Aug 16, 2016 at 9:01 AM, Eirik Hungnes 
> >> wrote:
> >>
> >> > Hi
> >> >
> >> > I have been looking around for a library for .NET / C#. We are
> currently
> >> > using SolrNet, but that is ofc not as well equipped as SolrJ, and have
> >> > heard rumors occasionally about someone, also Lucene, has been working
> >> on a
> >> > port to other languages?
> >> >
> >> > --
> >> > Best regards,
> >> >
> >> > Eirik
> >> >
> >>
> >>
> >>
> >> --
> >> -Joe
> >>
> >
> >
>



-- 
-Joe


Re: What's the best practices for indexing XML Content with dynamic XML Elements (SOLR 6.1) ?

2016-08-16 Thread Erick Erickson
You haven't really described the scenario you want
to implement. I get that you have raw XML of an
unknown structure. What do you want to _do_ with that?

1> if all you want to do is index the data (i.e. strip the tags)
try HTMLStripCharFilterFactory (see the sketch below).
2> If you want to intelligently take content of the XML
and ingest it into specific Solr fields, I don't think you'll be
able to do that without writing some specific code to
parse the XML, explore it and "do the right thing" with it
which will probably involve SolrJ, an XML parser and
some programming.
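
For option 1>, a minimal sketch of such a field type (the names are
placeholders):

<fieldType name="text_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- removes the XML/HTML markup before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>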

Best,
Erick

On Tue, Aug 16, 2016 at 6:15 AM, Stan Lee  wrote:
> We currently have a Microsoft SQL table with a XML datatype. We use DIH to
> import the XML Content as is, that is not using the XPathEntityProcessor.
> If the elements of the XML content are known, XPathEntityProcessor makes sense. Could
> someone kindly suggest the right way of handling such scenario, without
> impacting search performance?
> Which tokenizer should we be using?
>
>
> Thanks.


Re:

2016-08-16 Thread Erick Erickson
Please follow the unsubscribe instructions here:
http://lucene.apache.org/solr/resources.html

You must use the _exact_ e-mail address you first subscribed with.

Let us know if that doesn't work.

Best,
Erick

On Tue, Aug 16, 2016 at 7:41 AM, Rose, John B  wrote:
> unsubscribe


Multiple rollups/facets in one streaming aggregation?

2016-08-16 Thread Radu Gheorghe
Hello Solr users :)

Right now it seems that if I want to rollup on two different fields
with streaming expressions, I would need to do two separate requests.
This is too slow for our use-case, when we need to do joins before
sorting and rolling up (because we'd have to re-do the joins).

Since in our case we are actually looking for some not-necessarily
accurate facets (top N), the best solution we could come up with was
to implement a new stream decorator that implements an algorithm like
Count-min sketch[1] which would run on the tuples provided by the
stream function it wraps. This would have two big wins for us:
1) it would do the facet without needing to sort on the facet field,
so we'll potentially save lots of memory
2) because sorting isn't needed, we could do multiple facets in one go
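
For anyone unfamiliar with [1], the core of the algorithm is tiny. A toy
Java version (not Solr code, just to illustrate why it avoids the sort:
fixed memory, approximate counts that never under-estimate):

import java.util.Random;

public class CountMinSketch {
  private final long[][] table;   // depth rows x width counters
  private final int[] seeds;      // one hash seed per row
  private final int width;

  public CountMinSketch(int depth, int width) {
    this.table = new long[depth][width];
    this.seeds = new int[depth];
    this.width = width;
    Random r = new Random(42);
    for (int i = 0; i < depth; i++) seeds[i] = r.nextInt();
  }

  private int bucket(int row, String key) {
    // toy hash family; a real one would use murmur or similar
    return Math.floorMod(key.hashCode() ^ seeds[row], width);
  }

  public void add(String key) {                 // one tuple streams by
    for (int row = 0; row < table.length; row++)
      table[row][bucket(row, key)]++;
  }

  public long estimate(String key) {            // over-estimates only
    long min = Long.MAX_VALUE;
    for (int row = 0; row < table.length; row++)
      min = Math.min(min, table[row][bucket(row, key)]);
    return min;
  }
}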

That said, I have two (broad) questions:
A) is there a better way of doing this? Let's reduce the problem to
streaming aggregations, where the assumption is that we have multiple
collections where data needs to be joined, and then facet on fields
from all collections. But maybe there's a better algorithm, something
out of the box or closer to what is offered out of the box?
B) whatever the best way is, could we do it in a way that can be
contributed back to Solr? Any hints on how to do that? Just another
decorator?

Thanks and best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

[1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch


Re: SolrJ for .NET / C#

2016-08-16 Thread GW
The client that comes with PHP is lame. If installed you should un-install
php5-solr and install the Pecl/Pear libs which are good to the end of 5.x
and 6.0.1. It tanks with 6.1.

I defer to my own effort of changing everything to plain old REST requests.

On 16 August 2016 at 10:39, GW  wrote:

> As long as you are .NET you will be last in line. You try using the REST
> API. All you get with a .NET/C# lib is a wrapper for the REST API.
>
>
>
> On 16 August 2016 at 09:08, Joe Lawson 
> wrote:
>
>> All I have seen is SolrNET, forks of SolrNET and people using RestSharp.
>>
>> On Tue, Aug 16, 2016 at 9:01 AM, Eirik Hungnes 
>> wrote:
>>
>> > Hi
>> >
>> > I have been looking around for a library for .NET / C#. We are currently
>> > using SolrNet, but that is ofc not as well equipped as SolrJ, and have
>> > heard rumors occasionally about someone, also Lucene, has been working
>> on a
>> > port to other languages?
>> >
>> > --
>> > Best regards,
>> >
>> > Eirik
>> >
>>
>>
>>
>> --
>> -Joe
>>
>
>


solr-user@lucene.apache.org

2016-08-16 Thread Rose, John B
unsubscribe


Re: SolrJ for .NET / C#

2016-08-16 Thread GW
As long as you are .NET you will be last in line. You try using the REST
API. All you get with a .NET/C# lib is a wrapper for the REST API.



On 16 August 2016 at 09:08, Joe Lawson 
wrote:

> All I have seen is SolrNET, forks of SolrNET and people using RestSharp.
>
> On Tue, Aug 16, 2016 at 9:01 AM, Eirik Hungnes  wrote:
>
> > Hi
> >
> > I have been looking around for a library for .NET / C#. We are currently
> > using SolrNet, but that is ofc not as well equipped as SolrJ, and have
> > heard rumors occasionally about someone, also Lucene, has been working
> on a
> > port to other languages?
> >
> > --
> > Best regards,
> >
> > Eirik
> >
>
>
>
> --
> -Joe
>


Re: SolrJ for .NET / C#

2016-08-16 Thread Shawn Heisey
On 8/16/2016 7:01 AM, Eirik Hungnes wrote:
> I have been looking around for a library for .NET / C#. We are
> currently using SolrNet, but that is ofc not as well equipped as
> SolrJ, and have heard rumors occasionally about someone, also Lucene,
> has been working on a port to other languages?

The only client that the Solr project maintains is SolrJ -- the Java
client.  This client is an integral part of Solr itself, so it is kept
up to date.  Naturally this is the client that we recommend, but
sometimes the choice of development language does not include Java.

Clients for any other programming language are third-party software.  We
have no control over that software, and changes in new versions of Solr
will occasionally break those clients.  For instance, one of the main
Solr clients for PHP was broken by a change in Solr 4.0, and it took the
maintainers of that client a LONG time to fix the problem.

I have mentioned the possibility of having the project build/maintain
clients for other languages, or perhaps have some of them absorbed into
the project (if the license is compatible) but nobody has volunteered to
take on the task.  I don't have much experience with those programming
languages.

You can find information about third-party clients here:

https://wiki.apache.org/solr/IntegratingSolr

There are some .NET clients there.  The most recent of them was last
updated a year ago.

Thanks,
Shawn



Re: Inconsistent results with solr admin ui and solrj

2016-08-16 Thread Jan Høydahl
I’m not sure of the root cause for your problem.
Solr is built to stay in sync automatically, so there is no need to script 
anything in that regard.
There may be something with your environment, network, ZooKeeper setup or 
similar that caused the state you were in. I would need to dig further into the 
system to diagnose, such as looking at state.json files, live_nodes znode etc. 
But really, this should JustWork™ :)

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 16 Aug 2016, at 15:58, Pranaya Behera  wrote:
> 
> Hi,
> I did as you said, now it is coming ok.
> And what are the things to look for while checking on these kinds of 
> issues, such as mismatched counts, a LukeRequest not returning all the fields, etc.? 
> The doc sync is one; how can I programmatically use that info and sync them? 
> Is there any method in solrj?
> 
> On 16/08/16 14:50, Jan Høydahl wrote:
>> Hi,
>> 
>> There is clearly something wrong when your two replicas are not in sync. 
>> Could you go to the “Cloud->Tree” tab of admin UI and look in the overseer 
>> queue whether you find signs of stuck jobs or something?
>> Btw - what warnings do you see in the logs? Anything repeatedly popping up?
>> 
>> I would also try the following:
>> 1. Take down the node hosting replica 1 (assuming that replica2 is the 
>> correct, most current)
>> 2. Manually empty the data folder
>> 3. Take the node up again
>> 4. Verify that a full index recovery happens, and that they get back in sync
>> 5. Run your indexing procedure.
>> 6. Verify that both replicas are still in sync
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> On 16 Aug 2016, at 06:51, Pranaya Behera  wrote:
>>> 
>>> Hi,
>>> a.) Yes index is static, not updated live. We index new documents over old 
>>> documents by this sequence: delete all docs, add 10 freshly fetched from the db, 
>>> and after adding all the docs to the cloud instance, commit. Commit happens only 
>>> once per collection.
>>> b.) I took one shard and below are the results for the each replica, it has 
>>> 2 replica.
>>> Replica - 2
>>> Last Modified: 33 minutes ago
>>> Num Docs: 127970
>>> Max Doc: 127970
>>> Heap Memory Usage: -1
>>> Deleted Docs: 0
>>> Version: 14530
>>> Segment Count: 5
>>> Optimized: yes
>>> Current: yes
>>> Data:  /var/solr/data/product_shard1_replica2/data
>>> Index: /var/solr/data/product_shard1_replica2/data/index.20160816040537452
>>> Impl:  org.apache.solr.core.NRTCachingDirectoryFactory
>>> 
>>> Replica - 1
>>> Last Modified: about 19 hours ago
>>> Num Docs: 234013
>>> Max Doc: 234013
>>> Heap Memory Usage: -1
>>> Deleted Docs: 0
>>> Version: 14272
>>> Segment Count: 7
>>> Optimized: yes
>>> Current: no
>>> Data:  /var/solr/data/product_shard1_replica1/data
>>> Index: /var/solr/data/product_shard1_replica1/data/index
>>> Impl:  org.apache.solr.core.NRTCachingDirectoryFactory
>>> 
>>> c.) With the admin ui: if I query for all, *:* it gives different numFound 
>>> each time.
>>> e.g.
>>> 1.
>>>
>>> { "responseHeader":{ "zkConnected":true, "status":0, "QTime":7, "params":{
>>> "q":"*:*", "indent":"on", "wt":"json", "_":"1471322871767"}},
>>> "response":{"numFound":452300,"start":0,"maxScore":1.0, ...
>>> 2.
>>> { "responseHeader":{ "zkConnected":true, "status":0, "QTime":23,
>>> "params":{ "q":"*:*", "indent":"on", "wt":"json", "_":"1471322871767"}},
>>> "response":{"numFound":574013,"start":0,"maxScore":1.0, ...
>>> This is queried live from the solr instances.
>>> 
>>> It happens with any type of query, if I search in the parent document or 
>>> search through child documents to get parents. Sorting is used in both 
>>> cases but with a different field; while doing a block join query, sorting is on 
>>> the child document field, otherwise on the parent document field.
>>> 
>>> d.) I dont find any errors in the logs. All warnings only.
>>> 
>>> On 14/08/16 02:56, Jan Høydahl wrote:
 Could it be that your cluster is not in sync, so that when Solr picks 
 three nodes, results will vary depending on what replica answers?
 
 A few questions:
 
 a) Is your index static, i.e. not being updated live?
 b) Can you try to go directly to the core menu of both replicas for each 
 shard, and compare numDocs / maxDocs for each? Both replicas in each shard 
 should have same count.
 c) What are you querying on and sorting by? Does it happen with only one 
 query and sorting?
 d) Are there any errors in the logs?
 
 If possible, please share some queries, responses, config, screenshots etc.
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 
> On 13 Aug 2016, at 12:10, Pranaya Behera  wrote:
> 
> Hi,
> I am running solr 6.1.0 with solrcloud. We have 3 instances of 
> zookeeper and 3 instances of solrcloud. All three of them are active and 
> up. One collection has 3 shards, each shard has 2 replicas.
> 
> Every time I query, whether from solrj or the admin ui, I get inconsistent results.

We are not the leader

2016-08-16 Thread Tamás Barta
Hi,

We have two Solr 5.4.1 instances running in a ZK cluster. The system worked
well for months, but now something has happened.

Node1 is in "recovery" state (we didn't restart it and didn't do anything
with it) and Node2 is the only active one. The problem is that Node2 says that
"We are not the leader" and sometimes says that "ClusterState says we are
the leader but locally we don't think so. Request came from null".

In ZK I see that the leader is Node2 but that node denies it. So I can't
start Node1 now because Node2 tells it to leave it alone. Node2 is the
only server which receives user requests so I can't restart it. Restarting
Node1 and ZK nodes doesn't solve the problem.

Could you help me understand how this could happen, and what should I do to
fix the system?

Thanks,
Tamas


Re: Inconsistent results with solr admin ui and solrj

2016-08-16 Thread Pranaya Behera

Hi,
 I did as you said, and now it is coming back ok.
And what are the things to look for while checking on these kinds of 
issues, such as mismatched counts, a LukeRequest not returning all the fields, 
etc.? The doc sync is one; how can I programmatically use that info and 
sync them? Is there any method in solrj?


On 16/08/16 14:50, Jan Høydahl wrote:

Hi,

There is clearly something wrong when your two replicas are not in sync. Could you 
go to the “Cloud->Tree” tab of the admin UI and look in the overseer queue to see 
whether you find signs of stuck jobs or something?
Btw - what warnings do you see in the logs? Anything repeatedly popping up?

I would also try the following:
1. Take down the node hosting replica 1 (assuming that replica2 is the correct, 
most current)
2. Manually empty the data folder
3. Take the node up again
4. Verify that a full index recovery happens, and that they get back in sync
5. Run your indexing procedure.
6. Verify that both replicas are still in sync

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


16. aug. 2016 kl. 06.51 skrev Pranaya Behera :

Hi,
a.) Yes, the index is static, not updated live. We index new documents over old 
documents in this sequence: delete all docs, add 10 freshly fetched from the db, 
and after adding all the docs to the cloud instance, commit. Commit happens only once 
per collection.
b.) I took one shard and below are the results for the each replica, it has 2 
replica.
Replica - 2
Last Modified: 33 minutes ago
Num Docs: 127970
Max Doc: 127970
Heap Memory Usage: -1
Deleted Docs: 0
Version: 14530
Segment Count: 5
Optimized: yes
Current: yes
Data:  /var/solr/data/product_shard1_replica2/data
Index: /var/solr/data/product_shard1_replica2/data/index.20160816040537452
Impl:  org.apache.solr.core.NRTCachingDirectoryFactory

Replica - 1
Last Modified: about 19 hours ago
Num Docs: 234013
Max Doc: 234013
Heap Memory Usage: -1
Deleted Docs: 0
Version: 14272
Segment Count: 7
Optimized: yes
Current: no
Data:  /var/solr/data/product_shard1_replica1/data
Index: /var/solr/data/product_shard1_replica1/data/index
Impl:  org.apache.solr.core.NRTCachingDirectoryFactory

c.) With the admin UI: if I query for all docs, *:*, it gives a different numFound each 
time.
e.g.
1.

{ "responseHeader":{ "zkConnected":true, "status":0, "QTime":7, "params":{ "q":"*:*", "indent":"on", "wt":"json", 
"_":"1471322871767"}}, "response":{"numFound":452300,"start":0,"maxScore":1.0, ...
2.
{ "responseHeader":{ "zkConnected":true, "status":0, "QTime":23, "params":{ "q":"*:*", "indent":"on", "wt":"json", 
"_":"1471322871767"}}, "response":{"numFound":574013,"start":0,"maxScore":1.0, ...
This is queried live from the solr instances.

It happens with any type of query, whether I search in the parent document or search 
through child documents to get parents. Sorting is used in both cases but with a 
different field: when doing a block join query, sorting is on the child document 
field, otherwise on the parent document field.

d.) I don't find any errors in the logs. All warnings only.

On 14/08/16 02:56, Jan Høydahl wrote:

Could it be that your cluster is not in sync, so that when Solr picks three 
nodes, results will vary depending on what replica answers?

A few questions:

a) Is your index static, i.e. not being updated live?
b) Can you try to go directly to the core menu of both replicas for each shard, 
and compare numDocs / maxDocs for each? Both replicas in each shard should have 
same count.
c) What are you querying on and sorting by? Does it happen with only one query 
and sorting?
d) Are there any errors in the logs?

If possible, please share some queries, responses, config, screenshots etc.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


13. aug. 2016 kl. 12.10 skrev Pranaya Behera :

Hi,
I am running Solr 6.1.0 with SolrCloud. We have 3 instances of ZooKeeper and 
3 instances of SolrCloud. All three of them are active and up. One collection 
has 3 shards, each shard has 2 replicas.

Every time I query, whether from SolrJ or the admin UI, I get inconsistent results, 
e.g.:
1. numFound is always fluctuating.
2. The facet count shows a count for a field, but a filter query on that field gets 0 
results.
3. Luke requests work (not sure whether they give correct info on all the dynamic 
fields) per shard, but not on the collection, when invoked from curl, and don't work 
when called from SolrJ.
4. The admin UI shows expanded results; the same query sent from SolrJ, 
getExpandedResults() gives 0 docs.

What could be the cause of all this? Any pointers on what errors to look for in 
the logs?






Re: Solr 6 Configuration - java.net.SocketTimeoutException

2016-08-16 Thread slee
Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-6-Configuration-java-net-SocketTimeoutException-tp4291813p4291935.html
Sent from the Solr - User mailing list archive at Nabble.com.


What's the best practices for indexing XML Content with dynamic XML Elements (SOLR 6.1) ?

2016-08-16 Thread Stan Lee
We currently have a Microsoft SQL table with an XML datatype. We use DIH to
import the XML content as is, that is, not using the XPathEntityProcessor.
If the elements of the XML content were known, XPathEntityProcessor would make
sense. Could someone kindly suggest the right way of handling such a scenario,
without impacting search performance?
Which tokenizer should we be using?


Thanks.
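
One hedged option, sketched with made-up type/field names: index the raw XML into
a text field whose analyzer strips the markup first, so the element names can stay
dynamic and only the text content gets tokenized. Solr's HTMLStripCharFilterFactory
also strips XML tags:

<fieldType name="xml_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- drop the XML tags, keep the character data -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="xmlContent" type="xml_text" indexed="true" stored="true"/>

If specific elements later need their own fields, that is the point at which
XPathEntityProcessor (or pre-processing outside DIH) becomes worth the cost.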


Re: Indexing (posting document) taking a lot of time

2016-08-16 Thread Emir Arnautovic
That is quite a big document! You need to monitor Solr to see if you are 
feeding documents fast enough or if you are saturating it with a large 
number of large requests. Play with batch size and number of threads to 
find the sweet spot. Maybe try extremes first (one doc/one thread, one doc/ 
many threads, etc.) and it might tell you more about what is slowing things 
down. If you are not using any Solr/JVM/OS monitoring tool, it will help 
you a lot to diagnose the issue. One such tool is our SPM 
(http://sematext.com/spm).


Regards,
Emir
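
As a concrete starting point, a minimal sketch of the batched, multi-threaded
feeding described above (SolrJ's ConcurrentUpdateSolrClient; the URL, core name,
and field names are placeholders):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // queue up to 1000 docs in memory, drain with 4 background threads
        try (ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient(
                "http://localhost:8983/solr/mycore", 1000, 4)) {
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title_t", "document " + i);
                client.add(doc); // returns quickly; sending happens in the background
            }
            client.blockUntilFinished(); // wait for the internal queues to drain
            client.commit();             // one explicit commit at the end
        }
    }
}

The queue size and thread count are exactly the two knobs to sweep while watching
heap/CPU/IO.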

On 16.08.2016 14:49, kshitij tyagi wrote:

400 KB is the size of a single document, and I am sending 100 documents per request.
Solr heap size is 16 GB, and it is running multithreaded.

On Tue, Aug 16, 2016 at 5:10 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Hi,

400KB/doc * 100 docs = 40MB. If you are running it single threaded, Solr
will be idle while accepting a relatively large request. Or is 400KB the size
of the 100-doc bulk that you are sending?

What is Solr's heap size? I would try increasing the number of threads and
monitoring Solr's heap/CPU/IO to see where the bottleneck is.

How complex is the fields' analysis?

Regards,
Emir


On 16.08.2016 13:25, kshitij tyagi wrote:


hi,

we are sending about 100 documents per request for indexing. We have
autocommit set to false and commit only when 1 documents are
present. Solr and the machine sending requests are in the same pool.



On Tue, Aug 16, 2016 at 4:51 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

Hi,

Do you send one doc per request? How frequently do you commit? Where is
Solr running? What is the network connection between your machine and Solr?
What are the JVM settings? Is 10-30s for the entire indexing or a single doc?

Regards,
Emir


On 16.08.2016 11:34, kshitij tyagi wrote:

Hi alexandre,

1 document of 400 KB size is taking approx. 10-30 sec, and this is
varying. I
am posting the document using curl.

On Tue, Aug 16, 2016 at 2:11 PM, Alexandre Rafalovitch <
arafa...@gmail.com>
wrote:

How many records is that and what is 'slow'? Also is this standalone or


cluster setup?

On 16 Aug 2016 6:33 PM, "kshitij tyagi" 
wrote:

Hi,


I am indexing a lot of data, about 8GB, but it is taking a lot of time. I
have read about maxBufferedDocs, ramBufferSizeMB, merge policy, etc. in
the solrconfig file.

It would be helpful if someone could help me tune the settings for
faster indexing speeds.

*I have read the docs but am not able to understand what exactly changing
these configs means.*


*Regards,*
*Kshitij*


--

Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: SolrJ for .NET / C#

2016-08-16 Thread Joe Lawson
All I have seen is SolrNET, forks of SolrNET and people using RestSharp.

On Tue, Aug 16, 2016 at 9:01 AM, Eirik Hungnes  wrote:

> Hi
>
> I have been looking around for a library for .NET / C#. We are currently
> using SolrNet, but that is of course not as well equipped as SolrJ. We have
> occasionally heard rumors that someone, also at Lucene, has been working on a
> port to other languages?
>
> --
> Best regards,
>
> Eirik
>



-- 
-Joe


Re: Need Help Resolving Unknown Shape Definition Error

2016-08-16 Thread Jennifer Coston

Thanks David! I have updated my fieldType to be:
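
(The XML itself did not survive the list archive; based on the attributes in the
original message and David's advice, it was presumably along these lines -- the
type name, and the switch to geo="true", are assumptions:)

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
    spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
    geo="true" distErrPct="0.025" maxDistErr="0.001" distanceUnits="degrees" />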



And the queries seem to be working now!

Thanks again!

Jennifer



From:   David Smiley 
To: solr-user@lucene.apache.org
Date:   08/15/2016 11:48 PM
Subject:Re: Need Help Resolving Unknown Shape Definition Error



Hello Jennifer,

The spatial documentation is largely this page:
https://cwiki.apache.org/confluence/display/solr/Spatial+Search
(however note the online version is always for the latest Solr release. You
can download a PDF versioned against your Solr version).

To do polygon searches, you need both to add the JTS jar (which you already
did) and to set the spatialContextFactory as the ref guide indicates --
which you have yet to do, and which I think is why you see that error.

Another thing I see that looks like a problem is that you set geo=false,
yet didn't set the worldBounds. Typically geo=true and you get the usual
decimal-degree +/-180, +/-90 box. But if you set it to false, then the grid
system needs to know the extent of your grid.
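
For illustration, a hedged sketch of the non-geo variant (the type name and
attribute values are placeholders):

<fieldType name="grid_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
    geo="false" worldBounds="ENVELOPE(-180, 180, 90, -90)"
    distErrPct="0.025" maxDistErr="0.001" distanceUnits="degrees" />

The worldBounds value uses the ENVELOPE(minX, maxX, maxY, minY) order and tells
the prefix tree how far the coordinate space extends.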

~ David

On Thu, Aug 11, 2016 at 4:04 PM Jennifer Coston <
jennifer.cos...@raytheon.com> wrote:

>
> Hello,
>
> I am trying to set up a local Solr core so that I can perform spatial
> searches on it. I am using version 5.2.1. I have updated my schema.xml file
> to include the location-rpt fieldType:
>
> <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
> geo="false" distErrPct="0.025" maxDistErr="0.001"
> distanceUnits="degrees" />
>
> And I have defined my field to use this type:
>
> <field name="positionWkt" type="location_rpt" indexed="true" stored="true" />
>
> I also added the jts-1.4.0.jar file to C:\solr-5.2.1\server\solr-webapp
> \webapp\WEB-INF\lib.
>
> However when I try to add a document through the Solr Admin Console I am
> seeing this response:
>
> {
>   "responseHeader": {
> "status": 400,
> "QTime": 6
>   },
>   "error": {
> "msg": "Unknown Shape definition [POLYGON((-77.23 38.922, -77.23
> 38.923, -77.228 38.923, -77.228 38.922, -77.23 38.922))]",
> "code": 400
>   }
> }
>
> I can submit documents successfully if I remove the positionWkt field. Did
> I miss a configuration step?
>
> Here is the document I am trying to add:
>
> {
> "observationId": "8e09f47f",
> "observationType": "image",
> "startTime": "2015-09-19T21:03:51Z",
> "endTime": "2015-09-19T21:03:51Z",
> "receiptTime": "2016-07-29T15:49:49.328Z",
> "locationLat": 38.9225015078814,
> "locationLon": -77.22900299194423,
> "position": "38.9225015078814,-77.22900299194423",
> "positionWkt": "POLYGON((-77.23 38.922, -77.23 38.923, -77.228
> 38.923, -77.228 38.922, -77.23 38.922))",
> "provider": "a"
> }
>
> Here are the fields I added to the schema.xml file (I started with the
> template, please let me know if you need the whole thing):
>
> observationId
>
> 
> 
> 
>  required="true" multiValued="false"/>
> 
> 
> 
> 
> 
> 
> 
>  stored="true" />
>
> Thank you!
>
> Jennifer

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


SolrJ for .NET / C#

2016-08-16 Thread Eirik Hungnes
Hi

I have been looking around for a library for .NET / C#. We are currently
using SolrNet, but that is of course not as well equipped as SolrJ. We have
occasionally heard rumors that someone, also at Lucene, has been working on a
port to other languages?

-- 
Best regards,

Eirik


Re: Indexing (posting document) taking a lot of time

2016-08-16 Thread kshitij tyagi
400 KB is the size of a single document, and I am sending 100 documents per request.
Solr heap size is 16 GB, and it is running multithreaded.

On Tue, Aug 16, 2016 at 5:10 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi,
>
> 400KB/doc * 100 docs = 40MB. If you are running it single threaded, Solr
> will be idle while accepting a relatively large request. Or is 400KB the size
> of the 100-doc bulk that you are sending?
>
> What is Solr's heap size? I would try increasing the number of threads and
> monitoring Solr's heap/CPU/IO to see where the bottleneck is.
>
> How complex is the fields' analysis?
>
> Regards,
> Emir
>
>
> On 16.08.2016 13:25, kshitij tyagi wrote:
>
>> hi,
>>
>> we are sending about 100 documents per request for indexing. We have
>> autocommit set to false and commit only when 1 documents are
>> present. Solr and the machine sending requests are in the same pool.
>>
>>
>>
>> On Tue, Aug 16, 2016 at 4:51 PM, Emir Arnautovic <
>> emir.arnauto...@sematext.com> wrote:
>>
>> Hi,
>>>
>>> Do you send one doc per request? How frequently do you commit? Where is
>>> Solr running? What is the network connection between your machine and Solr?
>>> What are the JVM settings? Is 10-30s for the entire indexing or a single doc?
>>>
>>> Regards,
>>> Emir
>>>
>>>
>>> On 16.08.2016 11:34, kshitij tyagi wrote:
>>>
>>> Hi alexandre,

 1 document of 400 KB size is taking approx. 10-30 sec, and this is
 varying. I
 am posting the document using curl.

 On Tue, Aug 16, 2016 at 2:11 PM, Alexandre Rafalovitch <
 arafa...@gmail.com>
 wrote:

 How many records is that and what is 'slow'? Also is this standalone or

> cluster setup?
>
> On 16 Aug 2016 6:33 PM, "kshitij tyagi" 
> wrote:
>
> Hi,
>
>> I am indexing a lot of data, about 8GB, but it is taking a lot of
>> time. I
>> have read about maxBufferedDocs, ramBufferSizeMB, merge policy, etc. in
>> the solrconfig file.
>>
>> It would be helpful if someone could help me tune the settings for
>> faster indexing speeds.
>>
>> *I have read the docs but am not able to understand what exactly changing
>> these configs means.*
>>
>>
>> *Regards,*
>> *Kshitij*
>>
>>
>> --
>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: solr date range query

2016-08-16 Thread GW
This query would indicate two multivalued fields

This query will return results even if you put in a value for the field
eventEnddate from 10 years ago, as long as the eventStartdate clause is
satisfied.
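
To match ongoing events you need the interval-overlap form instead: require the
event to start before the window ends and to end after the window starts. A
sketch using the field names from this thread:

eventStartdate:[* TO 2016-08-05T23:59:59.999Z] AND eventEnddate:[2016-08-02T00:00:00Z TO *]

Every event whose range intersects Aug 2nd-5th satisfies both clauses, so event3
and event4 come back along with event1 and event2.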





On 16 August 2016 at 08:16, solr2020  wrote:

> eventStartdate:[2016-08-02T00:00:00Z TO 2016-08-05T23:59:59.999Z] OR
> eventEnddate:[2016-08-02T00:00:00Z TO 2016-08-05T23:59:59.999Z]
>
> This is my query.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/solr-date-range-query-tp4291918p4291922.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solr date range query

2016-08-16 Thread solr2020
eventStartdate:[2016-08-02T00:00:00Z TO 2016-08-05T23:59:59.999Z] OR
eventEnddate:[2016-08-02T00:00:00Z TO 2016-08-05T23:59:59.999Z]

This is my query.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-date-range-query-tp4291918p4291922.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr date range query

2016-08-16 Thread GW
Can you send the query you are using?

On 16 August 2016 at 08:03, solr2020  wrote:

> Yes, the dates are stored as a single-valued date field.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/solr-date-range-query-tp4291918p4291920.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solr date range query

2016-08-16 Thread solr2020
Yes, the dates are stored as a single-valued date field.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-date-range-query-tp4291918p4291920.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr date range query

2016-08-16 Thread GW
Am I to assume these dates are stored in a single multivalued field?

On 16 August 2016 at 07:51, solr2020  wrote:

> Hi,
>
> We have a list of events with event start dates and end dates, e.g.:
> event1 starts @ 2nd Aug 2016 ends @ 3rd Aug 2016
> event2 starts @ 4th Aug 2016 ends @ 5th Aug 2016
> event3 starts @ 1st Aug 2016 ends @ 7th Aug 2016
> event4 starts @ 15th july 2016 ends @ 15th Aug 2016
>
> When a user selects a date range of Aug 2nd to Aug 5th 2016, we are able to fetch
> event1 and event2 with a start/end date range query (Aug 2nd TO Aug 5th).
> But as event3 and event4 are also ongoing events in that window, we need to
> fetch those as well. How can this be achieved?
>
> Thanks.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/solr-date-range-query-tp4291918.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


solr date range query

2016-08-16 Thread solr2020
Hi,

We have a list of events with event start dates and end dates, e.g.:
event1 starts @ 2nd Aug 2016 ends @ 3rd Aug 2016
event2 starts @ 4th Aug 2016 ends @ 5th Aug 2016
event3 starts @ 1st Aug 2016 ends @ 7th Aug 2016
event4 starts @ 15th july 2016 ends @ 15th Aug 2016

When a user selects a date range of Aug 2nd to Aug 5th 2016, we are able to fetch
event1 and event2 with a start/end date range query (Aug 2nd TO Aug 5th).
But as event3 and event4 are also ongoing events in that window, we need to
fetch those as well. How can this be achieved?

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-date-range-query-tp4291918.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing (posting document) taking a lot of time

2016-08-16 Thread Emir Arnautovic

Hi,

400KB/doc * 100 docs = 40MB. If you are running it single threaded, Solr 
will be idle while accepting a relatively large request. Or is 400KB the size 
of the 100-doc bulk that you are sending?


What is Solr's heap size? I would try increasing the number of threads and 
monitoring Solr's heap/CPU/IO to see where the bottleneck is.


How complex is the fields' analysis?

Regards,
Emir

On 16.08.2016 13:25, kshitij tyagi wrote:

hi,

we are sending about 100 documents per request for indexing. We have
autocommit set to false and commit only when 1 documents are
present. Solr and the machine sending requests are in the same pool.



On Tue, Aug 16, 2016 at 4:51 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Hi,

Do you send one doc per request? How frequently do you commit? Where is
Solr running? What is the network connection between your machine and Solr?
What are the JVM settings? Is 10-30s for the entire indexing or a single doc?

Regards,
Emir


On 16.08.2016 11:34, kshitij tyagi wrote:


Hi alexandre,

1 document of 400 KB size is taking approx. 10-30 sec, and this is varying. I
am posting the document using curl.

On Tue, Aug 16, 2016 at 2:11 PM, Alexandre Rafalovitch <
arafa...@gmail.com>
wrote:

How many records is that and what is 'slow'? Also is this standalone or

cluster setup?

On 16 Aug 2016 6:33 PM, "kshitij tyagi" 
wrote:

Hi,

I am indexing a lot of data, about 8GB, but it is taking a lot of time. I
have read about maxBufferedDocs, ramBufferSizeMB, merge policy, etc. in
the solrconfig file.

It would be helpful if someone could help me tune the settings for
faster indexing speeds.

*I have read the docs but am not able to understand what exactly changing
these configs means.*


*Regards,*
*Kshitij*



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Indexing (posting document) taking a lot of time

2016-08-16 Thread kshitij tyagi
hi,

we are sending about 100 documents per request for indexing. We have
autocommit set to false and commit only when 1 documents are
present. Solr and the machine sending requests are in the same pool.



On Tue, Aug 16, 2016 at 4:51 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi,
>
> Do you send one doc per request? How frequently do you commit? Where is
> Solr running? What is the network connection between your machine and Solr?
> What are the JVM settings? Is 10-30s for the entire indexing or a single doc?
>
> Regards,
> Emir
>
>
> On 16.08.2016 11:34, kshitij tyagi wrote:
>
>> Hi alexandre,
>>
>> 1 document of 400 KB size is taking approx. 10-30 sec, and this is varying. I
>> am posting the document using curl.
>>
>> On Tue, Aug 16, 2016 at 2:11 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> wrote:
>>
>> How many records is that and what is 'slow'? Also is this standalone or
>>> cluster setup?
>>>
>>> On 16 Aug 2016 6:33 PM, "kshitij tyagi" 
>>> wrote:
>>>
>>> Hi,

 I am indexing a lot of data, about 8GB, but it is taking a lot of time. I
 have read about maxBufferedDocs, ramBufferSizeMB, merge policy, etc. in
 the solrconfig file.

 It would be helpful if someone could help me tune the settings for
 faster indexing speeds.

 *I have read the docs but am not able to understand what exactly changing
 these configs means.*


 *Regards,*
 *Kshitij*


> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: Indexing (posting document) taking a lot of time

2016-08-16 Thread Emir Arnautovic

Hi,

Do you send one doc per request? How frequently do you commit? Where is 
Solr running? What is the network connection between your machine and Solr? 
What are the JVM settings? Is 10-30s for the entire indexing or a single doc?


Regards,
Emir

On 16.08.2016 11:34, kshitij tyagi wrote:

Hi alexandre,

1 document of 400 KB size is taking approx. 10-30 sec, and this is varying. I
am posting the document using curl.

On Tue, Aug 16, 2016 at 2:11 PM, Alexandre Rafalovitch 
wrote:


How many records is that and what is 'slow'? Also is this standalone or
cluster setup?

On 16 Aug 2016 6:33 PM, "kshitij tyagi" 
wrote:


Hi,

I am indexing a lot of data, about 8GB, but it is taking a lot of time. I
have read about maxBufferedDocs, ramBufferSizeMB, merge policy, etc. in
the solrconfig file.

It would be helpful if someone could help me tune the settings for
faster indexing speeds.

*I have read the docs but am not able to understand what exactly changing
these configs means.*


*Regards,*
*Kshitij*



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Fwd: Solr - search score and tf-idf vector from individual fields

2016-08-16 Thread govind nitk
Hi Developers,




This is a fundamental question which I was unable to resolve from the Solr help
and other related Stack Overflow questions.

I have a few hundred thousand documents which have 12 fields (to be
indexed). All of these fields contain text (each field can have text of
varying length, maybe from 10 to 5000 characters). For example, let's say
these fields are named A, B ... L (12 in all).

Now, when I search for documents, my query comes from 3 fields: X1, X2 and
X3. X1 (conceptually) closely matches fields C, D, and E. X2
(conceptually) closely matches fields F, G and J. And X3 is basically
the same field as A. But X1 and X2 should be searched for across all the
fields (including A). Just filtering against their conceptually matching
fields will not do.

So when designing the schema, my only criterion is the ranking and the
search. I also want to (can I?) get scores of my query against individual
fields. Something like this:

Query: X1 -- score against C, E, and overall score (for all returned
documents)

Query: X2 -- score against M, N, O, and overall score (for all returned
documents)

Query: X1 + X2 -- score against C, E, M, N and O, and overall score (for
all returned documents)

The reason I want those individual scores is that I want to feed them to
ML algorithms to further reshuffle/fit the rankings against a
training set.

I also want the tf-idf vector components of X1 and X2 against C, E and
M, N, O respectively.

Can anyone please let me know if this is possible?
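
Two pieces get close, sketched here with made-up core names: a per-field-group
score can be approximated by querying each conceptual group separately with
edismax and reading the scoring breakdown from the debug output:

curl "http://localhost:8983/solr/mycore/select?q=X1&defType=edismax&qf=C+E&fl=id,score&debug=results&wt=json"

and raw tf-idf components come from the TermVectorComponent (the fields need
termVectors="true" in the schema, and a handler such as /tvrh must be registered
in solrconfig.xml):

curl "http://localhost:8983/solr/mycore/tvrh?q=id:123&tv.fl=C,E&tv.tf=true&tv.df=true&tv.tf_idf=true"

Neither gives one response with per-field scores for a combined query; in
practice that means one query per field group, or parsing the debug explain tree.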


Re: Indexing (posting document) taking a lot of time

2016-08-16 Thread kshitij tyagi
Hi alexandre,

1 document of 400 KB size is taking approx. 10-30 sec, and this is varying. I
am posting the document using curl.

On Tue, Aug 16, 2016 at 2:11 PM, Alexandre Rafalovitch 
wrote:

> How many records is that and what is 'slow'? Also is this standalone or
> cluster setup?
>
> On 16 Aug 2016 6:33 PM, "kshitij tyagi" 
> wrote:
>
> > Hi,
> >
> > I am indexing a lot of data, about 8GB, but it is taking a lot of time. I
> > have read about maxBufferedDocs, ramBufferSizeMB, merge policy, etc. in
> > the solrconfig file.
> >
> > It would be helpful if someone could help me tune the settings for
> > faster indexing speeds.
> >
> > *I have read the docs but am not able to understand what exactly changing
> > these configs means.*
> >
> >
> > *Regards,*
> > *Kshitij*
> >
>


Re: Inconsistent results with solr admin ui and solrj

2016-08-16 Thread Jan Høydahl
Hi,

There is clearly something wrong when your two replicas are not in sync. Could 
you go to the “Cloud->Tree” tab of the admin UI and look in the overseer queue 
to see whether you find signs of stuck jobs or something?
Btw - what warnings do you see in the logs? Anything repeatedly popping up?

I would also try the following: 
1. Take down the node hosting replica 1 (assuming that replica2 is the correct, 
most current)
2. Manually empty the data folder
3. Take the node up again
4. Verify that a full index recovery happens, and that they get back in sync
5. Run your indexing procedure.
6. Verify that both replicas are still in sync

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 16. aug. 2016 kl. 06.51 skrev Pranaya Behera :
> 
> Hi,
> a.) Yes, the index is static, not updated live. We index new documents over old 
> documents in this sequence: delete all docs, add 10 freshly fetched from the db, 
> and after adding all the docs to the cloud instance, commit. Commit happens only once 
> per collection.
> b.) I took one shard and below are the results for the each replica, it has 2 
> replica.
> Replica - 2
> Last Modified: 33 minutes ago
> Num Docs: 127970
> Max Doc: 127970
> Heap Memory Usage: -1
> Deleted Docs: 0
> Version: 14530
> Segment Count: 5
> Optimized: yes
> Current: yes
> Data:  /var/solr/data/product_shard1_replica2/data
> Index: /var/solr/data/product_shard1_replica2/data/index.20160816040537452
> Impl:  org.apache.solr.core.NRTCachingDirectoryFactory
> 
> Replica - 1
> Last Modified: about 19 hours ago
> Num Docs: 234013
> Max Doc: 234013
> Heap Memory Usage: -1
> Deleted Docs: 0
> Version: 14272
> Segment Count: 7
> Optimized: yes
> Current: no
> Data:  /var/solr/data/product_shard1_replica1/data
> Index: /var/solr/data/product_shard1_replica1/data/index
> Impl:  org.apache.solr.core.NRTCachingDirectoryFactory
> 
> c.) With the admin UI: if I query for all docs, *:*, it gives a different numFound 
> each time.
> e.g.
> 1.
> 
> { "responseHeader":{ "zkConnected":true, "status":0, "QTime":7, "params":{ 
> "q":"*:*", "indent":"on", "wt":"json", "_":"1471322871767"}}, 
> "response":{"numFound":452300,"start":0,"maxScore":1.0, ...
> 2.
> { "responseHeader":{ "zkConnected":true, "status":0, "QTime":23, "params":{ 
> "q":"*:*", "indent":"on", "wt":"json", "_":"1471322871767"}}, 
> "response":{"numFound":574013,"start":0,"maxScore":1.0, ...
> This is queried live from the solr instances.
> 
> It happens with any type of query, whether I search in the parent document or search 
> through child documents to get parents. Sorting is used in both cases but 
> with a different field: when doing a block join query, sorting is on the child 
> document field, otherwise on the parent document field.
> 
> d.) I don't find any errors in the logs. All warnings only.
> 
> On 14/08/16 02:56, Jan Høydahl wrote:
>> Could it be that your cluster is not in sync, so that when Solr picks three 
>> nodes, results will vary depending on what replica answers?
>> 
>> A few questions:
>> 
>> a) Is your index static, i.e. not being updated live?
>> b) Can you try to go directly to the core menu of both replicas for each 
>> shard, and compare numDocs / maxDocs for each? Both replicas in each shard 
>> should have same count.
>> c) What are you querying on and sorting by? Does it happen with only one 
>> query and sorting?
>> d) Are there any errors in the logs?
>> 
>> If possible, please share some queries, responses, config, screenshots etc.
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> 13. aug. 2016 kl. 12.10 skrev Pranaya Behera :
>>> 
>>> Hi,
>>>    I am running Solr 6.1.0 with SolrCloud. We have 3 instances of ZooKeeper 
>>> and 3 instances of SolrCloud. All three of them are active and up. One 
>>> collection has 3 shards, each shard has 2 replicas.
>>> 
>>> Every time I query, whether from SolrJ or the admin UI, I get inconsistent 
>>> results, e.g.:
>>> 1. numFound is always fluctuating.
>>> 2. The facet count shows a count for a field, but a filter query on that field gets 
>>> 0 results.
>>> 3. Luke requests work (not sure whether they give correct info on all the 
>>> dynamic fields) per shard, but not on the collection, when invoked from curl, and 
>>> don't work when called from SolrJ.
>>> 4. The admin UI shows expanded results; the same query sent from SolrJ, 
>>> getExpandedResults() gives 0 docs.
>>> 
>>> What could be the cause of all this? Any pointers on what errors to look for 
>>> in the logs?
>> 
> 



Re: Indexing (posting document) taking a lot of time

2016-08-16 Thread Alexandre Rafalovitch
How many records is that and what is 'slow'? Also is this standalone or
cluster setup?

On 16 Aug 2016 6:33 PM, "kshitij tyagi"  wrote:

> Hi,
>
> I am indexing a lot of data, about 8GB, but it is taking a lot of time. I
> have read about maxBufferedDocs, ramBufferSizeMB, merge policy, etc. in
> the solrconfig file.
>
> It would be helpful if someone could help me tune the settings for
> faster indexing speeds.
>
> *I have read the docs but am not able to understand what exactly changing these
> configs means.*
>
>
> *Regards,*
> *Kshitij*
>


Indexing (posting document) taking a lot of time

2016-08-16 Thread kshitij tyagi
Hi,

I am indexing a lot of data, about 8GB, but it is taking a lot of time. I
have read about maxBufferedDocs, ramBufferSizeMB, merge policy, etc. in
the solrconfig file.

It would be helpful if someone could help me tune the settings for
faster indexing speeds.

*I have read the docs but am not able to understand what exactly changing these
configs means.*


*Regards,*
*Kshitij*
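
For reference, the knobs the question asks about live in solrconfig.xml; a hedged
sketch with illustrative values (not tuned recommendations):

<indexConfig>
  <!-- flush the in-memory buffer to a new segment at 256 MB... -->
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <!-- ...or after this many buffered docs, whichever comes first -->
  <maxBufferedDocs>100000</maxBufferedDocs>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>           <!-- hard commit at most every 60s -->
    <openSearcher>false</openSearcher> <!-- don't open a new searcher per commit -->
  </autoCommit>
</updateHandler>

A larger ramBufferSizeMB generally means fewer, larger flushes; frequent hard
commits with openSearcher=true are a common cause of slow bulk indexing.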


Delete replica on down node, after start down node, the deleted replica comes back.

2016-08-16 Thread Jerome Yang
Hi all,

I ran into some strange behavior,
both on Solr 6.1 and Solr 5.3.

For example, there are 4 nodes in cloud mode, one of them is stopped.
Then I delete a replica on the down node.
After that I start the down node.
The deleted replica comes back.

Is this normal behavior?

Same situation.
4 nodes, 1 node is down.
And I delete a collection.
After starting the down node,
replicas of that collection on the down node come back again.
And I cannot use the Collections API DELETE to delete it;
it says the collection does not exist.
But if I use the CREATE action to create a collection with the same name, it says
the collection already exists.
The only way to make things right is to clean it up manually from ZooKeeper
and the data directory.

How can I prevent this from happening?

Regards,
Jerome
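
A hedged note on why this happens: in these versions the down node still has the
core's files and core.properties on disk, so on restart it re-registers whatever
cores it finds. Once the node is back up, deleting the replica again through the
Collections API removes it for good (all names below are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node3"

For the deleted-collection case, where DELETEREPLICA refuses because the
collection no longer exists, unloading the leftover core directly should work:

curl "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=mycollection_shard1_replica1&deleteInstanceDir=true"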