Re: export with collapse filter runs into NPE

2016-06-10 Thread Joel Bernstein
Yeah, it sounds like we've got two good bugs here. Feel free to create JIRA
tickets for them; I don't believe they've been created yet. It would be
good to get these fixed for the next release.


Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 10, 2016 at 7:25 PM, Susmit Shukla 
wrote:

> Hi Joel,
>
> I would need to join results from 2 solr clouds before collapsing so it
> would not be an issue right now.
> I ran into another issue - if data in any of the shards is empty, export
> throws an error-
> Once i have atleast one document in each shard, it works fine.
>
> org.apache.solr.common.SolrException; null:java.io.IOException:
> org.apache.solr.search.SyntaxError: xport RankQuery is required for xsort:
> rq={!xport}
>
> at
>
> org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:101)
>
> at
>
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
>
> at
> org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
>
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
>
> ...
>
> Caused by: org.apache.solr.search.SyntaxError: xport RankQuery is required
> for xsort: rq={!xport}
>
> ... 26 more
>
> On Fri, Jun 10, 2016 at 1:09 PM, Joel Bernstein 
> wrote:
>
> > This sounds like a bug. I'm pretty sure there are no tests that use
> > collapse with the export handler.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Jun 10, 2016 at 3:59 PM, Susmit Shukla 
> > wrote:
> >
> > > Hi,
> > >
> > > I'm running this export query, it is working fine. f1 is the uniqueKey
> > and
> > > running solr 5.3.1
> > >
> > > /export?q=f1:term1&sort=f1+desc&fl=f1,f2
> > >
> > > if I add collapsing filter, it is giving NullPointerException
> > >
> > > /export?q=f1:term1&sort=f1+desc&fl=f1,f2&fq={!collapse field=f2}
> > >
> > > does collapsing filter work with /export handler?
> > >
> > >
> > > java.lang.NullPointerException
> > > at
> > > org.apache.lucene.util.BitSetIterator.<init>(BitSetIterator.java:58)
> > > at
> > >
> >
> org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:138)
> > > at
> > >
> >
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
> > > at
> > >
> org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
> > > at
> > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
> > > at
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
> > > at
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
> > > at
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> > > at
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> > > at
> > >
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> > > at
> > >
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> > > at
> > >
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> > > at
> > >
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> > > at
> > >
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> > > at
> > >
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> > > at
> > >
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> > > at
> > >
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> > > at
> > >
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> > > at
> > >
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> > > at
> > >
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> > > at org.eclipse.jetty.server.Server.handle(Server.java:499)
> > > at
> > > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
> > > at
> > >
> >
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> > > at
> > >
> >
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> > > at
> > >
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> > > at
> > >
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> > > at java.lang.Thread.run(Thread.java:745)
> > >
> >
>


Re: Issues with coordinates in Solr during updating of fields

2016-06-10 Thread Zheng Lin Edwin Yeo
I would like to check: what is the use of the gps_0_coordinate and
gps_1_coordinate fields then? Are they just to store the data points, or do
they have any other use?

When I do the query, I found that we are only querying the gps field, which
is something like this:
http://localhost:8983/solr/collection1/highlight?q=*:*&fq={!geofilt
pt=1.5,100.0 sfield=gps d=5}
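For reference, the same filter can also be written with the spatial parameters
passed as separate request parameters instead of inside the local-params block,
which avoids escaping the spaces (values taken from the query above; the
/highlight handler path is whatever you are already using):

http://localhost:8983/solr/collection1/highlight?q=*:*&fq={!geofilt}&sfield=gps&pt=1.5,100.0&d=5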


Regards,
Edwin

On 27 May 2016 at 08:48, Erick Erickson  wrote:

> Should be fine. When the location field is
> re-indexed (as it is with Atomic Updates)
> the two fields will be filled back in.
>
> Best,
> Erick
>
> On Thu, May 26, 2016 at 4:45 PM, Zheng Lin Edwin Yeo
>  wrote:
> > Thanks Erick for your reply.
> >
> > It works when I remove the 'stored="true" ' from the gps_0_coordinate and
> > gps_1_coordinate.
> >
> > But will this affect the search functions of the gps coordinates in the
> > future?
> >
> > Yes, I am referring to Atomic Updates.
> >
> > Regards,
> > Edwin
> >
> >
> > On 27 May 2016 at 02:02, Erick Erickson  wrote:
> >
> >> Try removing the 'stored="true" ' from the gps_0_coordinate and
> >> gps_1_coordinate.
> >>
> >> When you say "...tried to do an update on any other fileds" I'm assuming
> >> you're
> >> talking about Atomic Updates, which require that the destinations of
> >> copyFields are single valued. Under the covers the location type is
> >> split and copied to the other two fields so I suspect that's what's
> going
> >> on.
> >>
> >> And you could also try one of the other types, see:
> >> https://cwiki.apache.org/confluence/display/solr/Spatial+Search
> >>
> >> Best,
> >> Erick
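For reference, a minimal sketch of the schema shape Erick is describing above,
assuming the stock location type backed by tdouble sub-fields (the exact types
and names in Edwin's schema may differ):

<field name="gps" type="location" indexed="true" stored="true"/>
<field name="gps_0_coordinate" type="tdouble" indexed="true" stored="false"/>
<field name="gps_1_coordinate" type="tdouble" indexed="true" stored="false"/>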
> >>
> >> On Thu, May 26, 2016 at 1:46 AM, Zheng Lin Edwin Yeo
> >>  wrote:
> >> > Anyone has any solutions to this problem?
> >> >
> >> > I tried to remove the gps_0_coordinate and gps_1_coordinate, but I
> will
> >> get
> >> > the following error during indexing.
> >> > ERROR: [doc=id1] unknown field 'gps_0_coordinate'
> >> >
> >> > Regards,
> >> > Edwin
> >> >
> >> >
> >> > On 25 May 2016 at 11:37, Zheng Lin Edwin Yeo 
> >> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I have an implementation of storing the coordinates in Solr during
> >> >> indexing.
> >> >> During indexing, I will only store the value in the field name
> ="gps".
> >> For
> >> >> the field name = "gps_0_coordinate" and "gps_1_coordinate", the value
> >> will
> >> >> be auto filled and indexed from the "gps" field.
> >> >>
> >> >> <field name="gps" type="location" indexed="true" stored="true" required="false"/>
> >> >> <field name="gps_0_coordinate" type="tdouble" indexed="true" stored="true" required="false"/>
> >> >> <field name="gps_1_coordinate" type="tdouble" indexed="true" stored="true" required="false"/>
> >> >>
> >> >> But when I tried to do an update on any other fields in the index,
> Solr
> >> >> will try to add another value in the "gps_0_coordinate" and
> >> >> "gps_1_coordinate". However, as these 2 fields are not multi-Valued,
> it
> >> >> will lead to an error:
> >> >> multiple values encountered for non multiValued field
> gps_0_coordinate:
> >> >> [1.0,1.0]
> >> >>
> >> >> Does anyone knows how we can solve this issue?
> >> >>
> >> >> I am using Solr 5.4.0
> >> >>
> >> >> Regards,
> >> >> Edwin
> >> >>
> >>
>


Re: export with collapse filter runs into NPE

2016-06-10 Thread Susmit Shukla
Hi Joel,

I would need to join results from 2 solr clouds before collapsing so it
would not be an issue right now.
I ran into another issue - if data in any of the shards is empty, export
throws an error:
Once I have at least one document in each shard, it works fine.

org.apache.solr.common.SolrException; null:java.io.IOException:
org.apache.solr.search.SyntaxError: xport RankQuery is required for xsort:
rq={!xport}

at
org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:101)

at
org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)

at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)

at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)

...

Caused by: org.apache.solr.search.SyntaxError: xport RankQuery is required
for xsort: rq={!xport}

... 26 more

On Fri, Jun 10, 2016 at 1:09 PM, Joel Bernstein  wrote:

> This sounds like a bug. I'm pretty sure there are no tests that use
> collapse with the export handler.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Jun 10, 2016 at 3:59 PM, Susmit Shukla 
> wrote:
>
> > Hi,
> >
> > I'm running this export query, it is working fine. f1 is the uniqueKey
> and
> > running solr 5.3.1
> >
> > /export?q=f1:term1&sort=f1+desc&fl=f1,f2
> >
> > if I add collapsing filter, it is giving NullPointerException
> >
> > /export?q=f1:term1&sort=f1+desc&fl=f1,f2&fq={!collapse field=f2}
> >
> > does collapsing filter work with /export handler?
> >
> >
> > java.lang.NullPointerException
> > at
> > org.apache.lucene.util.BitSetIterator.<init>(BitSetIterator.java:58)
> > at
> >
> org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:138)
> > at
> >
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
> > at
> > org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
> > at
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> > at
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> > at
> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> > at org.eclipse.jetty.server.Server.handle(Server.java:499)
> > at
> > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
> > at
> >
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> > at
> >
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> > at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> > at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> > at java.lang.Thread.run(Thread.java:745)
> >
>


RE: using spell check on phrases

2016-06-10 Thread Dyer, James
Kaveh,

If your query has "mm" set to zero or a low value, then you may want to 
override this when the spellchecker checks possible collations.  For example:

spellcheck.collateParam.mm=100%

You may also want to consider adding "spellcheck.maxResultsForSuggest" to your 
query, so that it will return spelling suggestions even when the query returns 
some results.  Also if you set "spellcheck.alternativeTermCount", then it will 
try to correct all of the query keywords, including those that exist in the 
dictionary.

See https://cwiki.apache.org/confluence/display/solr/Spell+Checking for more 
information.
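As a rough sketch, those parameters might be combined on a single request like
this (the mm and count values are only illustrative):

q=...&defType=edismax&mm=1
&spellcheck=true
&spellcheck.collate=true
&spellcheck.collateParam.mm=100%
&spellcheck.maxResultsForSuggest=5
&spellcheck.alternativeTermCount=3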

James Dyer
Ingram Content Group

-Original Message-
From: kaveh minooie [mailto:ka...@plutoz.com] 
Sent: Monday, June 06, 2016 8:19 PM
To: solr-user@lucene.apache.org
Subject: using spell check on phrases

Hi everyone

I am using Solr 6 with DirectSolrSpellChecker and the edismax parser. The
problem that I am having is that when the query is a phrase, every
single word in the phrase needs to be misspelled for the spell checker to
get activated and give suggestions. If only one of the words is
misspelled then it just says that the spelling is correct:
<bool name="correctlySpelled">true</bool>

I was wondering if anyone has encountered this situation before and 
knows how to solve it?

thanks,

-- 
Kaveh Minooie



RE: Questions regarding re-index when using Solr as a data source

2016-06-10 Thread Hui Liu
Thank you Walter.

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Friday, June 10, 2016 3:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions regarding re-index when using Solr as a data source

Those are brand new features that I have not used, so I can’t comment on them.

But I know they do not make Solr into a database.

If you need a transactional database that can support search, you probably want 
MarkLogic. I worked at MarkLogic for a couple of years. In some ways, MarkLogic 
is like Solr, but the support for transactions goes very deep. It is not 
something you can put on top of a search engine.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 10, 2016, at 12:39 PM, Hui Liu  wrote:
> 
> What if we plan to use Solr version 6.x? this url says it support 2 different 
> update modes: atomic update and optimistic concurrency:
> 
> https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
> 
> I tested 'optimistic concurrency' and it appears to be working, i.e if a 
> document I am updating got changed by another person I will get error if I 
> supply a _version_ value, So maybe you are referring to an older version of 
> Solr?
> 
> Regards,
> Hui
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Friday, June 10, 2016 11:18 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Questions regarding re-index when using Solr as a data source
> 
> Solr does not have transactions at all. The “commit” is really “submit batch”.
> 
> Solr does not have update. You can add, delete, or replace an entire document.
> 
> There is no optimistic concurrency control because there is no concurrency 
> control. Clients can concurrently add documents to a batch, then any client 
> can submit the entire batch.
> 
> Replication is not transactional. Replication is a file copy of the 
> underlying indexes (classic) or copying the documents in a batch (Solr Cloud).
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Jun 10, 2016, at 7:41 AM, Hui Liu  wrote:
>> 
>> Walter,
>> 
>>  Thank you for your advice. We are new to Solr and have been using 
>> Oracle for past 10+ years, so we are used to the idea of having a tool that 
>> can be used as both data store and also searchable by having indexes on top 
>> of it. I guess the reason we are considering Solr as data store is due to it 
>> has some features of a database that our application requires, such as 1) be 
>> able to detect duplicate record by having a unique field; 2) allow us to do 
>> concurrent update by using Optimistic concurrency control feature; 3) its 
>> 'replication' feature allowing us to store multiple copies of data; so if we 
>> were to use a file system, we will not have the above features (at least not 
>> 1 and 2) and have to implement those ourselves. The other option is to pick 
>> another database tool such as Mysql or Cassandra, then we will need to learn 
>> and support an additional tool besides Solr; but you brought up several very 
>> good points about operational factors we should consider if we pick Solr as 
>> a data store. Also our application is more of a OLTP than OLAP. I will 
>> update our colleagues and stakeholders about these concerns. Thanks again!
>> 
>> Regards,
>> Hui
>> -Original Message-
>> From: Walter Underwood [mailto:wun...@wunderwood.org] 
>> Sent: Thursday, June 09, 2016 1:24 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Questions regarding re-index when using Solr as a data source
>> 
>> In the HowToReindex page, under “Using Solr as a Data Store”, it says this: 
>> "Don't do this unless you have no other option. Solr is not really designed 
>> for this role.” So don’t start by planning to do this.
>> 
>> Using a second copy of Solr is still using Solr as a repository. That 
>> doesn’t satisfy any sort of requirements for disaster recovery. How do you 
>> know that data is good? How do you make a third copy? How do you roll back 
>> to a previous version? How do you deal with a security breach that affects 
>> all your systems? Are the systems in the same data center? How do you deal 
>> with ransomware (U. of Calgary paid $20K yesterday)?
>> 
>> If a consultant suggested this to me, I’d probably just give up and get a 
>> different consultant.
>> 
>> Here is what we do for batch loading.
>> 
>> 1. For each Solr collection, we define a JSONL feed format, with a JSON 
>> Schema.
>> 2. The owners of the data write an extractor to pull the data out of 
>> wherever it is, then generate the JSON feed.
>> 3. We validate the JSON feed against the JSON schema.
>> 4. If the feed is valid, we save it to Amazon S3 along with a manifest which 
>> lists the version of the JSON Schema.
>> 5. Then a multi-threaded loader reads the feed and sends it to Solr.

Re: OT: is Heliosearch discontinued?

2016-06-10 Thread tedsolr
That's fantastic! Thanks Joel



--
View this message in context: 
http://lucene.472066.n3.nabble.com/OT-is-Heliosearch-discontinued-tp4242345p4281792.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: export with collapse filter runs into NPE

2016-06-10 Thread Joel Bernstein
This sounds like a bug. I'm pretty sure there are no tests that use
collapse with the export handler.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 10, 2016 at 3:59 PM, Susmit Shukla 
wrote:

> Hi,
>
> I'm running this export query, it is working fine. f1 is the uniqueKey and
> running solr 5.3.1
>
> /export?q=f1:term1&sort=f1+desc&fl=f1,f2
>
> if I add collapsing filter, it is giving NullPointerException
>
> /export?q=f1:term1&sort=f1+desc&fl=f1,f2&fq={!collapse field=f2}
>
> does collapsing filter work with /export handler?
>
>
> java.lang.NullPointerException
> at
> org.apache.lucene.util.BitSetIterator.<init>(BitSetIterator.java:58)
> at
> org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:138)
> at
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
> at
> org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.eclipse.jetty.server.Server.handle(Server.java:499)
> at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:745)
>


Re: OT: is Heliosearch discontinued?

2016-06-10 Thread Joel Bernstein
You can actually find those old articles on https://archive.org/web/. I
haven't gone back and collected the writings to repost.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 10, 2016 at 3:31 PM, tedsolr  wrote:

> There were many great white papers hosted on that old site. Does anyone
> know
> if they were moved? I've got lots of broken links - I wish I could get to
> that reference material.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/OT-is-Heliosearch-discontinued-tp4242345p4281773.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


export with collapse filter runs into NPE

2016-06-10 Thread Susmit Shukla
Hi,

I'm running this export query, it is working fine. f1 is the uniqueKey and
running solr 5.3.1

/export?q=f1:term1&sort=f1+desc&fl=f1,f2

if I add collapsing filter, it is giving NullPointerException

/export?q=f1:term1&sort=f1+desc&fl=f1,f2&fq={!collapse field=f2}

does collapsing filter work with /export handler?


java.lang.NullPointerException
at org.apache.lucene.util.BitSetIterator.<init>(BitSetIterator.java:58)
at 
org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:138)
at 
org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
at 
org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)


Re: Questions regarding re-index when using Solr as a data source

2016-06-10 Thread Walter Underwood
Those are brand new features that I have not used, so I can’t comment on them.

But I know they do not make Solr into a database.

If you need a transactional database that can support search, you probably want 
MarkLogic. I worked at MarkLogic for a couple of years. In some ways, MarkLogic 
is like Solr, but the support for transactions goes very deep. It is not 
something you can put on top of a search engine.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 10, 2016, at 12:39 PM, Hui Liu  wrote:
> 
> What if we plan to use Solr version 6.x? this url says it support 2 different 
> update modes: atomic update and optimistic concurrency:
> 
> https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
> 
> I tested 'optimistic concurrency' and it appears to be working, i.e if a 
> document I am updating got changed by another person I will get error if I 
> supply a _version_ value, So maybe you are referring to an older version of 
> Solr?
> 
> Regards,
> Hui
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Friday, June 10, 2016 11:18 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Questions regarding re-index when using Solr as a data source
> 
> Solr does not have transactions at all. The “commit” is really “submit batch”.
> 
> Solr does not have update. You can add, delete, or replace an entire document.
> 
> There is no optimistic concurrency control because there is no concurrency 
> control. Clients can concurrently add documents to a batch, then any client 
> can submit the entire batch.
> 
> Replication is not transactional. Replication is a file copy of the 
> underlying indexes (classic) or copying the documents in a batch (Solr Cloud).
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Jun 10, 2016, at 7:41 AM, Hui Liu  wrote:
>> 
>> Walter,
>> 
>>  Thank you for your advice. We are new to Solr and have been using 
>> Oracle for past 10+ years, so we are used to the idea of having a tool that 
>> can be used as both data store and also searchable by having indexes on top 
>> of it. I guess the reason we are considering Solr as data store is due to it 
>> has some features of a database that our application requires, such as 1) be 
>> able to detect duplicate record by having a unique field; 2) allow us to do 
>> concurrent update by using Optimistic concurrency control feature; 3) its 
>> 'replication' feature allowing us to store multiple copies of data; so if we 
>> were to use a file system, we will not have the above features (at least not 
>> 1 and 2) and have to implement those ourselves. The other option is to pick 
>> another database tool such as Mysql or Cassandra, then we will need to learn 
>> and support an additional tool besides Solr; but you brought up several very 
>> good points about operational factors we should consider if we pick Solr as 
>> a data store. Also our application is more of a OLTP than OLAP. I will 
>> update our colleagues and stakeholders about these concerns. Thanks again!
>> 
>> Regards,
>> Hui
>> -Original Message-
>> From: Walter Underwood [mailto:wun...@wunderwood.org] 
>> Sent: Thursday, June 09, 2016 1:24 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Questions regarding re-index when using Solr as a data source
>> 
>> In the HowToReindex page, under “Using Solr as a Data Store”, it says this: 
>> "Don't do this unless you have no other option. Solr is not really designed 
>> for this role.” So don’t start by planning to do this.
>> 
>> Using a second copy of Solr is still using Solr as a repository. That 
>> doesn’t satisfy any sort of requirements for disaster recovery. How do you 
>> know that data is good? How do you make a third copy? How do you roll back 
>> to a previous version? How do you deal with a security breach that affects 
>> all your systems? Are the systems in the same data center? How do you deal 
>> with ransomware (U. of Calgary paid $20K yesterday)?
>> 
>> If a consultant suggested this to me, I’d probably just give up and get a 
>> different consultant.
>> 
>> Here is what we do for batch loading.
>> 
>> 1. For each Solr collection, we define a JSONL feed format, with a JSON 
>> Schema.
>> 2. The owners of the data write an extractor to pull the data out of 
>> wherever it is, then generate the JSON feed.
>> 3. We validate the JSON feed against the JSON schema.
>> 4. If the feed is valid, we save it to Amazon S3 along with a manifest which 
>> lists the version of the JSON Schema.
>> 5. Then a multi-threaded loader reads the feed and sends it to Solr.
>> 
>> Reloading is safe and easy, because all the feeds in S3 are valid.
>> 
>> Storing backups in S3 instead of running a second Solr is massively cheaper, 
>> easier, and safer.
>> 
>> We also have a clear contract between the content owners 

Re: Simulate doc linking via post filter cache check

2016-06-10 Thread tedsolr
The terms component will not work for me because it holds on to terms from
deleted documents. My indexes are too volatile.

I could perform a search for every match - but that would not perform well.
Maybe I need something that can compare two searches. Anyone know of an
existing filter component that does something similar?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Simulate-doc-linking-via-post-filter-cache-check-tp4275842p4281783.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Questions regarding re-index when using Solr as a data source

2016-06-10 Thread Hui Liu
What if we plan to use Solr version 6.x? This url says it supports 2 different
update modes: atomic update and optimistic concurrency:

https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents

I tested 'optimistic concurrency' and it appears to be working, i.e. if a
document I am updating got changed by another person I will get an error if I
supply a _version_ value. So maybe you are referring to an older version of
Solr?
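For anyone skimming the thread, a minimal sketch of that optimistic-concurrency
check: POST the JSON below to /solr/<collection>/update?commit=true with
Content-Type: application/json (the collection and field names are made up; the
_version_ value is whatever was returned when the document was last read):

[ {"id": "doc1",
   "status_s": "processed",
   "_version_": 1234567890123456789} ]

If another client has changed doc1 since that version was read, Solr rejects the
update with an HTTP 409 version conflict instead of silently overwriting it.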

Regards,
Hui

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Friday, June 10, 2016 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Questions regarding re-index when using Solr as a data source

Solr does not have transactions at all. The “commit” is really “submit batch”.

Solr does not have update. You can add, delete, or replace an entire document.

There is no optimistic concurrency control because there is no concurrency 
control. Clients can concurrently add documents to a batch, then any client can 
submit the entire batch.

Replication is not transactional. Replication is a file copy of the underlying 
indexes (classic) or copying the documents in a batch (Solr Cloud).

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 10, 2016, at 7:41 AM, Hui Liu  wrote:
> 
> Walter,
> 
>   Thank you for your advice. We are new to Solr and have been using 
> Oracle for past 10+ years, so we are used to the idea of having a tool that 
> can be used as both data store and also searchable by having indexes on top 
> of it. I guess the reason we are considering Solr as data store is due to it 
> has some features of a database that our application requires, such as 1) be 
> able to detect duplicate record by having a unique field; 2) allow us to do 
> concurrent update by using Optimistic concurrency control feature; 3) its 
> 'replication' feature allowing us to store multiple copies of data; so if we 
> were to use a file system, we will not have the above features (at least not 
> 1 and 2) and have to implement those ourselves. The other option is to pick 
> another database tool such as Mysql or Cassandra, then we will need to learn 
> and support an additional tool besides Solr; but you brought up several very 
> good points about operational factors we should consider if we pick Solr as a 
> data store. Also our application is more of a OLTP than OLAP. I will update 
> our colleagues and stakeholders about these concerns. Thanks again!
> 
> Regards,
> Hui
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Thursday, June 09, 2016 1:24 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Questions regarding re-index when using Solr as a data source
> 
> In the HowToReindex page, under “Using Solr as a Data Store”, it says this: 
> "Don't do this unless you have no other option. Solr is not really designed 
> for this role.” So don’t start by planning to do this.
> 
> Using a second copy of Solr is still using Solr as a repository. That doesn’t 
> satisfy any sort of requirements for disaster recovery. How do you know that 
> data is good? How do you make a third copy? How do you roll back to a 
> previous version? How do you deal with a security breach that affects all 
> your systems? Are the systems in the same data center? How do you deal with 
> ransomware (U. of Calgary paid $20K yesterday)?
> 
> If a consultant suggested this to me, I’d probably just give up and get a 
> different consultant.
> 
> Here is what we do for batch loading.
> 
> 1. For each Solr collection, we define a JSONL feed format, with a JSON 
> Schema.
> 2. The owners of the data write an extractor to pull the data out of wherever 
> it is, then generate the JSON feed.
> 3. We validate the JSON feed against the JSON schema.
> 4. If the feed is valid, we save it to Amazon S3 along with a manifest which 
> lists the version of the JSON Schema.
> 5. Then a multi-threaded loader reads the feed and sends it to Solr.
> 
> Reloading is safe and easy, because all the feeds in S3 are valid.
> 
> Storing backups in S3 instead of running a second Solr is massively cheaper, 
> easier, and safer.
> 
> We also have a clear contract between the content owners and the search team. 
> That contract is enforced by the JSON Schema on every single batch.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Jun 9, 2016, at 9:51 AM, Hui Liu  wrote:
>> 
>> Hi Walter,
>> 
>> Thank you for the reply, sorry I need to clarify what I mean by 'migrate 
>> tables' from Oracle to Solr, we are not literally move existing records from 
>> Oracle to Solr, instead, we are building a new application directly feed 
>> data into Solr as document and fields, in parallel of another existing 
>> application which feeds the same data into Oracle tables/columns, of course, 
>> the Solr schema will be 

Re: Re-create shard with compositeId router and known hash range

2016-06-10 Thread Henrik Brautaset Aronsen
On Fri, Jun 10, 2016 at 6:18 PM, Erick Erickson 
wrote:

> Well, how brave do you want to be ;)?


Hi Erick, thanks for your reply!


> There's no great magic to the
> Zookeeper nodes here. If you do everything just right you could create
> one manually. By that I mean you could "hand edit" the znode with the
> Zookeeper commands, you'd have to dig for the exact commands


So, if I create (or edit) the correct entries in ZK, Solr should just pick
that up and behave accordingly?  I thought I had to do this through the
Solr API.  I think I'll experiment some more with this.


> You _may_ be able to use the ADDREPLICA command, assuming that the shard
> information is still in the ZK node. I haven't tried this however.
>

The shard information is gone from zookeeper (I guess that's what you mean
by ZK node?), and I can't specify hash ranges through the ADDREPLICA
command.


> All that said, if the node is somehow permanently gone, you have to
> re-index anyway to get the data back so recreating the collection
> would be less fooling around.
>

I'm not really interested in the data, since all data in the collection has
a TTL of 30 minutes.

I ended up re-creating the collection even though it gave me a couple of
minutes downtime.  If this happens again, it would be awesome if I could
 manually create the shards with the specified hash ranges.

Cheers,
Henrik


Re: OT: is Heliosearch discontinued?

2016-06-10 Thread tedsolr
There were many great white papers hosted on that old site. Does anyone know
if they were moved? I've got lots of broken links - I wish I could get to
that reference material.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/OT-is-Heliosearch-discontinued-tp4242345p4281773.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Schema for same field names within different input entities

2016-06-10 Thread Aniruddh Sharma
Thanks a lot Eric.

Thanks and Regards
Aniruddh

On Fri, Jun 10, 2016 at 12:25 PM, Erick Erickson 
wrote:

> Usually people put an application layer between the Business User and
> the actual query to form complex Solr queries that "do the right
> thing". Unfortunately there's no good automated ways to do this that I
> know of as each app has its own set of peculiarities.
>
> Best,
> Erick
>
> On Wed, Jun 8, 2016 at 2:28 PM, Aniruddh Sharma 
> wrote:
> > Hi Eric
> >
> > Thanks for prompt response. The reason for not flattening in given format
> > was (this I used as example for a very simple data structure). But in
> > actual my record has 100 of fields like this with different nesting
> inside.
> >
> > and once I ingest data in Solr , then Business User will make a search
> > rather than a IT person and Business User needs to have some simple
> mapping
> > to understand new field schema on which they can query.
> >
> > As its end goal is to be used by Business User , and my input record has
> > multiple parameters of nesting. How can I deal with this situation.
> >
> > Thanks and Regards
> > Aniruddh
> >
> > On Wed, Jun 8, 2016 at 5:20 PM, Erick Erickson 
> > wrote:
> >
> >> Why not just flatten this? I.e. have fields
> >> prev_temp
> >> day_temp
> >> next_temp
> >> prev_humidity
> >> day_humitidy
> >> next_humidity
> >> ?
> >>
> >> If you use multiValued fields, there's no good way to
> >> express
> >> prev_temp=X AND prev_humidity=Y
> >> because they'd both be in a single MV field called "temp"
> >> and "humidity"
> >> so querying
> >> temp=X and humidity=Y could match
> >> the previous day's temp and the next day's humidity.
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Jun 8, 2016 at 1:52 PM, Aniruddh Sharma 
> >> wrote:
> >> > Hi Susheel
> >> >
> >> > Thanks for prompt response.
> >> >
> >> > I have a further query on it.  Wouldn't above mentioned approach be
> >> > appropriate if I am either getting PreviousDay or CurrentDay.
> >> >
> >> > In my case I will sometimes be getting both PreviousDay and
> CurrentDay in
> >> > same record. so when I store temp/humidity as multi-valued it wouldn't
> >> know
> >> > whether I have stored for previousDay or currentDay.
> >> >
> >> > Kindly guide me if I misunderstand.
> >> >
> >> > Thanks and Regards
> >> > Aniruddh
> >> >
> >> > On Wed, Jun 8, 2016 at 4:41 PM, Susheel Kumar 
> >> wrote:
> >> >
> >> >> How about creating schema with temperature, humidity & a day field
> (and
> >> >> other fields you may have like zipcode/city/country etc). Put
> >> day="next" or
> >> >> day="previous" and during query use fq (filter query) to have
> >> >> fq=day:previous or fq=day:next.
> >> >>
> >> >> Thanks,
> >> >> Susheel
> >> >>
> >> >> On Wed, Jun 8, 2016 at 2:46 PM, Aniruddh Sharma <
> asharma...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> > Hi
> >> >> >
> >> >> > Request help
> >> >> >
> >> >> > I have following XML data to start with
> >> >> >
> >> >> > <previousDay>
> >> >> >   <temperature>13</temperature>
> >> >> >   <humidity>50</humidity>
> >> >> > </previousDay>
> >> >> > <nextDay>
> >> >> >   <temperature>15</temperature>
> >> >> >   <humidity>60</humidity>
> >> >> > </nextDay>
> >> >> >
> >> >> >
> >> >> > Please notice it has "previousDay" and "nextDay" and both of them
> >> >> contains
> >> >> > details of same field "temperature" and "humidity"
> >> >> >
> >> >> > What is best way to create schema for it , where I could query for
> >> >> > temperature on previousDay as well as on currentDay
> >> >> >
> >> >> >
> >> >> >
> >> >> > Thanks and Regards
> >> >> > Aniruddh
> >> >> >
> >> >>
> >>
>


Re: Solr Schema for same field names within different input entities

2016-06-10 Thread Erick Erickson
Usually people put an application layer between the Business User and
the actual query to form complex Solr queries that "do the right
thing". Unfortunately there are no good automated ways to do this that I
know of, as each app has its own set of peculiarities.
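As a concrete (hypothetical) sketch of the flattening suggested earlier in the
thread, the sample record could be indexed as one flat document, with that
application layer translating business-friendly names into these field names at
query time:

{ "id": "station42-20160608",
  "prev_temperature": 13, "prev_humidity": 50,
  "next_temperature": 15, "next_humidity": 60 }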

Best,
Erick

On Wed, Jun 8, 2016 at 2:28 PM, Aniruddh Sharma  wrote:
> Hi Eric
>
> Thanks for prompt response. The reason for not flattening in given format
> was (this I used as example for a very simple data structure). But in
> actual my record has 100 of fields like this with different nesting inside.
>
> and once I ingest data in Solr , then Business User will make a search
> rather than a IT person and Business User needs to have some simple mapping
> to understand new field schema on which they can query.
>
> As its end goal is to be used by Business User , and my input record has
> multiple parameters of nesting. How can I deal with this situation.
>
> Thanks and Regards
> Aniruddh
>
> On Wed, Jun 8, 2016 at 5:20 PM, Erick Erickson 
> wrote:
>
>> Why not just flatten this? I.e. have fields
>> prev_temp
>> day_temp
>> next_temp
>> prev_humidity
>> day_humitidy
>> next_humidity
>> ?
>>
>> If you use multiValued fields, there's no good way to
>> express
>> prev_temp=X AND prev_humidity=Y
>> because they'd both be in a single MV field called "temp"
>> and "humidity"
>> so querying
>> temp=X and humidity=Y could match
>> the previous day's temp and the next day's humidity.
>>
>> Best,
>> Erick
>>
>> On Wed, Jun 8, 2016 at 1:52 PM, Aniruddh Sharma 
>> wrote:
>> > Hi Susheel
>> >
>> > Thanks for prompt response.
>> >
>> > I have a further query on it.  Wouldn't above mentioned approach be
>> > appropriate if I am either getting PreviousDay or CurrentDay.
>> >
>> > In my case I will sometimes be getting both PreviousDay and CurrentDay in
>> > same record. so when I store temp/humidity as multi-valued it wouldn't
>> know
>> > whether I have stored for previousDay or currentDay.
>> >
>> > Kindly guide me if I misunderstand.
>> >
>> > Thanks and Regards
>> > Aniruddh
>> >
>> > On Wed, Jun 8, 2016 at 4:41 PM, Susheel Kumar 
>> wrote:
>> >
>> >> How about creating schema with temperature, humidity & a day field (and
>> >> other fields you may have like zipcode/city/country etc). Put
>> day="next" or
>> >> day="previous" and during query use fq (filter query) to have
>> >> fq=day:previous or fq=day:next.
>> >>
>> >> Thanks,
>> >> Susheel
>> >>
>> >> On Wed, Jun 8, 2016 at 2:46 PM, Aniruddh Sharma 
>> >> wrote:
>> >>
>> >> > Hi
>> >> >
>> >> > Request help
>> >> >
>> >> > I have following XML data to start with
>> >> >
>> >> > 
>> >> >
>> >> >   13
>> >> >   50
>> >> > 
>> >> >
>> >> >   15
>> >> >   60
>> >> > 
>> >> > 
>> >> >
>> >> >
>> >> > Please notice it has "previousDay" and "nextDay" and both of them
>> >> contains
>> >> > details of same field "temperature" and "humidity"
>> >> >
>> >> > What is best way to create schema for it , where I could query for
>> >> > temperature on previousDay as well as on currentDay
>> >> >
>> >> >
>> >> >
>> >> > Thanks and Regards
>> >> > Aniruddh
>> >> >
>> >>
>>


Re: Query exact match with ASCIIFoldingFilterFactory

2016-06-10 Thread Erick Erickson
What query are you using? From what you've shown, the exact match
should work. Perhaps use a phrase query?

And while the analyzer is very cool, it has its limitations,
particularly it doesn't show the interactions with the _parser_. So
add &debug=query to the URL and look at the parsed_query bits of the
output; that may show you that the query isn't quite being parsed the
way you expect.
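For example, using the field and phrase from the original post (the handler path
is whatever Sergio is already querying):

/select?q=docContent:"dq/ex report"&debug=query

Then inspect the parsed query entries in the debug section of the response.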

Best,
Erick

On Wed, Jun 8, 2016 at 9:04 AM, marotosg  wrote:
> Hi all,
>
> I am trying to query and match on a collection of documents with a field
> which is basically text coming from pdfs. It could contain any type of text.
>
> field type
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="solr.WordDelimiterFilterFactory"
>       generateWordParts="1" generateNumberParts="1" catenateWords="1"
>       catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
>       preserveOriginal="1"/>
>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="solr.WordDelimiterFilterFactory"
>       generateWordParts="0" generateNumberParts="0" catenateWords="0"
>       catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
>       preserveOriginal="1"/>
>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
>   </analyzer>
> </fieldType>
>
>
> It works well in general but i have one use case is not working and I don't
> know how to solve it.
> when I try to make an exact match like below.
> q=docContent:"dq/ex report"
>
> It can't find the match because the worddelimiter is separating the
> positions on the index but not in the query as I don't want to retrieve
> false positives.
>
> Result from analyser
> Index: dq/ex dq ex dqex report
> Query: dq/exreport
>
> Is it possible to use the same functionality but make exact match.
>
> Thanks
> Sergio
>
>
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Query-exact-match-with-ASCIIFoldingFilterFactory-tp4281256.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Bug in ExtractingRequestHandler

2016-06-10 Thread Gilbert Boyreau

Hello,

I think there's a bug in the ExtractingRequestHandler handler (the Tika
parser). Some Tika exceptions are not caught, and the handler returns a 0
status, indicating there was no problem with that content.

I took a look at the code (Solr 5.1, ExtractingDocumentLoader:221): only
TikaException is caught and sent back as a SolrException.

The problem still remains on Solr 5.5.
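To make the suggestion concrete, here is a rough sketch of the kind of change
being proposed (this is not the actual Solr source, and it only helps if these
exceptions really do propagate out of the Tika parse call rather than being
swallowed inside PDFBox):

try {
  parser.parse(inputStream, parsingHandler, metadata, context);
} catch (TikaException | IOException | RuntimeException e) {
  // previously only TikaException was rethrown; other parse failures
  // fell through and the request still reported a 0 status
  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, e);
}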

Here are the two stack traces:

java.io.IOException :
ERROR - 2016-06-10 14:12:03.932; [ centreinffo] 
org.apache.pdfbox.filter.FlateFilter; FlateFilter: stop reading 
corrupt stream due to a DataFormatException
INFO  - 2016-06-10 14:12:03.940; [   centreinffo] 
org.apache.solr.update.processor.LogUpdateProcessor; [centreinffo] 
webapp=/solr path=/update/extract 
params={fmap.content=contenuDocument=tika_=document_Régionsetformation_280=javabin=/var/local/ci-services/documents/document_Régionsetformation_280=2} 
{add=[document_Régionsetformation_280 (1536759351407017984)]} 0 74

and  java.io.EOFException
ERROR - 2016-06-10 14:10:49.246; [ centreinffo] 
org.apache.fontbox.ttf.TrueTypeFont; An error occured when reading 
table hmtx

java.io.EOFException
at 
org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:139)
at 
org.apache.fontbox.ttf.HorizontalMetricsTable.initData(HorizontalMetricsTable.java:62)
at 
org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
at 
org.apache.fontbox.ttf.TrueTypeFont.getHorizontalMetrics(TrueTypeFont.java:204)
at 
org.apache.fontbox.ttf.TrueTypeFont.getAdvanceWidth(TrueTypeFont.java:346)
at 
org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:677)
at 
org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
at 
org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:411)
at 
org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
at 
org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
at 
org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
at 
org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at 
org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at 
org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
at 
org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
at 
org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)

at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:221)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) 


   ...
INFO  - 2016-06-10 14:10:50.207; [   centreinffo] 
org.apache.solr.update.processor.LogUpdateProcessor; [centreinffo] 
webapp=/solr path=/update/extract 
params={fmap.content=contenuDocument=tika_=document_Régionsetformation_600=javabin=/var/local/ci-services/documents/document_Régionsetformation_600=2} 
{add=[document_Régionsetformation_600 (1536759274020012032)]} 0 2061


Regards,
Gilbert Boyreau



Re: Re-create shard with compositeId router and known hash range

2016-06-10 Thread Erick Erickson
Well, how brave do you want to be ;)? There's no great magic to the
Zookeeper nodes here. If you do everything just right you could create
one manually. By that I mean you could "hand edit" the znode with the
Zookeeper commands; you'd have to dig for the exact commands. You
_may_ be able to use the ADDREPLICA command, assuming that the shard
information is still in the ZK node. I haven't tried this however.
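For reference, if the shard entry were still present in the collection's state,
an ADDREPLICA call would look roughly like this (collection, shard and node
values are placeholders):

/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard3&node=192.168.1.10:8983_solr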

The first thing I'd do is see what's up with the replicas not coming
up. What does the Solr log show? There must be some info there. And,
assuming you still have a data directory there, your docs _may_
be intact. You may be able to simply move them (actually, the entire
collectionX_shardY_replicaZ directory) to some other Solr instance and bounce
the Solr server there. I don't _know_ that'll work, but what you'd see
is the node in Zookeeper magically change the IP address of the
replicas in question.

All that said, if the node is somehow permanently gone, you have to
re-index anyway to get the data back so recreating the collection
would be less fooling around.

Best,
Erick

On Wed, Jun 8, 2016 at 2:40 AM, Henrik Brautaset Aronsen
 wrote:
> Hi.
>
> We have a SolrCloud setup with 20 shards, each with only 1 replica, served
> on 8 servers.
>
> After a server went down we are left with 16 shards, which means that some
> of the compositeId hash ranges aren't hosted by any cores.  Somehow the
> shards/cores didn't come back after the server came up again.  I can see
> the server in /live_nodes.
>
> But all is not bad: The data in the collection is volatile with a TTS of 30
> minutes, and we have a failover in place that tries a new random
> compositeId whenever an "add" operation fails.
>
> My question is: Is it possible to re-create the missing shards or do I have
> to delete and create the collection from scratch?
>
> I know which hash ranges are are missing, but the CREATESHARD [1] API call
> doesn't support shards with the 'compositeId' router.  And I cannot use
> SPLITSHARD [2] since it only divides the original shard's hash.
>
> Best regards,
> Henrik
>
>
> [1]
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api8
> [2]
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3


Re: Questions regarding re-index when using Solr as a data source

2016-06-10 Thread Walter Underwood
Solr does not have transactions at all. The “commit” is really “submit batch”.

Solr does not have update. You can add, delete, or replace an entire document.

There is no optimistic concurrency control because there is no concurrency 
control. Clients can concurrently add documents to a batch, then any client can 
submit the entire batch.

Replication is not transactional. Replication is a file copy of the underlying 
indexes (classic) or copying the documents in a batch (Solr Cloud).

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 10, 2016, at 7:41 AM, Hui Liu  wrote:
> 
> Walter,
> 
>   Thank you for your advice. We are new to Solr and have been using 
> Oracle for past 10+ years, so we are used to the idea of having a tool that 
> can be used as both data store and also searchable by having indexes on top 
> of it. I guess the reason we are considering Solr as data store is due to it 
> has some features of a database that our application requires, such as 1) be 
> able to detect duplicate record by having a unique field; 2) allow us to do 
> concurrent update by using Optimistic concurrency control feature; 3) its 
> 'replication' feature allowing us to store multiple copies of data; so if we 
> were to use a file system, we will not have the above features (at least not 
> 1 and 2) and have to implement those ourselves. The other option is to pick 
> another database tool such as Mysql or Cassandra, then we will need to learn 
> and support an additional tool besides Solr; but you brought up several very 
> good points about operational factors we should consider if we pick Solr as a 
> data store. Also our application is more of a OLTP than OLAP. I will update 
> our colleagues and stakeholders about these concerns. Thanks again!
> 
> Regards,
> Hui
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Thursday, June 09, 2016 1:24 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Questions regarding re-index when using Solr as a data source
> 
> In the HowToReindex page, under “Using Solr as a Data Store”, it says this: 
> "Don't do this unless you have no other option. Solr is not really designed 
> for this role.” So don’t start by planning to do this.
> 
> Using a second copy of Solr is still using Solr as a repository. That doesn’t 
> satisfy any sort of requirements for disaster recovery. How do you know that 
> data is good? How do you make a third copy? How do you roll back to a 
> previous version? How do you deal with a security breach that affects all 
> your systems? Are the systems in the same data center? How do you deal with 
> ransomware (U. of Calgary paid $20K yesterday)?
> 
> If a consultant suggested this to me, I’d probably just give up and get a 
> different consultant.
> 
> Here is what we do for batch loading.
> 
> 1. For each Solr collection, we define a JSONL feed format, with a JSON 
> Schema.
> 2. The owners of the data write an extractor to pull the data out of wherever 
> it is, then generate the JSON feed.
> 3. We validate the JSON feed against the JSON schema.
> 4. If the feed is valid, we save it to Amazon S3 along with a manifest which 
> lists the version of the JSON Schema.
> 5. Then a multi-threaded loader reads the feed and sends it to Solr.
> 
> Reloading is safe and easy, because all the feeds in S3 are valid.
> 
> Storing backups in S3 instead of running a second Solr is massively cheaper, 
> easier, and safer.
> 
> We also have a clear contract between the content owners and the search team. 
> That contract is enforced by the JSON Schema on every single batch.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Jun 9, 2016, at 9:51 AM, Hui Liu  wrote:
>> 
>> Hi Walter,
>> 
>> Thank you for the reply, sorry I need to clarify what I mean by 'migrate 
>> tables' from Oracle to Solr, we are not literally move existing records from 
>> Oracle to Solr, instead, we are building a new application directly feed 
>> data into Solr as document and fields, in parallel of another existing 
>> application which feeds the same data into Oracle tables/columns, of course, 
>> the Solr schema will be somewhat different than Oracle; also we only keep 
>> those data for 90 days for user to search on, we hope once we run both 
>> system in parallel for some time (> 90 days), we will build up enough new 
>> data in Solr and we no longer need any old data in Oracle, by then we will 
>> be able to use Solr as our only data store.
>> 
>> It sounds to me that we may need to consider saving the data into either a file 
>> system or another database, in case we need to rebuild the indexes; and the 
>> reason I mentioned saving data into another Solr system is that I read this 
>> info from https://wiki.apache.org/solr/HowToReindex : so I am just trying to get 
>> feedback on whether there is any update

RE: Questions regarding re-index when using Solr as a data source

2016-06-10 Thread Hui Liu
Walter,

Thank you for your advice. We are new to Solr and have been using 
Oracle for the past 10+ years, so we are used to the idea of having a tool that can 
be used both as a data store and as something searchable by having indexes on top of 
it. I guess the reason we are considering Solr as a data store is that it has some 
features of a database that our application requires, such as: 1) it can detect 
duplicate records by having a unique field; 2) it allows us to do concurrent 
updates by using the optimistic concurrency control feature; 3) its replication 
feature allows us to store multiple copies of the data. If we were to use a 
file system, we would not have the above features (at least not 1 and 2) and would 
have to implement them ourselves. The other option is to pick another database 
such as MySQL or Cassandra, but then we would need to learn and support an 
additional tool besides Solr; you brought up several very good points about 
operational factors we should consider if we pick Solr as a data store. Also, our 
application is more of an OLTP than an OLAP system. I will update our colleagues and 
stakeholders about these concerns. Thanks again!
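
As a concrete illustration of 2), Solr's optimistic concurrency works by 
round-tripping the _version_ field. A minimal SolrJ sketch, with a hypothetical 
collection and field names, a Solr 5.x style client, and assuming the default 
real-time /get handler is enabled:

import java.util.Collections;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class OptimisticUpdateSketch {
  public static void main(String[] args) throws Exception {
    // Solr 5.x constructor; newer SolrJ versions use HttpSolrClient.Builder instead
    try (SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/orders")) {

      // Fetch the current document via real-time get and remember its _version_
      SolrDocument current = solr.getById("order-123");
      long version = (Long) current.getFieldValue("_version_");

      // Atomic update that is only applied if nobody changed the doc in between;
      // otherwise Solr answers with an HTTP 409 version conflict
      SolrInputDocument update = new SolrInputDocument();
      update.addField("id", "order-123");
      update.addField("status", Collections.singletonMap("set", "SHIPPED"));
      update.addField("_version_", version);

      solr.add(update);
      solr.commit();
    }
  }
}

A second writer racing on the same id gets the 409 back and has to re-read and 
retry, which is essentially the database-style behavior described above.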

Regards,
Hui
-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Thursday, June 09, 2016 1:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions regarding re-index when using Solr as a data source

In the HowToReindex page, under “Using Solr as a Data Store”, it says this: 
“Don't do this unless you have no other option. Solr is not really designed for 
this role.” So don’t start by planning to do this.

Using a second copy of Solr is still using Solr as a repository. That doesn’t 
satisfy any sort of requirements for disaster recovery. How do you know that 
data is good? How do you make a third copy? How do you roll back to a previous 
version? How do you deal with a security breach that affects all your systems? 
Are the systems in the same data center? How do you deal with ransomware (U. of 
Calgary paid $20K yesterday)?

If a consultant suggested this to me, I’d probably just give up and get a 
different consultant.

Here is what we do for batch loading.

1. For each Solr collection, we define a JSONL feed format, with a JSON Schema.
2. The owners of the data write an extractor to pull the data out of wherever 
it is, then generate the JSON feed.
3. We validate the JSON feed against the JSON schema.
4. If the feed is valid, we save it to Amazon S3 along with a manifest which 
lists the version of the JSON Schema.
5. Then a multi-threaded loader reads the feed and sends it to Solr.

Reloading is safe and easy, because all the feeds in S3 are valid.
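
A rough sketch of what step 3 can look like, assuming the everit-org json-schema 
library and made-up file names; the S3 upload, manifest and multi-threaded loader 
are left out:

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.everit.json.schema.Schema;
import org.everit.json.schema.ValidationException;
import org.everit.json.schema.loader.SchemaLoader;
import org.json.JSONException;
import org.json.JSONObject;

public class FeedValidator {
  public static void main(String[] args) throws Exception {
    // The JSON Schema is the contract between the content owners and the search team
    String rawSchema = new String(
        Files.readAllBytes(Paths.get("feed-schema.json")), StandardCharsets.UTF_8);
    Schema schema = SchemaLoader.load(new JSONObject(rawSchema));

    boolean valid = true;
    int lineNo = 0;
    // A JSONL feed is one JSON document per line; validate every line
    try (BufferedReader reader =
             Files.newBufferedReader(Paths.get("feed.jsonl"), StandardCharsets.UTF_8)) {
      String line;
      while ((line = reader.readLine()) != null) {
        lineNo++;
        if (line.trim().isEmpty()) {
          continue;
        }
        try {
          schema.validate(new JSONObject(line));
        } catch (ValidationException | JSONException e) {
          valid = false;
          System.err.println("line " + lineNo + ": " + e.getMessage());
        }
      }
    }
    // Only a fully valid feed would be uploaded to S3 (with its manifest) and loaded
    System.out.println(valid ? "feed is valid" : "feed rejected");
  }
}

Anything that fails validation never reaches S3, which is what makes the reload 
path trustworthy.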

Storing backups in S3 instead of running a second Solr is massively cheaper, 
easier, and safer.

We also have a clear contract between the content owners and the search team. 
That contract is enforced by the JSON Schema on every single batch.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 9, 2016, at 9:51 AM, Hui Liu  wrote:
> 
> Hi Walter,
> 
> Thank you for the reply. Sorry, I need to clarify what I mean by 'migrate 
> tables' from Oracle to Solr: we are not literally moving existing records from 
> Oracle to Solr; instead, we are building a new application that feeds data 
> directly into Solr as documents and fields, in parallel with another existing 
> application that feeds the same data into Oracle tables/columns. Of course, the 
> Solr schema will be somewhat different from the Oracle one. Also, we only keep 
> the data for 90 days for users to search on. We hope that once we run both systems 
> in parallel for some time (> 90 days), we will have built up enough new data in 
> Solr that we no longer need any old data in Oracle, and by then we will be able 
> to use Solr as our only data store.
> 
> It sounds to me that we may need to consider saving the data into either a file 
> system or another database, in case we need to rebuild the indexes; and the 
> reason I mentioned saving data into another Solr system is that I read this 
> info from https://wiki.apache.org/solr/HowToReindex : so I am just trying to get 
> feedback on whether there is any update on this approach, and whether there is a 
> better way to do this that minimizes the downtime caused by a schema change and 
> re-index. For example, in Oracle we are able to add a new column or a new index 
> online without any impact on existing queries, as the existing indexes remain intact.
> 
> Alternatives when a traditional reindex isn't possible
> 
> Sometimes the option of "do your indexing again" is difficult. Perhaps the 
> original data is very slow to access, or it may be difficult to get in the 
> first place.
> 
> Here's where we go against our own advice that we just gave you. Above we 
> said "don't use Solr itself as a datasource" ... but one way to deal with 
> data availability problems is to set up a completely separate Solr instance 
> (not distributed, which for SolrCloud means numShards=1) 

Re: Scoring changes between 4.10 and 5.5

2016-06-10 Thread Upayavira
Tracked it down to this ticket:

https://issues.apache.org/jira/browse/LUCENE-6590

which changed the implementation of normalize() in
org.apache.lucene.search.similarities.TFIDFSimilarity.

I've asked for comment on that ticket.

Upayavira

On Fri, 10 Jun 2016, at 01:39 AM, Ahmet Arslan wrote:
> Hi,
> 
> I wondered the same before and failed to decipher TFIDFSimilarity.
> Scoring looks like tf*idf*idf to me.
> 
> I appreciate someone who will shed some light on this.
> 
> Thanks,
> Ahmet
> 
> 
> 
> On Friday, June 10, 2016 12:37 AM, Upayavira  wrote:
> I've just done a very simple, single term query against a 4.10 system
> and a 5.5 system, each with much the same data.
> 
> The score for the 4.10 system was essentially made up of the field
> weight, which is:
>score = tf * idf 
> 
> Whereas, in the 5.5 system, there is an additional "query weight", which
> is idf * query norm. If query norm is 1, then the final score is now:
>   score = query_weight * field_weight
>   = ( idf * 1 ) * (tf * idf)
>   = tf * idf^2
> 
> Can anyone explain why this new "query weight" element has appeared in
> our scores somewhere between 4.10 and 5.5?
> 
> Thanks!
> 
> Upayavira
> 
> 4.10 score 
>   "2937439": {
> "match": true,
> "value": 5.5993805,
> "description": "weight(description:obama in 394012)
> [DefaultSimilarity], result of:",
> "details": [
>   {
> "match": true,
> "value": 5.5993805,
> "description": "fieldWeight in 394012, product of:",
> "details": [
>   {
> "match": true,
> "value": 1,
> "description": "tf(freq=1.0), with freq of:",
> "details": [
>   {
> "match": true,
> "value": 1,
> "description": "termFreq=1.0"
>   }
> ]
>   },
>   {
> "match": true,
> "value": 5.5993805,
> "description": "idf(docFreq=56010, maxDocs=5568765)"
>   },
>   {
> "match": true,
> "value": 1,
> "description": "fieldNorm(doc=394012)"
>   }
> ]
>   }
> ]
> 5.5 score 
>   "2502281":{
> "match":true,
> "value":28.51136,
> "description":"weight(description:obama in 43472) [], result
> of:",
> "details":[{
> "match":true,
> "value":28.51136,
> "description":"score(doc=43472,freq=1.0), product of:",
> "details":[{
> "match":true,
> "value":5.339603,
> "description":"queryWeight, product of:",
> "details":[{
> "match":true,
> "value":5.339603,
> "description":"idf(docFreq=31905,
> maxDocs=2446459)"},
>   {
> "match":true,
> "value":1.0,
> "description":"queryNorm"}]},
>   {
> "match":true,
> "value":5.339603,
> "description":"fieldWeight in 43472, product of:",
> "details":[{
> "match":true,
> "value":1.0,
> "description":"tf(freq=1.0), with freq of:",
> "details":[{
> "match":true,
> "value":1.0,
> "description":"termFreq=1.0"}]},
>   {
> "match":true,
> "value":5.339603,
> "description":"idf(docFreq=31905,
> maxDocs=2446459)"},
>   {
> "match":true,
> "value":1.0,
> "description":"fieldNorm(doc=43472)"}]}]}]},


Re: Bypassing ExtractingRequestHandler

2016-06-10 Thread Charlie Hull

On 10/06/2016 02:20, Justin Lee wrote:

Has anybody had any experience bypassing ExtractingRequestHandler and
simply managing Tika manually?  I want to make a small modification to Tika
to get and save additional data from my PDFs, but I have been
procrastinating in no small part due to the unpleasant prospect of setting
up a development environment where I could compile and debug modifications
that might run through PDFBox, Tika, and ExtractingRequestHandler.  It
occurs to me that it would be much easier if the two were separate, so I
could have direct control over Tika and just submit the text to Solr after
extraction.  Am I going to regret this approach?  I'm not sure what
ExtractingRequestHandler really does for me that Tika doesn't already do.


We tend to prefer running Tika externally as it's entirely possible that 
Tika will crash or hang with certain files - and that will bring down 
Solr if you're running Tika within it. Here's a Dropwizard wrapper 
around Tika that might be of use:

https://github.com/mattflax/dropwizard-tika-server
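
For reference, the external route is roughly this: plain Tika plus SolrJ, with 
hypothetical field names and core URL, and the Solr 5.x style client constructor.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractAndIndex {
  public static void main(String[] args) throws Exception {
    // Run Tika in this process, completely outside the Solr JVM
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1);  // -1: no write limit
    Metadata metadata = new Metadata();
    try (InputStream in = Files.newInputStream(Paths.get("sample.pdf"))) {
      parser.parse(in, handler, metadata);
    }

    // Build a plain Solr document from whatever Tika extracted
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "sample.pdf");
    String title = metadata.get(TikaCoreProperties.TITLE);
    if (title != null) {
      doc.addField("title", title);
    }
    doc.addField("content", handler.toString());

    // Index it like any other document (Solr 5.x constructor shown here)
    try (SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/docs")) {
      solr.add(doc);
      solr.commit();
    }
  }
}

If Tika hangs or crashes on a bad PDF, it does so in this process rather than 
inside Solr, and you keep full control over which metadata ends up in which field.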

Cheers

Charlie


Also, I was reading this stackoverflow entry and someone offhandedly mentioned that
ExtractingRequestHandler might be separated in the future anyway. Is there
a public roadmap for the project, or does one have to keep up with the
developer's mailing list and hunt through JIRA entries to keep up with the
pulse of the project?

Thanks,
Justin




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Question about multiple fq parameters

2016-06-10 Thread Mikhail Khludnev
Ahmet,

Honestly I don't know, but googling gives:
More DateRangeField Details
https://cwiki.apache.org/confluence/display/solr/Working+with+Dates
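
From a quick read of that page, op is a local param that only does something for 
fields of type solr.DateRangeField, where it picks the range predicate 
(Intersects, the default, Contains or Within), for example something like

fq={!field f=DateA op=Contains}[2020-01-01 TO 2030-01-01]

On a Trie date field it appears to be ignored, which would line up with what 
Shawn saw in the parser code below.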

On Fri, Jun 10, 2016 at 3:44 AM, Ahmet Arslan 
wrote:

> Hi Mikhail,
>
> Can you please explain what this mysterious op parameter is?
> How is it related to range queries issued on date fields?
>
> Thanks,
> Ahmet
>
>
> On Thursday, June 9, 2016 11:43 AM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
> Shawn,
> I found "op" at
> org.apache.solr.schema.DateRangeField.parseSpatialArgs(QParser, String).
>
>
> On Thu, Jun 9, 2016 at 1:46 AM, Shawn Heisey  wrote:
>
> > On 6/8/2016 2:28 PM, Steven White wrote:
> > >
> ?q=*:*&q.op=OR&fq={!field+f=DateA+op=Intersects}[2020-01-01+TO+2030-01-01]
> >
> > Looking at this and checking the code for the Field query parser, I
> > cannot see how what you have used above is any different than:
> >
> > fq=DateA:[2020-01-01 TO 2030-01-01]
> >
> > The "op=Intersects" parameter that you have included appears to be
> > ignored by the parser code that I examined.
> >
> > If my understanding of the documentation and the code is correct, then
> > you should be able to use this:
> >
> > fq=DateB:[2000-01-01 TO 2020-01-01] OR DateA:[2020-01-01 TO 2030-01-01]
> >
> > In my examples I have changed the URL encoded "+" character back to a
> > regular space.
> >
> > Thanks,
> > Shawn
> >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Solutions for Multi-word Synonyms

2016-06-10 Thread Bernd Fehling
As Doug said,
you should really try to build your own solution for Multi-word Synonyms
because every need is different and you can customize it for your special
use case, like adding a Thesaurus.

http://www.ub.uni-bielefeld.de/~befehl/base/solr/InsideBase_eurovocThesaurus.html

Regards
Bernd

On 09.06.2016 at 17:06, Doug Turnbull wrote:
> Mary Jo,
> 
> Honestly half the time I run into this problem, I end up creating a
> QParserPlugin because I need to do something specific. With a QParserPlugin
> I can run whatever analysis, slicing and dicing of the query string to
> manually construct whatever I need to
> 
> http://www.supermind.org/blog/1134/custom-solr-queryparsers-for-fun-and-profit
> 
> One thing I often do is repeat the functionality of Elasticsearch's match
> query. Elasticsearch's match query does the following:
> 
> - Analyze the query string using the field's query-time analyzer
> - Create an OR query with the tokens that come out of the analysis
> 
> You can look at the field query parser as something of a starting point for
> this.
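
A bare-bones sketch of that kind of parser; the exact base-class details vary a 
bit between Solr versions, and the class and param names here are made up:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

/** {!match f=myfield}some user query  becomes an OR of the analyzed tokens */
public class MatchQParserPlugin extends QParserPlugin {

  public void init(NamedList args) {
    // no configuration needed for this sketch
  }

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        String field = localParams.get("f");
        // Run the field's query-time analyzer over the raw query string
        Analyzer analyzer = req.getSchema().getFieldType(field).getQueryAnalyzer();
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        try (TokenStream ts = analyzer.tokenStream(field, qstr)) {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            // Every emitted token, including synonym expansions, becomes a SHOULD clause
            bq.add(new TermQuery(new Term(field, term.toString())),
                   BooleanClause.Occur.SHOULD);
          }
          ts.end();
        } catch (IOException e) {
          throw new SyntaxError("analysis failed: " + e.getMessage());
        }
        return bq.build();
      }
    };
  }
}

It would be registered in solrconfig.xml with something like
<queryParser name="match" class="com.example.MatchQParserPlugin"/> and then used
as {!match f=title}multi word synonyms, typically inside a bq as Doug describes.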
> 
> I usually do this in the context of a boost query, not as the main edismax
> query.
> 
> If I have time, this is something I've been meaning to open source.
> 
> Best
> -Doug
> 
> On Tue, Jun 7, 2016 at 2:51 PM Joe Lawson 
> wrote:
> 
>> I'm sorry I wasn't more specific, I meant we were hijacking the thread with
>> the question, "Anyone used a different method of
>> handling multi-term synonyms that isn't as global?" as the original thread
>> was about getting synonym_edismax running.
>>
>> On Tue, Jun 7, 2016 at 2:24 PM, MaryJo Sminkey 
>> wrote:
>>
>>>> MaryJo you might want to start a new thread, I think we kinda hijacked this
>>>> one. Also if you are interested in tuning queries check out
>>>> http://splainer.io/ and https://www.quepid.com which are interactive tools
>>>> (both of which my company makes) to tune for search relevancy.

>>>
>>>
>>> Okay, I changed the subject. But I don't need a tuning tool, I already know
>>> WHY I'm not getting the results I need; the problem is how to fix it or get
>>> around what the plugin is doing. Which is why I was inquiring whether people
>>> have had success with something other than this particular plugin for
>>> more advanced queries that it messes around with. It seems to do a good job
>>> if you aren't doing anything particularly complicated with your search
>>> logic, but I don't see a good way to solve the issue I'm having, and a
>>> tuning tool isn't really going to help with that. We were pretty happy with
>>> our search relevancy for the most part *other* than the problem with the
>>> multi-term synonyms not working reliably, but I definitely can't lose the
>>> relevancy we had just to get those working.
>>>
>>> In reviewing your tools previously, the problem as I recall is that they
>>> rely on querying Solr directly, while our searches go through multiple
>>> levels of an application which include a lot of additional logic in terms
>>> of what data gets sent to Solr, so they just aren't going to
>>> be much use for us. It was easier for me to just write my own tool that
>>> essentially does the same kind of thing, but with my application logic
>>> built in.
>>>
>>> Mary Jo
>>>
>>
> 

-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)               LibTec - Library Technology
Universitätsstr. 25              and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060            bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************