Faceting search issues

2016-09-26 Thread Beyene, Iyob
Hi,

When I query Solr using faceted search to check for duplicates with the
following request,

http://localhost:8983/solr/core/select?q=*:*&facet=true&facet.field=name&facet.mincount=2

I get the following response with no facet data.


{"responseHeader": {"status": 0,"QTime": 541,"params": {"q": "*:*",
"facet.field": "name","facet.mincount": "2","rows": "0","facet": 
"true"}},"response": {"numFound": 316544,"start": 0,"maxScore": 1,"docs": 
[]},"facet_counts": {"facet_queries": {},"facet_fields": {"name": 
[]},"facet_dates": {},"facet_ranges": {},"facet_intervals": 
{},"facet_heatmaps": {}}}


but when I specify the name in fq,

http://localhost:8983/solr/core/select?q=*:*&facet=true&facet.field=name&facet.mincount=2&fq=name:elephant

I get a facet result like this:

{"responseHeader": {"status": 0,"QTime": 541,"params": {"q": 
"*:*","facet.field": "name","fq": "name:elephant","facet.mincount": "2","rows": 
"0","facet": "true"}},"response": {"numFound": 2,"start": 0,"maxScore": 
1,"docs": []},"facet_counts": {"facet_queries": {},"facet_fields": {"name": 
["elephant",4]},"facet_dates": {},"facet_ranges": {},"facet_intervals": 
{},"facet_heatmaps": {}}}


The field I am basing the facet search on is defined like below




Is there some variation of faceting that could help me analyze the difference?

Thanks

Iyob







Re: Faceting and Grouping Performance Degradation in Solr 5

2016-09-26 Thread Solr User
Thanks again for your work on honoring the facet.method.  I have an
observation that I would like to share and get your feedback on if possible.

I performance tested Solr 5.5.2 with various facet queries and the only way
I get comparable results to Solr 4.8.1 is when I expungeDeletes.  Is it
possible that Solr 5 is not as efficiently ignoring deletes as Solr 4?
Here are the details.

Scenario #1:  Using facet.method=uif with faceting on several multi-valued
fields.
4.8.1 (with deletes): 115 ms
5.5.2 (with deletes): 155 ms
5.5.2 (without deletes): 125 ms
5.5.2 (1 segment without deletes): 44 ms

Scenario #2:  Using facet.method=enum with faceting on several multi-valued
fields.  These fields are different than Scenario #1 and perform much
better with enum hence that method is used instead.
4.8.1 (with deletes): 38 ms
5.5.2 (with deletes): 49 ms
5.5.2 (without deletes): 42 ms
5.5.2 (1 segment without deletes): 34 ms
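
For reference, expunging deletes and merging down to a single segment (the last
measurement in each scenario) can be requested through the update handler; the
core name here is a placeholder:

  curl 'http://localhost:8983/solr/mycore/update?commit=true&expungeDeletes=true'
  curl 'http://localhost:8983/solr/mycore/update?optimize=true&maxSegments=1'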



On Tue, May 31, 2016 at 11:57 AM, Alessandro Benedetti <
abenede...@apache.org> wrote:

> Interesting developments :
>
> https://issues.apache.org/jira/browse/SOLR-9176
>
> I think we found why term Enum seems slower in recent Solr !
> In our case it is likely to be related to the commit I mention in the Jira.
> Have a check Joel !
>
> On Wed, May 25, 2016 at 12:30 PM, Alessandro Benedetti <
> abenede...@apache.org> wrote:
>
> > I am investigating this scenario right now.
> > I can confirm that the enum slowness is in Solr 6.0 as well.
> > And I agree with Joel, it seems to be un-related with the famous faceting
> > regression :(
> >
> > Furthermore with the legacy facet approach, if you set docValues for the
> > field you are not going to be able to try the enum approach anymore.
> >
> > org/apache/solr/request/SimpleFacets.java:448
> >
> > if (method == FacetMethod.ENUM && sf.hasDocValues()) {
> >   // only fc can handle docvalues types
> >   method = FacetMethod.FC;
> > }
> >
> >
> > I got really horrible regressions simply using term enum in both Solr 4
> > and Solr 6.
> >
> > And even the most optimized fcs approach with docValues and
> > facet.threads=nCore does not perform as the simple enum in Solr 4 .
> >
> > i.e.
> >
> > For some sample queries I have 40 ms vs 160 ms and similar...
> > I think we should open an issue if we can confirm it is not related with
> > the other.
> > A lot of people will continue using the legacy approach for a while...
> >
> > On Wed, May 18, 2016 at 10:42 PM, Joel Bernstein 
> > wrote:
> >
> >> The enum slowness is interesting. It would appear on the surface to not
> be
> >> related to the FieldCache issue. I don't think the main emphasis of the
> >> JSON facet API has been the enum approach. You may find using the JSON
> >> facet API and eliminating the use of enum meets your performance needs.
> >>
> >> With the CollapsingQParserPlugin top_fc is definitely faster during
> >> queries. The tradeoff is slower warming times and increased memory usage
> >> if
> >> the collapse fields are used in faceting, as faceting will load the
> field
> >> into a different cache.
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Wed, May 18, 2016 at 5:28 PM, Solr User  wrote:
> >>
> >> > Joel,
> >> >
> >> > Thank you for taking the time to respond to my question.  I tried the
> >> JSON
> >> > Facet API for one query that uses facet.method=enum (since this one
> has
> >> a
> >> > ton of unique values and performed better with enum) but this was way
> >> > slower than even the slower Solr 5 times.  I did not try the new API
> >> with
> >> > the non-enum queries though so I will give that a go.  It looks like
> >> Solr
> >> > 5.5.1 also has a facet.method=uif which will be interesting to try.
> >> >
> >> > If these do not prove helpful, it looks like I will need to wait for
> >> > SOLR-8096 to be resolved before upgrading.
> >> >
> >> > Thanks also for your comment on top_fc for the CollapsingQParser.  I
> use
> >> > collapse/expand for some queries but traditional grouping for others
> >> due to
> >> > performance.  It will be interesting to see if those grouping queries
> >> > perform better now using CollapsingQParser with top_fc.
> >> >
> >> > On Wed, May 18, 2016 at 11:39 AM, Joel Bernstein 
> >> > wrote:
> >> >
> >> > > Yes, SOLR-8096 is the issue here.
> >> > >
> >> > > I don't believe indexing with docValues is going to help too much
> with
> >> > > this. The enum slowness may not be related, but I'm not positive
> about
> >> > > that.
> >> > >
> >> > > The major slowdowns are likely due to the removal of the top level
> >> > > FieldCache from general use and the removal of the FieldValuesCache
> >> which
> >> > > was used for multi-value field faceting.
> >> > >
> >> > > The JSON facet API covers all the functionality in the traditional
> >> > > faceting, and it has been developed to be very performant.
> >> > >
> >> > > You may also want to see if Collapse/Expand can meet your
> 

Re: JNDI settings

2016-09-26 Thread xavier jmlucjav
I did set up JNDI for DIH once, and you have to tweak the Jetty setup. Of
course, Solr now runs its own Jetty instance; the old approach of deploying it
as a plain war no longer applies. I don't remember where, but there should be
instructions somewhere; it took me an afternoon to get it set up properly.
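
For reference, a rough sketch of the two pieces involved (the JNDI name, driver,
URL and credentials are placeholders; this assumes Jetty's jndi/plus modules are
enabled and the JDBC driver plus commons-dbcp jars are on Jetty's classpath).
The resource definition goes wherever your Jetty instance picks up configuration,
e.g. jetty.xml or an included file:

  <New id="myDataSource" class="org.eclipse.jetty.plus.jndi.Resource">
    <Arg></Arg>
    <Arg>jdbc/myds</Arg>
    <Arg>
      <New class="org.apache.commons.dbcp.BasicDataSource">
        <Set name="driverClassName">org.postgresql.Driver</Set>
        <Set name="url">jdbc:postgresql://dbhost:5432/mydb</Set>
        <Set name="username">user</Set>
        <Set name="password">pass</Set>
      </New>
    </Arg>
  </New>

and the DIH config then refers to it by name:

  <dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/myds"/>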

xavier

On Wed, Sep 21, 2016 at 1:15 PM, Aristedes Maniatis 
wrote:

> On 13/09/2016 1:29am, Aristedes Maniatis wrote:
> > I am using Solr 5.5 and wanting to add JNDI settings to Solr (for data
> import). I'm new to Solr Cloud setup (previously I was running Solr running
> as a custom bundled war) so I can't figure where to put the JNDI settings
> with user/pass themselves.
> >
> > I don't want to add it to jetty.xml because that's part of the packaged
> application which will be upgraded from time to time.
> >
> > Should it go into solr.xml inside the solr.home directory? If so, what's
> the right syntax there?
>
>
> Just a follow up on this question. Does anyone know of how I can add JNDI
> settings to Solr without overwriting parts of the application itself?
>
> Cheers
> Ari
>
>
>
> --
> -->
> Aristedes Maniatis
> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A
>


Re: issue transplanting standalone core into solrcloud (plus upgrade)

2016-09-26 Thread xavi jmlucjav
I guess there is no way around a reindex:
- of course, not all fields are stored; that would have been too easy
- it might (??) work if, as Jan says, I build a custom Solr version with the
removed IntField classes added back, but going down this rabbit hole sounds too
risky and too much work, and I'm not sure it would eventually work, especially
considering the last point:
- I did not get any response to this, but my understanding now is that you
cannot take a standalone Solr core's /data directory (without a _version_ field)
and put it into a SolrCloud setup, as _version_ is needed.

xavier

On Mon, Sep 26, 2016 at 9:21 PM, Jan Høydahl  wrote:

> If all the fields in your current schema has stored=“true”, you can try to
> export
> the full index to an XML file which can then be imported into 6.1.
> If some fields are not stored you will only be able to recover the
> inverted index
> representation of that data, which may not be enough to recreate the
> original
> data (or in some cases maybe it is enough).
>
> If you share a copy of your old schema.xml we may be able to help.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 26. sep. 2016 kl. 20.39 skrev Shawn Heisey :
> >
> > On 9/26/2016 6:28 AM, xavi jmlucjav wrote:
> >> Yes, I had to change some fields, basically to use TrieIntField etc
> >> instead
> >> of the old IntField. I was assuming by using the IndexUpgrader to
> upgrade
> >> the data to 6.1, the older IntField would work with the new
> TrieIntField.
> >> But I have tried loading the upgraded data into a standalone 6.1 and I
> am
> >> hitting the same issue, so this is not related to _version_ field (more
> on
> >> that below). Forget about solrcloud for now, having an old 3.6 index,
> >> should it be possible to use IndexUpgrader and load it on 6.1? How would
> >> one need to handle IntFields etc?
> >
> > The only option when you change the class on a field in your schema is
> > to wipe the index and rebuild it.  TrieIntField uses a completely
> > different on-disk data format than IntField did.  The two formats simply
> > aren't compatible.  This is not a bug, it's a fundamental fact of Lucene
> > indexes.
> >
> > Lucene doesn't use a schema -- that's a Solr concept.  IndexUpgrader is
> > a Lucene program that doesn't know what kind of data each field
> > contains, it just reaches down into the old index format, grabs the
> > internal data in each field, and copies it to a new index using the new
> > format.  The internal data must still be consistent with the Lucene
> > program for the index to work in a new version.  When you're running
> > Solr, it uses the schema to know how to read the index.
> >
> > In 5.x and 6.x, IntField does not exist, and attempting to read that
> > data using TrieIntField will not work.
> >
> > The luceneMatchVersion setting in solrconfig.xml can cause certain
> > components (tokenizers and filters mainly) to revert to old behavior in
> > the previous major version.  Version 6.x doesn't hold onto behavior from
> > 3.x and 4.x -- it can only revert behavior back to 5.x versions.
> >
> > The luceneMatchVersion setting cannot bring back removed classes like
> > IntField, and it does NOT affect the on-disk index format.
> >
> > Your particular situation will require a full reindex.  It is not
> > possible to upgrade an index using those old class types.
> >
> > Thanks,
> > Shawn
> >
>
>


Re: issue transplanting standalone core into solrcloud (plus upgrade)

2016-09-26 Thread Jan Høydahl
If all the fields in your current schema has stored=“true”, you can try to 
export
the full index to an XML file which can then be imported into 6.1.
If some fields are not stored you will only be able to recover the inverted 
index
representation of that data, which may not be enough to recreate the original
data (or in some cases maybe it is enough).
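>
> A rough sketch of that export/import (core and collection names, paths and
> page sizes are illustrative; the dump has to run against an install that can
> still open the old index): page through the stored fields with something like
>
>   curl 'http://localhost:8983/solr/oldcore/select?q=*:*&fl=*&wt=xml&rows=10000&start=0' > batch_0.xml
>
> then transform each response into <add><doc>...</doc></add> update format
> (the updateXml.xsl stylesheet shipped with the example configs can do this via
> wt=xslt&tr=updateXml.xsl) and post it to the new collection:
>
>   curl 'http://localhost:8983/solr/newcollection/update?commit=true' -H 'Content-Type: text/xml' --data-binary @batch_0_add.xml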

If you share a copy of your old schema.xml we may be able to help.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 26. sep. 2016 kl. 20.39 skrev Shawn Heisey :
> 
> On 9/26/2016 6:28 AM, xavi jmlucjav wrote:
>> Yes, I had to change some fields, basically to use TrieIntField etc
>> instead
>> of the old IntField. I was assuming by using the IndexUpgrader to upgrade
>> the data to 6.1, the older IntField would work with the new TrieIntField.
>> But I have tried loading the upgraded data into a standalone 6.1 and I am
>> hitting the same issue, so this is not related to _version_ field (more on
>> that below). Forget about solrcloud for now, having an old 3.6 index,
>> should it be possible to use IndexUpgrader and load it on 6.1? How would
>> one need to handle IntFields etc?
> 
> The only option when you change the class on a field in your schema is
> to wipe the index and rebuild it.  TrieIntField uses a completely
> different on-disk data format than IntField did.  The two formats simply
> aren't compatible.  This is not a bug, it's a fundamental fact of Lucene
> indexes.
> 
> Lucene doesn't use a schema -- that's a Solr concept.  IndexUpgrader is
> a Lucene program that doesn't know what kind of data each field
> contains, it just reaches down into the old index format, grabs the
> internal data in each field, and copies it to a new index using the new
> format.  The internal data must still be consistent with the Lucene
> program for the index to work in a new version.  When you're running
> Solr, it uses the schema to know how to read the index.
> 
> In 5.x and 6.x, IntField does not exist, and attempting to read that
> data using TrieIntField will not work.
> 
> The luceneMatchVersion setting in solrconfig.xml can cause certain
> components (tokenizers and filters mainly) to revert to old behavior in
> the previous major version.  Version 6.x doesn't hold onto behavior from
> 3.x and 4.x -- it can only revert behavior back to 5.x versions.
> 
> The luceneMatchVersion setting cannot bring back removed classes like
> IntField, and it does NOT affect the on-disk index format.
> 
> Your particular situation will require a full reindex.  It is not
> possible to upgrade an index using those old class types.
> 
> Thanks,
> Shawn
> 



Re: issue transplanting standalone core into solrcloud (plus upgrade)

2016-09-26 Thread Shawn Heisey
On 9/26/2016 6:28 AM, xavi jmlucjav wrote:
> Yes, I had to change some fields, basically to use TrieIntField etc
> instead
> of the old IntField. I was assuming by using the IndexUpgrader to upgrade
> the data to 6.1, the older IntField would work with the new TrieIntField.
> But I have tried loading the upgraded data into a standalone 6.1 and I am
> hitting the same issue, so this is not related to _version_ field (more on
> that below). Forget about solrcloud for now, having an old 3.6 index,
> should it be possible to use IndexUpgrader and load it on 6.1? How would
> one need to handle IntFields etc?

The only option when you change the class on a field in your schema is
to wipe the index and rebuild it.  TrieIntField uses a completely
different on-disk data format than IntField did.  The two formats simply
aren't compatible.  This is not a bug, it's a fundamental fact of Lucene
indexes.

Lucene doesn't use a schema -- that's a Solr concept.  IndexUpgrader is
a Lucene program that doesn't know what kind of data each field
contains, it just reaches down into the old index format, grabs the
internal data in each field, and copies it to a new index using the new
format.  The internal data must still be consistent with the Lucene
program for the index to work in a new version.  When you're running
Solr, it uses the schema to know how to read the index.
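
For reference, IndexUpgrader is invoked directly against the index directory,
roughly like this (jar versions and paths are illustrative):

  java -cp lucene-core-6.1.0.jar:lucene-backward-codecs-6.1.0.jar \
       org.apache.lucene.index.IndexUpgrader -delete-prior-commits /var/solr/data/mycore/data/index

Note that each IndexUpgrader release can only read indexes written by the
previous major version, so a 3.6 index would have to be stepped through 4.x and
5.x before a 6.x upgrader (or Solr 6.x) will open it - and even then the
schema-mismatch problem described here remains.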

In 5.x and 6.x, IntField does not exist, and attempting to read that
data using TrieIntField will not work.

The luceneMatchVersion setting in solrconfig.xml can cause certain
components (tokenizers and filters mainly) to revert to old behavior in
the previous major version.  Version 6.x doesn't hold onto behavior from
3.x and 4.x -- it can only revert behavior back to 5.x versions.

The luceneMatchVersion setting cannot bring back removed classes like
IntField, and it does NOT affect the on-disk index format.

Your particular situation will require a full reindex.  It is not
possible to upgrade an index using those old class types.

Thanks,
Shawn



Re: Challenges with new Solrcloud Backup/Restore functionality

2016-09-26 Thread Hrishikesh Gadre
Hi Stephen,

Regarding #1, can you verify the following steps during backup/restore?

- Before backup command, make sure to run a "hard" commit on the original
collection. The backup operation will capture only hard committed data.
- After restore command, check the Solr web UI to verify that all replicas
of the new (or restored) collection are in the "active" state. During my
testing, I found that when one or more replicas are in "recovery" state,
the doc count of the restored collection doesn't match the doc count of the
original collection. But after the recovery is complete, the doc counts
match. I will file a JIRA to fix this issue.
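
For reference, a minimal sequence with the Collections API looks like this
(collection name, backup name and location are placeholders; the location must
be reachable from every node):

  curl 'http://localhost:8983/solr/mycollection/update?commit=true'
  curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=mybackup&collection=mycollection&location=/mnt/backups'
  curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=mybackup&collection=mycollection_restored&location=/mnt/backups'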

Thanks
Hrishikesh

On Mon, Sep 26, 2016 at 9:34 AM, Stephen Weiss  wrote:

> #2 - that's great news.  I'll try to patch it in and test it out.
>
> #1 - In all cases, the backup and restore both appear successful.  There
> are no failure messages for any of the shards, no warnings, etc - I didn't
> even realize at first that data was missing until I noticed differences in
> some of the query results when we were testing.  Either manual restore of
> the data or using the restore API (with all data on one node), we see the
> same, so I think it's more a problem in the backup process than the restore
> process.
>
> If there's any kind of debugging output we can provide that can help solve
> this, let me know.
>
> --
> Steve
>
> On Sun, Sep 25, 2016 at 7:17 PM, Hrishikesh Gadre 
> wrote:
>
>> Hi Steve,
>>
>> Regarding the 2nd issue, a JIRA is already created and patch is uploaded
>> (SOLR-9527). Can someone review and commit the patch?
>>
>> Regarding the 1st issue, does the backup command succeed? Also do you see any
>> warning/error log messages? How about the restore command?
>>
>> Thanks
>> Hrishikesh
>>
>>
>>
>> On Sat, Sep 24, 2016 at 12:14 PM, Stephen Weiss 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> We're very excited about SolrCloud's new backup / restore collection
>>> APIs, which should introduce some major new efficiencies into our indexing
>>> workflow.  Unfortunately, we've run into some snags with it that are
>>> preventing us from moving into production.  I was hoping someone on the
>>> list could help.
>>>
>>> 1) Data inconsistencies
>>>
>>> There seems to be a problem getting all the data consistently.
>>> Sometimes, the backup will contain all of the data in the collection, and
>>> sometimes, large portions of the collection (as much as 40%) will be
>>> missing.  We haven't quite figured out what might cause this yet, although
>>> one thing I've noticed is the chances of success are greater when we are
>>> only backing up one collection at a time.  Unfortunately, for our workflow,
>>> it will be difficult to make that work, and there still doesn't seem to be
>>> a guarantee of success either way.
>>>
>>> 2) Shards are not distributed
>>>
>>> To make matters worse, for some reason, any type of restore operation
>>> always seems to put all shards of the collection on the same node.  We've
>>> tried setting maxShardsPerNode to 1 in the restore command, but this has no
>>> effect.  We are seeing the same behavior on both 6.1 and 6.2.1.  No matter
>>> what we do, all the shards always go to the same node - and it's not even
>>> the node that we execute the restore request on, but oddly enough, a
>>> totally different node, and always the same one (the 4th one).  It should
>>> be noted that all nodes of our 8 node cloud are up and totally functional
>>> when this happens.
>>>
>>> To work around this, we wrote up a quick script to create an empty
>>> collection, which always distributes itself across the cloud quite well
>>> (another indication that there's nothing wrong with the nodes themselves),
>>> and then we rsync the individual shards' data into the empty shards and
>>> reload the collection.  This works fine, however, because of the data
>>> inconsistencies mentioned above, we can't really move forward anyway.
>>>
>>>
>>> Problem #2, we have a reasonable workaround for, but problem #1 we do
>>> not.  If anyone has any thoughts about either of these problems, I would be
>>> very grateful.  Thanks!
>>>
>>> --
>>> Steve
>>>

Re: -field1:value1 OR field2:value2

2016-09-26 Thread Shawn Heisey
On 9/26/2016 3:56 AM, Sandeep Khanzode wrote:
> Hi Alex, It seems that this is not an issue with the AND clause. For
> example, if I do ... field1:value1 AND -field2:value2 ... the results
> seem to be an intersection of both. Is this an issue with OR, which is
> why we replace it with an implicit (*:* NOT)?

This is a fairly common issue that users run into.  I just created a
wiki page to explain the situation.  Consider it a starting point that
can be refined by additional edits:

https://wiki.apache.org/solr/NegativeQueryProblems
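
The short version, for reference: a purely negative clause matches nothing on
its own, so anchor it to all documents first, e.g.

  -field1:value1 OR field2:value2            (pure negative clause inside a BooleanQuery - fewer results than expected)
  (*:* -field1:value1) OR field2:value2      (explicit "all docs minus value1" - behaves as the intended union)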

Thanks,
Shawn



Re: How to retrieve parent documents without a nested structure (block-join)

2016-09-26 Thread shamik
Thanks Alex, this has been extremely helpful. There's one doubt though.

The query returns expected result if I use "select" or "query" request
handler, but fails for others. Here's the debug output from "/select" using
edismax.

http://localhost:8983/solr/techproducts/query?q=({!join%20from=manu_id_s%20to=id}ipod)(name:GB18030%20-manu_id_s:*)=id,title=query=xml=false=false=edismax

*(+(JoinQuery({!join from=manu_id_s to=id}text:ipod)
(name:gb18030 -manu_id_s:*)))/no_coord

+({!join from=manu_id_s to=id}text:ipod
(name:gb18030 -manu_id_s:*))
*

Now, if I use "/browse", I don't get any results back. Here's a snippet from
browse request handler config.


 
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="v.layout">layout</str>
    <str name="title">Solritas</str>

    <str name="defType">edismax</str>
    <str name="qf">
      text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
      title^10.0 description^5.0 keywords^5.0 author^2.0
      resourcename^1.0 subject^0.5
    </str>
    <str name="mm">100%</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>

As you can see, I've defined the "qf" fields with defType as edismax.

Here's the query:

http://localhost:8983/solr/techproducts/browse?q=({!join%20from=manu_id_s%20to=id}ipod)(name:GB18030%20-manu_id_s:*)=query=xml=false=false

Output:

*(+((JoinQuery({!join from=manu_id_s
to=id}text:ipod) (DisjunctionMaxQuery((keywords:name:gb18030^5.0 |
author:name:gb18030^2.0 | ((subject:name subject:gb subject:18030)~3)^0.5 |
manu:name:gb18030^1.1 | ((description:name description:gb
description:18030)~3)^5.0 | ((title:name title:gb title:18030)~3)^10.0 |
features:name:gb18030 | cat:name:GB18030^1.4 | name:name:gb18030^1.2 |
text:name:gb18030^0.5 | id:name:GB18030^10.0 | resourcename:name:gb18030 |
sku:"namegb 18030"^1.5)) -manu_id_s:*))~2))/no_coord

+(({!join from=manu_id_s to=id}text:ipod
((keywords:name:gb18030^5.0 | author:name:gb18030^2.0 | ((subject:name
subject:gb subject:18030)~3)^0.5 | manu:name:gb18030^1.1 |
((description:name description:gb description:18030)~3)^5.0 | ((title:name
title:gb title:18030)~3)^10.0 | features:name:gb18030 | cat:name:GB18030^1.4
| name:name:gb18030^1.2 | text:name:gb18030^0.5 | id:name:GB18030^10.0 |
resourcename:name:gb18030 | sku:"namegb 18030"^1.5) -manu_id_s:*))~2)*

If I remove the join query condition ({!join from=manu_id_s to=id}ipod),
the query returns results based on the second condition.

The other doubt I have is why "text" is getting picked as the default field in
the join condition. I've defined the "df" fields in "browse", which are being
used in the second condition. Do I need to explicitly set the df fields
inside the join condition?

The other thing I've noticed is the difference in the parsed query if I add a
space in between the two clauses. For example, *q=({!join from=manu_id_s
to=id}ipod) (name:GB18030 -manu_id_s:*)* results in

*(+((JoinQuery({!join from=manu_id_s
to=id}text:ipod) (name:gb18030 -manu_id_s:*))~2))/no_coord

+(({!join from=manu_id_s to=id}text:ipod
(name:gb18030 -manu_id_s:*))~2)*






Re: issue transplanting standalone core into solrcloud (plus upgrade)

2016-09-26 Thread Jan Høydahl
Better to keep your old schema unchanged if you want to use an old index. The
upgrader does not change field types for you. If the old IntField does not exist
in 6.x you're out of luck; you may try to build a custom version with the old
field types as add-ons.

Sendt fra min iPhone

> Den 26. sep. 2016 kl. 14.28 skrev xavi jmlucjav :
> 
> Hi Shawn/Jan,
> 
>> On Sun, Sep 25, 2016 at 6:18 PM, Shawn Heisey  wrote:
>> 
>>> On 9/25/2016 4:24 AM, xavi jmlucjav wrote:
>>> Everything went well, no errors when solr restarted, the collections
>> shows
>>> the right number of docs. But when I try to run a query, I get:
>>> 
>>> null:java.lang.NullPointerException
>> 
>> Did you change any of the fieldType class values as you adjusted the
>> schema for the upgrade?  A number of classes that were valid and
>> deprecated in 3.6 and 4.x were completely removed by 5.x, and 6.x
>> probably removed a few more.
>> 
> 
> Yes, I had to change some fields, basically to use TrieIntField etc instead
> of the old IntField. I was assuming by using the IndexUpgrader to upgrade
> the data to 6.1, the older IntField would work with the new TrieIntField.
> But I have tried loading the upgraded data into a standalone 6.1 and I am
> hitting the same issue, so this is not related to _version_ field (more on
> that below). Forget about solrcloud for now, having an old 3.6 index,
> should it be possible to use IndexUpgrader and load it on 6.1? How would
> one need to handle IntFields etc?
> 
> 
> 
>> 
>> If you did make changes like this to your schema, then what's in the
>> index will no longer match the schema, and the *only* option is a
>> reindex.  Exceptions are likely if you don't reindex after schema
>> changes to the class value(s) or the index analyzer(s).
>> 
>> Regarding the _version_ field:  SolrCloud expects this field to be in
>> your schema.  It might also expect that that every document in the index
>> will already contain a value in this field.  Adding _version_ to your
>> schema should be treated similarly to the changes mentioned above -- a
>> reindex is required for proper operation.
>> 
>> Even if the schema didn't change in a way that *requires* a reindex ...
>> the number of changes to the analysis components across three major
>> version jumps is quite large.  Solr might not work as expected because
>> of those changes unless you reindex, even if you don't see any
>> exceptions.  Changes to your schema because of changes in analysis
>> component behavior might  be required -- which is another situation that
>> usually requires a reindex.
>> 
>> Because of these potential problems, I always start a new Solr version
>> with no index data and completely rebuild my indexes after an upgrade.
>> That is the best way to ensure success.
>> 
> 
> I am totally aware of all the advantages of reindexing, sure. And that is
> what I always do; this time, though, it seems the original data is not
> available...
> 
> 
>> You referenced a mailing list thread where somebody had success
>> converting non-cloud to cloud... but that was on version 4.8.1, two
>> major versions back from the version you're running.  They also did not
>> upgrade major versions -- from some things they said at the beginning of
>> the thread, I know that the source version was at least 4.4.  The thread
>> didn't mention any schema changes, either.
>> 
>> If the schema doesn't change at all, moving from non-cloud to cloud is
>> very possible, but if the schema changes, the index data might not match
>> the schema any more, and that situation will not work.
>> 
> Since you jumped three major versions, it's almost guaranteed that your
>> schema *did* change, and the changes may have been more extensive than
>> just adding the _version_ field.
>> 
>> It's possible that there's a problem when converting a non-cloud install
>> with no _version_ field to a cloud install where the only schema change
>> is adding the _version_ field.  We can treat THAT situation as a bug,
>> but if there are other schema changes besides adding _version_, the
>> exception you encountered is most likely not a bug.
>> 
> 
> 
> There are two orthogonal issues here:
> A. moving to solrcloud from  standalone without reindexing. And without
> having a _version_ field already indexed, of course. Is this even possible?
> From the thread above, I understood it was possible, but you say that
> solrcloud expects _version_ to be there, with values, so this makes this
> move totally impossible without a reindexing. This should be made clear
> somewhere in the doc. I understand it is not a frequent scenario, but will
> be a deal breaker when it happens. So far the only thing I found is the
> aforementioned thread, that if I am not misreading, makes it sound as it
> will work ok.
> 
> B. upgrading from a very old 3.6 version to 6.1 without reindexing: it
> seems like I am hitting an issue with this first. Even if this was
> resolved, I would not be able to achieve my 

Re: Challenges with new Solrcloud Backup/Restore functionality

2016-09-26 Thread Stephen Weiss
#2 - that's great news.  I'll try to patch it in and test it out.

#1 - In all cases, the backup and restore both appear successful.  There are no 
failure messages for any of the shards, no warnings, etc - I didn't even 
realize at first that data was missing until I noticed differences in some of 
the query results when we were testing.  Either manual restore of the data or 
using the restore API (with all data on one node), we see the same, so I think 
it's more a problem in the backup process than the restore process.

If there's any kind of debugging output we can provide that can help solve 
this, let me know.

--
Steve

On Sun, Sep 25, 2016 at 7:17 PM, Hrishikesh Gadre 
> wrote:
Hi Steve,

Regarding the 2nd issue, a JIRA is already created and patch is uploaded 
(SOLR-9527). Can someone review and commit the patch?

Regarding the 1st issue, does the backup command succeed? Also do you see any
warning/error log messages? How about the restore command?

Thanks
Hrishikesh



On Sat, Sep 24, 2016 at 12:14 PM, Stephen Weiss 
> wrote:
Hi everyone,

We're very excited about SolrCloud's new backup / restore collection APIs, 
which should introduce some major new efficiencies into our indexing workflow.  
Unfortunately, we've run into some snags with it that are preventing us from 
moving into production.  I was hoping someone on the list could help.

1) Data inconsistencies

There seems to be a problem getting all the data consistently.  Sometimes, the 
backup will contain all of the data in the collection, and sometimes, large 
portions of the collection (as much as 40%) will be missing.  We haven't quite 
figured out what might cause this yet, although one thing I've noticed is the 
chances of success are greater when we are only backing up one collection at a 
time.  Unfortunately, for our workflow, it will be difficult to make that work, 
and there still doesn't seem to be a guarantee of success either way.

2) Shards are not distributed

To make matters worse, for some reason, any type of restore operation always 
seems to put all shards of the collection on the same node.  We've tried 
setting maxShardsPerNode to 1 in the restore command, but this has no effect.  
We are seeing the same behavior on both 6.1 and 6.2.1.  No matter what we do, 
all the shards always go to the same node - and it's not even the node that we 
execute the restore request on, but oddly enough, a totally different node, and 
always the same one (the 4th one).  It should be noted that all nodes of our 8 
node cloud are up and totally functional when this happens.

To work around this, we wrote up a quick script to create an empty collection, 
which always distributes itself across the cloud quite well (another indication 
that there's nothing wrong with the nodes themselves), and then we rsync the 
individual shards' data into the empty shards and reload the collection.  This 
works fine, however, because of the data inconsistencies mentioned above, we 
can't really move forward anyway.


Problem #2, we have a reasonable workaround for, but problem #1 we do not.  If 
anyone has any thoughts about either of these problems, I would be very 
grateful.  Thanks!

--
Steve




Re: Spellcheck: using multiple dictionaries (DirectSolrSpellChecker and FileBasedSpellChecker)

2016-09-26 Thread Ryan Yacyshyn
Ok, thanks Andrey.



On Tue, 27 Sep 2016 at 00:13 Kydryavtsev Andrey  wrote:

> Hello, Ryan
>
>
> As is obvious from the exception message, all spell checkers that are
> combined must use the same Analyzer instance.
>
> How this instance is initialized inside the SpellChecker instance can be
> found here -
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/spelling/SolrSpellChecker.java#L65
>
> So one possibility to make it work is to use the same field for both spell
> checkers. solrconfig.xml could look like this:
>
>   <lst name="spellchecker">
>     <str name="name">default</str>
>     <str name="classname">solr.DirectSolrSpellChecker</str>
>     <str name="field">field_for_spell_check</str>
>     …
>   </lst>
>
>   <lst name="spellchecker">
>     <str name="name">wordbreak</str>
>     <str name="classname">solr.WordBreakSolrSpellChecker</str>
>     <str name="field">field_for_spell_check</str>
>     ….
>   </lst>
>
> 23.09.2016, 12:13, "Ryan Yacyshyn" :
> > Hi everyone,
> >
> > I'm looking at using two different implementations of spell checking
> > together: DirectSolrSpellChecker and FileBasedSpellChecker but I get the
> > following error:
> >
> > msg: "All checkers need to use the same Analyzer.",
> > trace: "java.lang.IllegalArgumentException: All checkers need to use the
> > same Analyzer. at
> >
> org.apache.solr.spelling.ConjunctionSolrSpellChecker.addChecker(ConjunctionSolrSpellChecker.java:79)
> > at
> >
> org.apache.solr.handler.component.SpellCheckComponent.getSpellChecker(SpellCheckComponent.java:603)
> > at
> >
> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:126)
> > at ...
> >
> > The source mentions that the "initial use-case was to use
> > WordBreakSolrSpellChecker in conjunction with the
> DirectSolrSpellChecker".
> >
> > If I make a query with only of the dictionaries (file or direct), they
> both
> > work fine, combining them into one query throws the error. I'm not sure
> if
> > I'm doing something wrong or if I just can't use these two together
> (yet).
> >
> > I'm using 6.2.0. Thanks for any help!
> >
> > Ryan
>


Re: Spellcheck: using multiple dictionaries (DirectSolrSpellChecker and FileBasedSpellChecker)

2016-09-26 Thread Kydryavtsev Andrey
Hello, Ryan


As is obvious from the exception message, all spell checkers that are combined
must use the same Analyzer instance.

How this instance is initialized inside the SpellChecker instance can be found
here -
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/spelling/SolrSpellChecker.java#L65

So one possibility to make it work is to use the same field for both spell
checkers. solrconfig.xml could look like this:

  
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="field">field_for_spell_check</str>
    …
  </lst>

  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">field_for_spell_check</str>
    ….
  </lst>
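
With both checkers registered against the same field, a request can then name
both dictionaries (a sketch; the query and handler are illustrative):

  /select?q=somequery&spellcheck=true&spellcheck.dictionary=default&spellcheck.dictionary=wordbreak&spellcheck.collate=true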

23.09.2016, 12:13, "Ryan Yacyshyn" :
> Hi everyone,
>
> I'm looking at using two different implementations of spell checking
> together: DirectSolrSpellChecker and FileBasedSpellChecker but I get the
> following error:
>
> msg: "All checkers need to use the same Analyzer.",
> trace: "java.lang.IllegalArgumentException: All checkers need to use the
> same Analyzer. at
> org.apache.solr.spelling.ConjunctionSolrSpellChecker.addChecker(ConjunctionSolrSpellChecker.java:79)
> at
> org.apache.solr.handler.component.SpellCheckComponent.getSpellChecker(SpellCheckComponent.java:603)
> at
> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:126)
> at ...
>
> The source mentions that the "initial use-case was to use
> WordBreakSolrSpellChecker in conjunction with the DirectSolrSpellChecker".
>
> If I make a query with only of the dictionaries (file or direct), they both
> work fine, combining them into one query throws the error. I'm not sure if
> I'm doing something wrong or if I just can't use these two together (yet).
>
> I'm using 6.2.0. Thanks for any help!
>
> Ryan


org.apache.lucene.index.CorruptIndexException: checksum failed

2016-09-26 Thread chauncey
Hi all,
I'm using Solr with the index stored on HDFS (CDH).
Solr version: 4.10.1
CDH version: 5.3
Here is the Solr console log:

auto commit error...:org.apache.solr.common.SolrException: Error opening new
searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:607)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed
(hardware problem?) : expected=91ca8e86 actual=b3ecc777
(resource=BufferedChecksumIndexInput(_rxm_Lucene41_0.tip))
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
at
org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
at
org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:125)
at
org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
at
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
at
org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:144)
at
org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:282)
at
org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3266)
at
org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3257)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:421)
at
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:292)
at
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:277)
at
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:251)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1476)
... 10 more

or:


auto commit error...:org.apache.solr.common.SolrException: Error opening new
searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:607)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer
mismatch: actual footer=-181212622 vs expected footer=-1071082520 (resource:
_ry0_Lucene41_0.doc)
at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:235)
at 
org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:228)
at
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.<init>(Lucene41PostingsReader.java:88)
at
org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:434)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
at
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
at
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
at

Re: -field1:value1 OR field2:value2

2016-09-26 Thread Sandeep Khanzode
Sure. Noted. 
Thanks for the link ...  SRK 

On Monday, September 26, 2016 8:29 PM, Erick Erickson 
 wrote:
 

 Please do not cross post to multiple lists, it's considered bad
etiquette.

Solr does not implement strict boolean logic, please read:

https://lucidworks.com/blog/2011/12/28/why-not-and-or-and-not/

Best,
Erick

On Mon, Sep 26, 2016 at 2:58 AM, Alexandre Rafalovitch
 wrote:
> I don't remember specifically :-(. Search the archives
> http://search-lucene.com/ or follow-up on Solr Users list. Remember to
> mention the version of Solr, as there were some bugs/features/fixes
> with OR, I think.
>
> Regards,
>  Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 26 September 2016 at 16:56, Sandeep Khanzode
>  wrote:
>> Hi Alex,
>> It seems that this is not an issue with AND clause. For example, if I do ...
>> field1:value1 AND -field2:value2
>> ... the results seem to be an intersection of both.
>> Is this an issue with OR? Which is which we replace it with an implicit (*:* 
>> NOT)? SRK
>>
>>    On Monday, September 26, 2016 3:09 PM, Sandeep Khanzode 
>> wrote:
>>
>>
>>  Yup. That works. So does (*:* NOT ...)
>> Thanks, Alex.  SRK
>>
>>    On Monday, September 26, 2016 3:03 PM, Alexandre Rafalovitch 
>> wrote:
>>
>>
>>  Try field2:value2 OR (*:* -field1:value1)
>>
>> There is a magic in negative query syntax that breaks down when it
>> gets more complex. It's been discussed on the mailing list a bunch of
>> times, though the discussions are hard to find by title.
>>
>> Regards,
>>    Alex.
>> 
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>>
>> On 26 September 2016 at 16:06, Sandeep Khanzode
>>  wrote:
>>> Hi,
>>> If I query for
>>> -field1=value1 ... I get, say, 100 records
>>> and if I query for
>>> field2:value2 ... I may get 200 records
>>>
>>> I would assume that if I query for
>>> -field1:value1 OR field2:value2
>>>
>>> ... I should get atleast 100 records (assuming they overlap, if not, upto 
>>> 300 records). I am assuming that the default joining is OR.
>>>  But I do not ...
>>> The result is that I get less than 100. If I didn't know better, I would 
>>> have said that an AND is being done.
>>>
>>> I am expecting records that EITHER do NOT contain field1:value1 OR which 
>>> contain field2:value2.
>>>
>>> Please let me know what I am missing. Thanks.
>>>
>>> SRK
>>
>>
>>
>>
>>


   

Re: json.facet without a facet ...

2016-09-26 Thread Yonik Seeley
On Mon, Sep 26, 2016 at 9:44 AM, Bram Van Dam  wrote:
> Howdy,
>
> I realize that this might be a strange question, so please bear with me
> here.
>
> I've been replacing my usage of the old Stats Component (stats=true,
> stats.field=foo, [stats.facet=bar]) with the new json.facet sugar. This
> has been a great improvement on all fronts.
>
> However, with the stats component I could calculate stats on a field
> *without* having to facet. The new json.facet API doesn't seem to
> support that in any way that I can see.

From http://yonik.com/json-facet-api/

Statistics are facets

Statistics are now fully integrated into faceting. Since we start off
with a single facet bucket with a domain defined by the main query and
filters, we can even ask for statistics for this top level bucket,
before breaking up into further buckets via faceting. Example:

json.facet={
  x : "avg(price)",           // the average of the price field will appear under "x"
  y : "unique(manufacturer)"  // the number of unique manufacturers will appear under "y"
}


-Yonik


Re: -field1:value1 OR field2:value2

2016-09-26 Thread Erick Erickson
Please do not cross post to multiple lists, it's considered bad
etiquette.

Solr does not implement strict boolean logic, please read:

https://lucidworks.com/blog/2011/12/28/why-not-and-or-and-not/

Best,
Erick

On Mon, Sep 26, 2016 at 2:58 AM, Alexandre Rafalovitch
 wrote:
> I don't remember specifically :-(. Search the archives
> http://search-lucene.com/ or follow-up on Solr Users list. Remember to
> mention the version of Solr, as there were some bugs/features/fixes
> with OR, I think.
>
> Regards,
>   Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 26 September 2016 at 16:56, Sandeep Khanzode
>  wrote:
>> Hi Alex,
>> It seems that this is not an issue with AND clause. For example, if I do ...
>> field1:value1 AND -field2:value2
>> ... the results seem to be an intersection of both.
>> Is this an issue with OR? Which is which we replace it with an implicit (*:* 
>> NOT)? SRK
>>
>> On Monday, September 26, 2016 3:09 PM, Sandeep Khanzode 
>>  wrote:
>>
>>
>>  Yup. That works. So does (*:* NOT ...)
>> Thanks, Alex.  SRK
>>
>> On Monday, September 26, 2016 3:03 PM, Alexandre Rafalovitch 
>>  wrote:
>>
>>
>>  Try field2:value2 OR (*:* -field1:value1)
>>
>> There is a magic in negative query syntax that breaks down when it
>> gets more complex. It's been discussed on the mailing list a bunch of
>> times, though the discussions are hard to find by title.
>>
>> Regards,
>> Alex.
>> 
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>>
>> On 26 September 2016 at 16:06, Sandeep Khanzode
>>  wrote:
>>> Hi,
>>> If I query for
>>> -field1=value1 ... I get, say, 100 records
>>> and if I query for
>>> field2:value2 ... I may get 200 records
>>>
>>> I would assume that if I query for
>>> -field1:value1 OR field2:value2
>>>
>>> ... I should get atleast 100 records (assuming they overlap, if not, upto 
>>> 300 records). I am assuming that the default joining is OR.
>>>  But I do not ...
>>> The result is that I get less than 100. If I didn't know better, I would 
>>> have said that an AND is being done.
>>>
>>> I am expecting records that EITHER do NOT contain field1:value1 OR which 
>>> contain field2:value2.
>>>
>>> Please let me know what I am missing. Thanks.
>>>
>>> SRK
>>
>>
>>
>>
>>


remove user defined duplicate from search result

2016-09-26 Thread Yongtao Liu
Hi,

I am trying to remove user-defined duplicates from the search result.

For example, the documents below match the query. When the query returns, I try
to remove doc3 from the result since it has a duplicate guid with doc1.

Id (uniqueKey)   guid
doc1             G1
doc2             G2
doc3             G1


To do this, I generate an exclude list based on the guid field terms.
For each term, we add every document after the first to the exclude list,
and add these docs to the QueryCommand filter.

Is there any better approach to handle this requirement?
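
For reference, there are built-in ways to get one-document-per-guid behavior
without patching SolrIndexSearcher, assuming guid is a single-valued, indexed
(or docValues) string field:

  fq={!collapse field=guid}                          (CollapsingQParserPlugin keeps one document per guid)
  group=true&group.field=guid&group.main=true        (result grouping, flattened back into a single list)

Both keep the top-ranked document per guid, which may or may not match the
"first indexed" semantics of the code change below.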


Below is the code change in SolrIndexSearcher.java

  private TreeMap<String, BitDocSet> dupDocs = null;

  public QueryResult search(QueryResult qr, QueryCommand cmd) throws 
IOException {
if (cmd.getUniqueField() != null)
{
  DocSet filter = getDuplicateByField(cmd.getUniqueField());
  if (cmd.getFilter() != null) cmd.getFilter().addAllTo(filter);
  cmd.setFilter(filter);
}

getDocListC(qr,cmd);

return qr;
  }

  private synchronized BitDocSet getDuplicateByField(String field) throws 
IOException
  {
if (dupDocs != null && dupDocs.containsKey(field)) {
  return dupDocs.get(field);
}

if (dupDocs == null)
{
  dupDocs = new TreeMap<String, BitDocSet>();
}

LeafReader reader = getLeafReader();

BitDocSet res = new BitDocSet(new FixedBitSet(maxDoc()));

Terms terms = reader.terms(field);

if (terms == null)
{
  dupDocs.put(field, res);
  return res;
}

TermsEnum termEnum = terms.iterator();
PostingsEnum docs = null;
BytesRef term = null;
while ((term = termEnum.next()) != null) {
  docs = termEnum.postings(docs, PostingsEnum.NONE);

  // skip first document
  docs.nextDoc();

  int docID = 0;
  while ((docID = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
  {
res.add(docID);
  }
}

dupDocs.put(field, res);
return res;
  }

Thanks,
Yongtao


RE: remove user defined duplicate from search result

2016-09-26 Thread Yongtao Liu
Sorry, the table was missing.
I have updated the email below with the table.

-Original Message-
From: Yongtao Liu [mailto:y...@commvault.com] 
Sent: Monday, September 26, 2016 10:47 AM
To: 'solr-user@lucene.apache.org'
Subject: remove user defined duplicate from search result

Hi,

I am trying to remove user-defined duplicates from the search result.

For example, the documents below match the query. When the query returns, I try
to remove doc3 from the result since it has a duplicate guid with doc1.

id (uniqueKey)   guid
doc1             G1
doc2             G2
doc3             G1

To do this, I generate an exclude list based on the guid field terms.
For each term, we add every document after the first to the exclude list,
and add these docs to the QueryCommand filter.

Is there any better approach to handle this requirement?


Below is the code change in SolrIndexSearcher.java

  private TreeMap<String, BitDocSet> dupDocs = null;

  public QueryResult search(QueryResult qr, QueryCommand cmd) throws 
IOException {
if (cmd.getUniqueField() != null)
{
  DocSet filter = getDuplicateByField(cmd.getUniqueField());
  if (cmd.getFilter() != null) cmd.getFilter().addAllTo(filter);
  cmd.setFilter(filter);
}

getDocListC(qr,cmd);

return qr;
  }

  private synchronized BitDocSet getDuplicateByField(String field) throws 
IOException
  {
if (dupDocs != null && dupDocs.containsKey(field)) {
  return dupDocs.get(field);
}

if (dupDocs == null)
{
  dupDocs = new TreeMap<String, BitDocSet>();
}

LeafReader reader = getLeafReader();

BitDocSet res = new BitDocSet(new FixedBitSet(maxDoc()));

Terms terms = reader.terms(field);

if (terms == null)
{
  dupDocs.put(field, res);
  return res;
}

TermsEnum termEnum = terms.iterator();
PostingsEnum docs = null;
BytesRef term = null;
while ((term = termEnum.next()) != null) {
  docs = termEnum.postings(docs, PostingsEnum.NONE);

  // skip first document
  docs.nextDoc();

  int docID = 0;
  while ((docID = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
  {
res.add(docID);
  }
}

dupDocs.put(field, res);
return res;
  }

Thanks,
Yongtao


json.facet without a facet ...

2016-09-26 Thread Bram Van Dam
Howdy,

I realize that this might be a strange question, so please bear with me
here.

I've been replacing my usage of the old Stats Component (stats=true,
stats.field=foo, [stats.facet=bar]) with the new json.facet sugar. This
has been a great improvement on all fronts.

However, with the stats component I could calculate stats on a field
*without* having to facet. The new json.facet API doesn't seem to
support that in any way that I can see. Which, admittedly, makes sense,
given the name.

Faceting on a random field and setting allBuckets:true kind of
approximates the behaviour I'm after, but that's pretty ugly and
difficult (because I don't know which field to facet on and it would
have to be present in all documents etc).

Is there any way to do this that I'm not seeing?

TL;DR; Trying to calculate statistics using json.facet without faceting.

Thanks,

 - Bram


Re: issue transplanting standalone core into solrcloud (plus upgrade)

2016-09-26 Thread xavi jmlucjav
Hi Shawn/Jan,

On Sun, Sep 25, 2016 at 6:18 PM, Shawn Heisey  wrote:

> On 9/25/2016 4:24 AM, xavi jmlucjav wrote:
> > Everything went well, no errors when solr restarted, the collections
> shows
> > the right number of docs. But when I try to run a query, I get:
> >
> > null:java.lang.NullPointerException
>
> Did you change any of the fieldType class values as you adjusted the
> schema for the upgrade?  A number of classes that were valid and
> deprecated in 3.6 and 4.x were completely removed by 5.x, and 6.x
> probably removed a few more.
>

Yes, I had to change some fields, basically to use TrieIntField etc instead
of the old IntField. I was assuming by using the IndexUpgrader to upgrade
the data to 6.1, the older IntField would work with the new TrieIntField.
But I have tried loading the upgraded data into a standalone 6.1 and I am
hitting the same issue, so this is not related to _version_ field (more on
that below). Forget about solrcloud for now, having an old 3.6 index,
should it be possible to use IndexUpgrader and load it on 6.1? How would
one need to handle IntFields etc?



>
> If you did make changes like this to your schema, then what's in the
> index will no longer match the schema, and the *only* option is a
> reindex.  Exceptions are likely if you don't reindex after schema
> changes to the class value(s) or the index analyzer(s).
>
> Regarding the _version_ field:  SolrCloud expects this field to be in
> your schema.  It might also expect that that every document in the index
> will already contain a value in this field.  Adding _version_ to your
> schema should be treated similarly to the changes mentioned above -- a
> reindex is required for proper operation.
>
> Even if the schema didn't change in a way that *requires* a reindex ...
> the number of changes to the analysis components across three major
> version jumps is quite large.  Solr might not work as expected because
> of those changes unless you reindex, even if you don't see any
> exceptions.  Changes to your schema because of changes in analysis
> component behavior might  be required -- which is another situation that
> usually requires a reindex.
>
> Because of these potential problems, I always start a new Solr version
> with no index data and completely rebuild my indexes after an upgrade.
> That is the best way to ensure success.
>

I am totally aware of all the advantages of reindexing, sure. And that is
what I always do; this time, though, it seems the original data is not
available...


> You referenced a mailing list thread where somebody had success
> converting non-cloud to cloud... but that was on version 4.8.1, two
> major versions back from the version you're running.  They also did not
> upgrade major versions -- from some things they said at the beginning of
> the thread, I know that the source version was at least 4.4.  The thread
> didn't mention any schema changes, either.
>
> If the schema doesn't change at all, moving from non-cloud to cloud is
> very possible, but if the schema changes, the index data might not match
> the schema any more, and that situation will not work.
>
> Since you jumped three major versions, it's almost guaranteed that your
> schema *did* change, and the changes may have been more extensive than
> just adding the _version_ field.
>
> It's possible that there's a problem when converting a non-cloud install
> with no _version_ field to a cloud install where the only schema change
> is adding the _version_ field.  We can treat THAT situation as a bug,
> but if there are other schema changes besides adding _version_, the
> exception you encountered is most likely not a bug.
>


There are two orthogonal issues here:
A. Moving to SolrCloud from standalone without reindexing, and without
having a _version_ field already indexed, of course. Is this even possible?
From the thread above, I understood it was possible, but you say that
SolrCloud expects _version_ to be there, with values, so this makes the move
totally impossible without reindexing. This should be made clear somewhere
in the docs. I understand it is not a frequent scenario, but it will be a
deal breaker when it happens. So far the only thing I found is the
aforementioned thread, which, if I am not misreading it, makes it sound as if
it will work OK.

B. Upgrading from a very old 3.6 version to 6.1 without reindexing: it seems
I am hitting an issue with this first. Even if this were resolved, I would not
be able to achieve my goal because of A, but it would be good to know how to
get this done too, if possible.

Jan: I tried tweaking luceneMatchVersion too, no luck though.
xavier


>
> Thanks,
> Shawn
>
>


Re: Retaining a field value during DataImport

2016-09-26 Thread Selvam
Hi,

Thanks, I will look into the options specified.


On Mon, Sep 26, 2016 at 4:35 PM, Alexandre Rafalovitch 
wrote:

> Transformers do not see what's in the Solr index, they are too early
> in the processing chain.
>
> You could probably do something by exporting that field's value,
> caching it and injecting it back with transformer from that cache.
> Messy but doable.
>
> UpdateRequestProcessor would be able to do it, but your request from
> DIH is coming as a new document, not an update. So the old one would
> be overridden.
>
> SOLR-9530 could be an answer to that, but it is just a design so far -
> no implementation. You could write one yourself or see if showing
> excitement on the JIRA and being ready to debug the patch would get
> the committer's attention.
>
>
> Regards,
> Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 26 September 2016 at 17:36, Selvam  wrote:
> > Hi All,
> >
> > We use DataImportHandler to import data from Redshift. We want to overwrite
> > some 250M existing records (that have around 350 columns) while retaining
> > the field value of only one column in those 250M records. The reason is
> > that one column is multi-valued and requires a costly query to build those
> > values again.
> >
> > I learned about Transformers, but I am not sure whether it is possible to
> > get the old document's value during that process. Any help would be
> > appreciated.
> >
> >
> > --
> > Regards,
> > Selvam
>



-- 
Regards,
Selvam
KnackForge 


Re: Retaining a field value during DataImport

2016-09-26 Thread Alexandre Rafalovitch
Transformers do not see what's in the Solr index, they are too early
in the processing chain.

You could probably do something by exporting that field's value,
caching it and injecting it back with transformer from that cache.
Messy but doable.
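
A rough sketch of what that could look like as a custom DIH transformer
(class, field and file names are made up, the cache file is assumed to be a
pre-exported id,value dump, and it assumes DIH runs single-threaded):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class CachedFieldTransformer extends Transformer {

  // id -> previously indexed values, loaded once from the exported dump
  private static Map<String, List<String>> cache;

  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    try {
      if (cache == null) {
        cache = new HashMap<>();
        try (BufferedReader r =
                 new BufferedReader(new FileReader("/data/old_values.csv"))) {
          String line;
          while ((line = r.readLine()) != null) {
            String[] parts = line.split(",", 2);
            cache.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
          }
        }
      }
      Object id = row.get("id");
      if (id != null && cache.containsKey(id.toString())) {
        // Inject the cached multi-valued field back into the incoming row
        row.put("expensive_multivalued_field", cache.get(id.toString()));
      }
      return row;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}

It would be registered on the entity with something like
transformer="com.example.CachedFieldTransformer" (hypothetical package name).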

UpdateRequestProcessor would be able to do it, but your request from
DIH is coming as a new document, not an update. So the old one would
be overridden.

SOLR-9530 could be an answer to that, but it is just a design so far -
no implementation. You could write one yourself or see if showing
excitement on the JIRA and being ready to debug the patch would get
the committer's attention.


Regards,
Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 26 September 2016 at 17:36, Selvam  wrote:
> Hi All,
>
> We use DataImportHandler to import data from Redshift. We want to overwrite
> some 250M existing records (that have around 350 columns) while retaining
> the field value of only one column in those 250M records. The reason is
> that one column is multi-valued and requires a costly query to build those
> values again.
>
> I learned about Transformers, but I am not sure whether it is possible to get
> the old document's value during that process. Any help would be appreciated.
>
>
> --
> Regards,
> Selvam


Retaining a field value during DataImport

2016-09-26 Thread Selvam
Hi All,

We use DataImportHandler to import data from Redshift. We want to overwrite
some 250M existing records (that have around 350 columns) while retaining
the field value of only one column in those 250M records. The reason is
that one column is multi-valued and requires a costly query to build those
values again.

I learned about Transformers, but I am not sure whether it is possible to get
the old document's value during that process. Any help would be appreciated.


-- 
Regards,
Selvam


Re: JNDI settings

2016-09-26 Thread Aristedes Maniatis
On 21/09/2016 9:15pm, Aristedes Maniatis wrote:
> On 13/09/2016 1:29am, Aristedes Maniatis wrote:
>> I am using Solr 5.5 and want to add JNDI settings to Solr (for data 
>> import). I'm new to the SolrCloud setup (previously I was running Solr as 
>> a custom bundled war) so I can't figure out where to put the JNDI settings 
>> with the user/pass themselves.
>>
>> I don't want to add it to jetty.xml because that's part of the packaged 
>> application which will be upgraded from time to time.
>>
>> Should it go into solr.xml inside the solr.home directory? If so, what's the 
>> right syntax there?
> 
> 
> Just a follow-up on this question. Does anyone know how I can add JNDI 
> settings to Solr without overwriting parts of the application itself?
> 
> Cheers
> Ari


Am I approaching this the wrong way? Where do other people put 
username/password details for JDBC connections to the source data store?

Is there a completely different, better approach, such as storing the JDBC 
connection details in ZooKeeper?
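
For what it's worth, the closest workaround I can think of (not JNDI, and the
driver, host and parameter names below are only examples) is to keep the
credentials in DIH's data-config.xml, which lives in the core's conf directory
rather than in the packaged application:

<dataSource type="JdbcDataSource"
            driver="org.postgresql.Driver"
            url="jdbc:postgresql://dbhost:5432/source"
            user="${dataimporter.request.dbuser}"
            password="${dataimporter.request.dbpassword}"/>

I believe request parameters like ${dataimporter.request.dbpassword} get
resolved there, so the credentials could also be passed on the full-import
request instead of being stored in the file, but I have not verified that.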


Ari





-- 
-->
Aristedes Maniatis
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A


Re: issue transplanting standalone core into solrcloud (plus upgrade)

2016-09-26 Thread Jan Høydahl
Did you change the luceneMatchVersion tag in your solrconfig.xml?
You could try letting it stay at 3.6 and let compatibility mode kick in where 
applicable.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 25. sep. 2016 kl. 12.24 skrev xavi jmlucjav :
> 
> Hi,
> 
> I have an existing 3.6 standalone installation. It has to be moved to
> Solrcloud 6.1.0. Reindexing is not an option, so I did the following:
> 
> - Use IndexUpgrader to upgrade 3.6 -> 4.4 -> 5.5. I did not upgrade to 6.X
> as 5.5 should be readable by 6.x
> - Install solrcloud 6.1 cluster
> - modify schema/solrconfig for cloud support (add _version_, tlog etc)
> - follow the method mentioned here
> http://lucene.472066.n3.nabble.com/Copy-existing-index-from-standalone-Solr-to-Solr-cloud-td4149920.html
> I did not find any other doc on how to transplant a standalone core into
> SolrCloud.
> 
> Everything went well, no errors when solr restarted, the collections shows
> the right number of docs. But when I try to run a query, I get:
> 
> null:java.lang.NullPointerException
> at
> org.apache.lucene.util.LegacyNumericUtils.prefixCodedToLong(LegacyNumericUtils.java:189)
> at org.apache.solr.schema.TrieField.toObject(TrieField.java:155)
> at org.apache.solr.schema.TrieField.write(TrieField.java:324)
> at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:133)
> at
> org.apache.solr.response.JSONWriter.writeSolrDocument(JSONResponseWriter.java:345)
> at
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:249)
> at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:151)
> at
> org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
> at
> org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
> at
> org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)
> at
> org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)
> at
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:65)
> at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:731)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:473)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
> 
> I was wondering how the non-existence of the _version_ field would be
> handled, but the thread above said it would work.
> Can anyone shed some light?
> 
> thanks



Query validation before the parser.

2016-09-26 Thread Modassar Ather
Hi,

Wildcard and similarly heavy queries are expensive in terms of execution time
and resources. There can also be errors in user-entered queries.

I am trying to write a query-validation feature which checks for wrong
grouping, unsupported fields, and special characters in the query that have
not been escaped.
The same feature will also check the complexity of the query and its possible
impact on resources.

Kindly share your thoughts on what factors should be considered when
validating a query for general errors, and when assessing its complexity and
impact on resources.
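
To make this concrete, the kind of pre-parse checks I have in mind look
roughly like the sketch below (the field whitelist and the patterns are only
placeholders, not a finished implementation):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryValidator {

  // Placeholder whitelist of queryable fields
  private static final Set<String> ALLOWED_FIELDS =
      new HashSet<>(Arrays.asList("title", "body", "author"));

  // A term starting with * or ? forces a full term enumeration
  private static final Pattern LEADING_WILDCARD =
      Pattern.compile("(^|[\\s(:])[*?]");

  public static void validate(String q) {
    // 1. Wrong grouping: unbalanced parentheses
    int depth = 0;
    for (char c : q.toCharArray()) {
      if (c == '(') depth++;
      if (c == ')') depth--;
      if (depth < 0) throw new IllegalArgumentException("Unbalanced ')' in query");
    }
    if (depth != 0) throw new IllegalArgumentException("Unbalanced '(' in query");

    // 2. Unsupported fields
    Matcher fields = Pattern.compile("([A-Za-z_]\\w*):").matcher(q);
    while (fields.find()) {
      if (!ALLOWED_FIELDS.contains(fields.group(1))) {
        throw new IllegalArgumentException("Unsupported field: " + fields.group(1));
      }
    }

    // 3. Expensive constructs: leading wildcards
    if (LEADING_WILDCARD.matcher(q).find()) {
      throw new IllegalArgumentException("Leading wildcards are not allowed");
    }
  }
}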

Thanks,
Modassar


Re: -field1:value1 OR field2:value2

2016-09-26 Thread Alexandre Rafalovitch
I don't remember specifically :-(. Search the archives at
http://search-lucene.com/ or follow up on the Solr Users list. Remember to
mention the version of Solr, as there were some bugs/features/fixes
around OR, I think.

Regards,
  Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 26 September 2016 at 16:56, Sandeep Khanzode
 wrote:
> Hi Alex,
> It seems that this is not an issue with the AND clause. For example, if I do ...
> field1:value1 AND -field2:value2
> ... the results seem to be an intersection of both.
> Is this an issue only with OR, which is why we replace it with an implicit
> (*:* NOT)? SRK
>
> On Monday, September 26, 2016 3:09 PM, Sandeep Khanzode 
>  wrote:
>
>
>  Yup. That works. So does (*:* NOT ...)
> Thanks, Alex.  SRK
>
> On Monday, September 26, 2016 3:03 PM, Alexandre Rafalovitch 
>  wrote:
>
>
>  Try field2:value2 OR (*:* -field1=value1)
>
> There is a magic in negative query syntax that breaks down when it
> gets more complex. It's been discussed on the mailing list a bunch of
> times, though the discussions are hard to find by title.
>
> Regards,
> Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 26 September 2016 at 16:06, Sandeep Khanzode
>  wrote:
>> Hi,
>> If I query for
>> -field1=value1 ... I get, say, 100 records
>> and if I query for
>> field2:value2 ... I may get 200 records
>>
>> I would assume that if I query for
>> -field1:value1 OR field2:value2
>>
>> ... I should get atleast 100 records (assuming they overlap, if not, upto 
>> 300 records). I am assuming that the default joining is OR.
>>  But I do not ...
>> The result is that I get less than 100. If I didn't know better, I would 
>> have said that an AND is being done.
>>
>> I am expecting records that EITHER do NOT contain field1:value1 OR which 
>> contain field2:value2.
>>
>> Please let me know what I am missing. Thanks.
>>
>> SRK
>
>
>
>
>


Re: -field1:value1 OR field2:value2

2016-09-26 Thread Sandeep Khanzode
Hi Alex,
It seems that this is not an issue with the AND clause. For example, if I do ...
field1:value1 AND -field2:value2 
... the results seem to be an intersection of both.
Is this an issue only with OR, which is why we replace it with an implicit
(*:* NOT)? SRK 

On Monday, September 26, 2016 3:09 PM, Sandeep Khanzode 
 wrote:
 

 Yup. That works. So does (*:* NOT ...)
Thanks, Alex.  SRK 

    On Monday, September 26, 2016 3:03 PM, Alexandre Rafalovitch 
 wrote:
 

 Try field2:value2 OR (*:* -field1=value1)

There is a magic in negative query syntax that breaks down when it
gets more complex. It's been discussed on the mailing list a bunch of
times, though the discussions are hard to find by title.

Regards,
    Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 26 September 2016 at 16:06, Sandeep Khanzode
 wrote:
> Hi,
> If I query for
> -field1=value1 ... I get, say, 100 records
> and if I query for
> field2:value2 ... I may get 200 records
>
> I would assume that if I query for
> -field1:value1 OR field2:value2
>
> ... I should get atleast 100 records (assuming they overlap, if not, upto 300 
> records). I am assuming that the default joining is OR.
>  But I do not ...
> The result is that I get less than 100. If I didn't know better, I would have 
> said that an AND is being done.
>
> I am expecting records that EITHER do NOT contain field1:value1 OR which 
> contain field2:value2.
>
> Please let me know what I am missing. Thanks.
>
> SRK


  

   

Re: -field1:value1 OR field2:value2

2016-09-26 Thread Sandeep Khanzode
Yup. That works. So does (*:* NOT ...)
Thanks, Alex.  SRK 

On Monday, September 26, 2016 3:03 PM, Alexandre Rafalovitch 
 wrote:
 

 Try field2:value2 OR (*:* -field1=value1)

There is a magic in negative query syntax that breaks down when it
gets more complex. It's been discussed on the mailing list a bunch of
times, though the discussions are hard to find by title.

Regards,
    Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 26 September 2016 at 16:06, Sandeep Khanzode
 wrote:
> Hi,
> If I query for
> -field1=value1 ... I get, say, 100 records
> and if I query for
> field2:value2 ... I may get 200 records
>
> I would assume that if I query for
> -field1:value1 OR field2:value2
>
> ... I should get atleast 100 records (assuming they overlap, if not, upto 300 
> records). I am assuming that the default joining is OR.
>  But I do not ...
> The result is that I get less than 100. If I didn't know better, I would have 
> said that an AND is being done.
>
> I am expecting records that EITHER do NOT contain field1:value1 OR which 
> contain field2:value2.
>
> Please let me know what I am missing. Thanks.
>
> SRK


   

Re: -field1:value1 OR field2:value2

2016-09-26 Thread Alexandre Rafalovitch
Try field2:value2 OR (*:* -field1:value1)

There is some magic in negative-query handling that breaks down when the
query gets more complex. It's been discussed on the mailing list a bunch of
times, though the discussions are hard to find by title.
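
Roughly what I believe happens (field names as in your example):

q=-field1:value1                         rewritten internally to
                                         (*:* -field1:value1), so it matches
q=-field1:value1 OR field2:value2        the negative clause only excludes,
                                         so this behaves like field2:value2
                                         minus the field1:value1 docs
q=field2:value2 OR (*:* -field1:value1)  explicit *:* base set; returns the
                                         expected union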

Regards,
Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 26 September 2016 at 16:06, Sandeep Khanzode
 wrote:
> Hi,
> If I query for
> -field1=value1 ... I get, say, 100 records
> and if I query for
> field2:value2 ... I may get 200 records
>
> I would assume that if I query for
> -field1:value1 OR field2:value2
>
> ... I should get atleast 100 records (assuming they overlap, if not, upto 300 
> records). I am assuming that the default joining is OR.
>  But I do not ...
> The result is that I get less than 100. If I didn't know better, I would have 
> said that an AND is being done.
>
> I am expecting records that EITHER do NOT contain field1:value1 OR which 
> contain field2:value2.
>
> Please let me know what I am missing. Thanks.
>
> SRK


-field1:value1 OR field2:value2

2016-09-26 Thread Sandeep Khanzode
Hi,
If I query for 
-field1:value1 ... I get, say, 100 records
and if I query for 
field2:value2 ... I may get 200 records

I would assume that if I query for 
-field1:value1 OR field2:value2

... I should get at least 100 records (assuming they overlap; if not, up to
300 records). I am assuming that the default joining is OR.
 But I do not ... 
The result is that I get fewer than 100. If I didn't know better, I would have 
said that an AND is being done.

I am expecting records that EITHER do NOT contain field1:value1 OR which 
contain field2:value2.

Please let me know what I am missing. Thanks.

SRK

Re: How to retrieve parent documents without a nested structure (block-join)

2016-09-26 Thread Alexandre Rafalovitch
This seems to work against the techproducts example in 6.2:
({!join from=manu_id_s to=id}ipod)  (name:GB18030 -manu_id_s:*)

Two clauses: the first one does the join and maps children to parents. The
second one looks at the records that don't have the mapping key at all and
runs the match against those. In your case, the second clause's keyword is
probably the same as in the first one.

The brackets are there to ensure the clauses are split correctly. Use
debugQuery to confirm what happens.
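
Spelled out as a full request against techproducts (parameter value shown
unencoded for readability; add your own fl etc.):

http://localhost:8983/solr/techproducts/select?q=({!join from=manu_id_s to=id}ipod) (name:GB18030 -manu_id_s:*)&debugQuery=true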

Regards
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 26 September 2016 at 10:02, shamik  wrote:
> Thanks for getting back on this. I was trying to formulate a query along
> similar lines but have not been able to construct it (multiple clauses)
> correctly so far. That can be attributed to my inexperience with Solr
> queries as well. Can you please point me to any documentation / examples
> for reference?
>
> Appreciate your help.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-retrieve-parent-documents-without-a-nested-structure-block-join-tp4297510p4297951.html
> Sent from the Solr - User mailing list archive at Nabble.com.