Re: Reload schema or configs failed then drop index, can not recreate that index.

2016-11-22 Thread Erick Erickson
The mail server is pretty heavy-handed at deleting attachments; none of your
(presumably) screenshots came through.

You also haven't told us what version of Solr you're using.

Best,
Erick

On Tue, Nov 22, 2016 at 6:25 PM, Jerome Yang  wrote:
> Sorry, wrong message.
> To correct.
>
> In cloud mode.
>
>    1. I created a collection called "test" and then modified the
>    managed-schema, writing something wrong (for example, "id"); the
>    collection reload then failed.
>    2. Then I dropped the collection "test" and deleted its configs from
>    zookeeper. That works fine: the collection is removed both from
>    zookeeper and from the hard disk.
>    3. Uploaded the corrected configs under the same name as before and
>    tried to create a collection named "test" again; it failed with the
>    error "core with name '*' already exists", although no such core
>    actually exists.
>    4. Then restarted the whole cluster, did the create again, and
>    everything worked fine.
>
>
> I think that when the collection is deleted, something is still held
> somewhere and not cleaned up.
> Please have a look.
>
> Regards,
> Jerome
>
> On Wed, Nov 23, 2016 at 10:16 AM, Jerome Yang  wrote:
>
>> Hi all,
>>
>>
>> Here's my situation:
>>
>> In cloud mode.
>>
>>    1. I created a collection called "test" and then modified the
>>    managed-schema. I got an error as shown in picture 2.
>>    2. To get the full error message, I checked the solr logs and got the
>>    message shown in picture 3.
>>    3. If I corrected the managed-schema, everything was fine. But once I
>>    dropped the index, it couldn't be created again, as in picture 4. After
>>    I restarted gptext using "gptext-start -r" and recreated the index, it
>>    was created successfully, as in picture 5.
>>
>>


Re: Comparing a Date value in solr

2016-11-22 Thread Erick Erickson
I wouldn't do it this way, it's far more complex than you need. Try
fq=Startdate__D:[NOW/DAY-7DAYS TO NOW/DAY+1DAY].

Why the weird NOW/DAY+1DAY? Well, that makes fq clauses far
more likely to be reused, see:
https://lucidworks.com/blog/2012/02/23/date-math-now-and-filter-queries/
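
As a concrete sketch (assuming the default port, a collection named
"mycollection", and your field name StartDate__d), the full request would
look something like:

http://localhost:8983/solr/mycollection/select?q=*:*&fq=StartDate__d:[NOW/DAY-7DAYS%20TO%20NOW/DAY%2B1DAY]

Note that in a raw URL the space and the "+" must be encoded (%20 and %2B);
from SolrJ or the admin UI you can type the fq as-is.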

Best,
Erick

On Tue, Nov 22, 2016 at 7:29 PM, Sadheera Vithanage  wrote:
> Hi All,
>
> I am struggling to get the difference of 2 days and return the matching
> documents.
>
> I got the below function query to work, however I am unable to pass a
> fieldname for *u* in the frange function.
>
> {!frange l=0 u=86400}ms(NOW,StartDate__d)
>
>
> What I really want to do is compare the start date with today's date and
> return the documents that fall within a date range, for example 7 days.
>
> Thank you.
>
> --
> Regards
>
> Sadheera Vithanage


Re: negation search help

2016-11-22 Thread Alexandre Rafalovitch
How do you _know_ it is not 'apparent'? Is it because it is preceded by
the keyword 'no'? Just that keyword? At what maximum distance?

Regards,
   Alex

On 23 Nov 2016 2:59 PM, "Hem Naidu" 
wrote:

> Gurus,
>
> I am new to Solr. I have a requirement to index entire pdf/word documents
> using Solr Tika, which was successful, and I am able to get the search
> results displayed. Now I need to fine-tune the results or adjust the index
> so that negative statements are filtered out of the results. For example,
> my input text for indexing from the documents would be
> ---
> Fortunately no concurrent trauma was found
> In no apparent distress
> --
>
> If a user searches for concurrent trauma or distress, the search engine
> should filter these results out, as the symptom is not apparent.
>
> Any help on whether Solr can do this?
> If so, do I need to adjust the index or build custom queries?
>
> Any help on this would be greatly appreciated !
>
> Thanks
>
>
>


negation search help

2016-11-22 Thread Hem Naidu
Gurus,

I am new to Solr. I have a requirement to index entire pdf/word documents
using Solr Tika, which was successful, and I am able to get the search
results displayed. Now I need to fine-tune the results or adjust the index
so that negative statements are filtered out of the results. For example,
my input text for indexing from the documents would be
---
Fortunately no concurrent trauma was found
In no apparent distress
--

If a user searches for concurrent trauma or distress, the search engine
should filter these results out, as the symptom is not apparent.

Any help on whether Solr can do this? 
If so, do I need to adjust the index or build custom queries?

Any help on this would be greatly appreciated !

Thanks
 



Comparing a Date value in solr

2016-11-22 Thread Sadheera Vithanage
Hi All,

I am struggling to get the difference of 2 days and return the matching
documents.

I got the below function query to work, however I am unable to pass a
fieldname for *u* in the frange function.

{!frange l=0 u=86400}ms(NOW,StartDate__d)


What I really want to do is compare the start date with today's date and
return the documents that fall within a date range, for example 7 days.

Thank you.

-- 
Regards

Sadheera Vithanage


Re: Solr 6 Performance Suggestions

2016-11-22 Thread Jerome Yang
Have you run IndexUpgrader?

Index Format Changes

Solr 6 has no support for reading Lucene/Solr 4.x and earlier indexes.  Be
sure to run the Lucene IndexUpgrader included with Solr 5.5 if you might
still have old 4x formatted segments in your index. Alternatively: fully
optimize your index with Solr 5.5 to make sure it consists only of one
up-to-date index segment.
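
For reference, a minimal sketch of running the upgrader from the command
line (the jar versions and the index path are assumptions; point it at the
actual data/index directory of each core while Solr is stopped):

java -cp lucene-core-5.5.2.jar:lucene-backward-codecs-5.5.2.jar \
     org.apache.lucene.index.IndexUpgrader -verbose /var/solr/data/mycore/data/index

The upgrader rewrites all segments in the current index format, after which
Solr 6 can open the index.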

Regards,
Jerome

On Tue, Nov 22, 2016 at 10:48 PM, Yonik Seeley  wrote:

> It depends highly on what your requests look like, and which ones are
> slower.
> If your request mix is heterogeneous, find the types of requests
> that seem to have the largest slowdown and let us know what they look
> like.
>
> -Yonik
>
>
> On Tue, Nov 22, 2016 at 8:54 AM, Max Bridgewater
>  wrote:
> > I migrated an application from Solr 4 to Solr 6.  solrconfig.xml  and
> > schema.xml are essentially the same. The JVM params are also pretty much
> > similar.  The indices each have about 2 million documents. No particular
> > tuning was done to Solr 6 beyond the default settings. Solr 4 is running
> > in Tomcat 7.
> >
> > Early results seem to show Solr 4 outperforming Solr 6. The first shows
> > an average response time of 280 ms while the second averages at 430 ms. The
> > test cases were exactly the same, the machines were exactly the same and
> > heap settings exactly the same (Xms24g, Xmx24g). Requests were sent with
> > Jmeter with 50 concurrent threads for 2h.
> >
> > I know that this is not enough information to claim that Solr 4 generally
> > outperforms Solr 6. I also know that this pretty much depends on what the
> > application does. So I am not claiming anything general. All I want to do
> > is get some input before I start digging.
> >
> > What are some things I could tune to improve the numbers for Solr 6? Have
> > you guys experienced such discrepancies?
> >
> > Thanks,
> > Max.
>


Re: Reload schema or configs failed then drop index, can not recreate that index.

2016-11-22 Thread Jerome Yang
Sorry, wrong message.
To correct.

In cloud mode.

   1. I created a collection called "test" and then modified the
   managed-schema, writing something wrong (for example, "id"); the
   collection reload then failed.
   2. Then I dropped the collection "test" and deleted its configs from
   zookeeper. That works fine: the collection is removed both from
   zookeeper and from the hard disk.
   3. Uploaded the corrected configs under the same name as before and
   tried to create a collection named "test" again; it failed with the
   error "core with name '*' already exists", although no such core
   actually exists.
   4. Then restarted the whole cluster, did the create again, and
   everything worked fine.


I think that when the collection is deleted, something is still held
somewhere and not cleaned up.
Please have a look.
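
(For reference, steps 2 and 3 correspond to Collections API calls like the
following; a sketch assuming the default port and a configset also named
"test".)

curl "http://localhost:8983/solr/admin/collections?action=DELETE&name=test"
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test&numShards=1&collection.configName=test"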

Regards,
Jerome

On Wed, Nov 23, 2016 at 10:16 AM, Jerome Yang  wrote:

> Hi all,
>
>
> Here's my situation:
>
> In cloud mode.
>
>    1. I created a collection called "test" and then modified the
>    managed-schema. I got an error as shown in picture 2.
>    2. To get the full error message, I checked the solr logs and got the
>    message shown in picture 3.
>    3. If I corrected the managed-schema, everything was fine. But once I
>    dropped the index, it couldn't be created again, as in picture 4. After
>    I restarted gptext using "gptext-start -r" and recreated the index, it
>    was created successfully, as in picture 5.
>
>


Reload schema or configs failed then drop index, can not recreate that index.

2016-11-22 Thread Jerome Yang
Hi all,


Here's my situation:

In cloud mode.

   1. I created a collection called "test" and then modified the
   managed-schema. I got an error as shown in picture 2.
   2. To get the full error message, I checked the solr logs and got the
   message shown in picture 3.
   3. If I corrected the managed-schema, everything was fine. But once I
   dropped the index, it couldn't be created again, as in picture 4. After I
   restarted gptext using "gptext-start -r" and recreated the index, it was
   created successfully, as in picture 5.


Re: How to support facet values in search term

2016-11-22 Thread shamik
Thanks for the pointer, Alex. I'll go through all four articles; Thanksgiving
will be fun :-)





Re: How many versions do you stay behind in production for better stability?

2016-11-22 Thread Shawn Heisey
On 11/17/2016 12:25 AM, Dorian Hoxha wrote:
> I see that there is a new release on every lucene release. Do you
> always use the latest version, given that it may have bugs (e.g. most
> cassandra production deployments are old compared to the latest
> `stable` version because newer ones aren't trusted yet)? How far
> behind do you usually stay? (e.g. 6.3 just came out, and you need to
> be in production after 1 month; will you upgrade on dev if you don't
> need any new feature?)

Each new release is considered stable.  The stable branch (currently
branch_6x) is used to create a minor version branch, and the minor
version branch is used to build and tag the release.  Strong efforts are
made in both Lucene and Solr code to ensure that the stable branch lives
up to its name.  This isn't always successful, but it IS a goal.  The
master branch, currently building 7.0-SNAPSHOT, is the playground for
potentially unstable changes.

As for what I have running:  I try to stay current in my dev
environment, but due to compatibility issues with a third-party plugin
we are using, I usually can't.

I don't have any specific restrictions on being X versions back in
production.  Production gets upgraded on the secondaries first.  I will
typically only upgrade primary production, and not to the latest
release, when the following are all true:

* The version in question has run without trouble on the dev server.
* The staging environments successfully use the dev server.
* We have support for the version in the third-party plugin.

Because the secondaries are upgraded before the primaries, we have the
ability to quickly switch between the versions by disabling the
primaries and letting the load balancers shift the traffic.

This whole process can take a while, as you might imagine.  Our primary
production servers are still running 4.x versions.  The secondaries and
the dev server are running 5.x.

The particular version that we have in dev (5.3.2-SNAPSHOT) is a version
that has known performance regressions compared to 4.x, so I am not in a
hurry on this cycle to upgrade.  I'm waiting for the third-party plugin
to support something newer.

Thanks,
Shawn



Re: How to support facet values in search term

2016-11-22 Thread Alexandre Rafalovitch
This looks similar to what was discussed by Ted Sullivan at:
https://lucidworks.com/blog/author/tedsullivan/ (AutoFilter)

There is a repository somewhere linked from an article as well. I'd
start from the beginning, as articles get increasingly more esoteric
towards the end :-)

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 23 November 2016 at 06:29, Shamik Bandopadhyay  wrote:
> Hi,
>
>   I'm looking for some suggestions on enabling search terms to include
> facet fields as well. In my use case, we've a bunch of product and
> corresponding release fields which are explicitly used as facets. But what
> we are observing is that end users tend to use the product name as part of
> the search term, instead of filtering the product from the facet itself.
> For e.g. we've "Product A" and "Product B", each having release 2016, 2017.
> A common user search appears to be "Product A service pack". Since Product
> A is not part of the search fields (typically text, title, keyword, etc),
> it's not returning any data.
>
> We've a large set of facet fields, and I would ideally like to avoid adding
> them to the searchable list. Just wondering if there's a better way to
> handle this situation. Any pointers will be appreciated.
>
> Thanks,
> Shamik


Re: Solr6 CDCR - indexing doc to source replicates to target, doc not searchable in target

2016-11-22 Thread Erick Erickson
Well, it would be awkward to get right. At what point
should a commit be sent? When the queue was
empty? That could be after every document
when incoming docs were slow. And what about the
other shards? CDCR doesn't really know when _all_
the docs in _all_ the shards have been sent. On the
autocommit interval on the source? I guess this latter
is possible, but why bother when you can do the
same with a well tested mechanism? And if the source
did send commits, how would those interact with
autocommits on the target? We discourage clients
sending explicit commits after all.

In short, it doesn't seem like it adds enough value
to be worth it.

Best,
Erick

On Tue, Nov 22, 2016 at 12:49 PM, gayatri.umesh
 wrote:
> Thank you again Erick. Added autoSoftCommit settings in target solrconfig and
> it works now.
>
> As CDCR does not auto-commit on the target upon replication, is there a
> specific reason for this?
>
> Thanks,
> Gayatri
>
>
>


RE: Partial replication blocks subsequent requests when using solrcloud and master/slave replication

2016-11-22 Thread Jeremy Hoy
Thanks for your response Erick.

Yes, this looks like the issue we're seeing.  I'll paste the relevant parts
of my email into the issue.  The scenario you're describing is a little
different to ours, so I don't think the naïve patch we have necessarily
suits.  I'll do a bit more research and see if I can put something better
together.

Incidentally, if the blocking behaviour is intended, I'm sure it's not
implemented quite right, since it's entirely possible for a new searcher to
be created between the point where the existing searcher is closed and the
point where the new index writer gets the write lock; but we really don't
want that fixed! ;)

Thanks again,

Jeremy

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 22 November 2016 18:37
To: solr-user 
Subject: Re: Partial replication blocks subsequent requests when using 
solrcloud and master/slave replication

This sounds a lot like:

https://issues.apache.org/jira/browse/SOLR-9706


Could you attach your patch to that issue if you think it's the same?
And please copy/paste your e-mail in a comment if you would, you've obviously 
done more research on the cause than I did and that'd save some work whenever 
someone picks it up.

It's unclear to me whether this is intentional behavior or an accident of code, 
either way having a place to start when analyzing is much appreciated.


Best,
Erick

On Tue, Nov 22, 2016 at 10:02 AM, Jeremy Hoy  wrote:
> Hi All,
>
> We're running a fairly non-standard solr configuration.  We ingest into named 
> shards in master cores and then replicate out to slaves running solr cloud.  
> So in effect we are using solrcloud only to manage the config files and, more 
> importantly, to look after the cluster state.  Our corpus and search workload 
> is such that this makes sense: it reduces the need to query every shard for 
> each search, since the majority of queries contain values that allow us to 
> target the search towards the shards holding the appropriate documents; also, 
> this isolates the searching slaves from the costs of indexing (we index fairly 
> infrequently, but in fairly large volumes).  I'm happy to expand on this if 
> anyone is interested, or take suggestions as to how we might better be 
> doing things.
>
> We've been running 4.6.0 for the past 3 years or so, but have recently 
> upgraded to 5.5.2 - we'll likely be upgrading to 6.3.0 shortly.   However we 
> hit a problem when running 5.5.2, which we also replicated in 6.2.1 and 
> 6.3.0.  When a partial replication starts this will usually block all 
> subsequent requests to solr, whilst replication continues in the background.  
> Whilst in this blocked state we took thread dumps using VisualVM; we see this 
> when running 6.3.0:
>
> "explicit-fetchindex-cmd" - Thread t@71
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> ..
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1463)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1429)
> at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:855)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:434)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:251)
> at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:397)
> at 
> org.apache.solr.handler.ReplicationHandler.lambda$handleRequestBody$0(ReplicationHandler.java:279)
> at 
> org.apache.solr.handler.ReplicationHandler$$Lambda$82/776974667.run(Unknown 
> Source)
> at java.lang.Thread.run(Thread.java:745)
>
>Locked ownable synchronizers:
> - locked <4c18799d> (a
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>
> - locked <64a00f> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>
> and
>
> "qtp1873653341-61" - Thread t@61
>java.lang.Thread.State: TIMED_WAITING
> at sun.misc.Unsafe.park(Native Method)
> - waiting to lock <64a00f> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) owned by 
> "explicit-fetchindex-cmd" t@71
> at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
> at 
> 

Re: Solr6 CDCR - indexing doc to source replicates to target, doc not searchable in target

2016-11-22 Thread gayatri.umesh
Thank you again Erick. Added autoSoftCommit settings in target solrconfig and
it works now.

As CDCR does not auto-commit on the target upon replication, is there a
specific reason for this?

Thanks,
Gayatri





How to support facet values in search term

2016-11-22 Thread Shamik Bandopadhyay
Hi,

  I'm looking for some suggestions on enabling search terms to include
facet fields as well. In my use case, we've a bunch of product and
corresponding release fields which are explicitly used as facets. But what
we are observing is that end users tend to use the product name as part of
the search term, instead of filtering the product from the facet itself.
For e.g. we've "Product A" and "Product B", each having release 2016, 2017.
A common user search appears to be "Product A service pack". Since Product
A is not part of the search fields (typically text, title, keyword, etc),
it's not returning any data.

We've a large set of facet fields, and I would ideally like to avoid adding
them to the searchable list. Just wondering if there's a better way to
handle this situation. Any pointers will be appreciated.

Thanks,
Shamik


Re: Partial replication blocks subsequent requests when using solrcloud and master/slave replication

2016-11-22 Thread Erick Erickson
This sounds a lot like:

https://issues.apache.org/jira/browse/SOLR-9706


Could you attach your patch to that issue if you think it's the same?
And please copy/paste your e-mail in a comment if you would, you've
obviously done more research on the cause than I did and that'd save
some work whenever someone picks it up.

It's unclear to me whether this is intentional behavior or an accident
of code, either way having a place to start when analyzing is much
appreciated.


Best,
Erick

On Tue, Nov 22, 2016 at 10:02 AM, Jeremy Hoy  wrote:
> Hi All,
>
> We're running a fairly non-standard solr configuration.  We ingest into named 
> shards in master cores and then replicate out to slaves running solr cloud.  
> So in effect we are using solrcloud only to manage the config files and, more 
> importantly, to look after the cluster state.  Our corpus and search workload 
> is such that this makes sense: it reduces the need to query every shard for 
> each search, since the majority of queries contain values that allow us to 
> target the search towards the shards holding the appropriate documents; also, 
> this isolates the searching slaves from the costs of indexing (we index fairly 
> infrequently, but in fairly large volumes).  I'm happy to expand on this if 
> anyone is interested, or take suggestions as to how we might better be 
> doing things.
>
> We've been running 4.6.0 for the past 3 years or so, but have recently 
> upgraded to 5.5.2 - we'll likely be upgrading to 6.3.0 shortly.   However we 
> hit a problem when running 5.5.2, which we also replicated in 6.2.1 and 
> 6.3.0.  When a partial replication starts this will usually block all 
> subsequent requests to solr, whilst replication continues in the background.  
> Whilst in this blocked state we took thread dumps using VisualVM; we see this 
> when running 6.3.0:
>
> "explicit-fetchindex-cmd" - Thread t@71
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> ..
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1463)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1429)
> at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:855)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:434)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:251)
> at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:397)
> at 
> org.apache.solr.handler.ReplicationHandler.lambda$handleRequestBody$0(ReplicationHandler.java:279)
> at 
> org.apache.solr.handler.ReplicationHandler$$Lambda$82/776974667.run(Unknown 
> Source)
> at java.lang.Thread.run(Thread.java:745)
>
>Locked ownable synchronizers:
> - locked <4c18799d> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>
> - locked <64a00f> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>
> and
>
> "qtp1873653341-61" - Thread t@61
>java.lang.Thread.State: TIMED_WAITING
> at sun.misc.Unsafe.park(Native Method)
> - waiting to lock <64a00f> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) owned by 
> "explicit-fetchindex-cmd" t@71
> at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
> at 
> org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:159)
> at 
> org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:104)
> at 
> org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1781)
> at 
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
> at 
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
> at 
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1577)
> .
>
>
> The cause of the problem seems to be that in IndexFetcher.fetchLatestIndex, 
> when running as solrcloud, the searcher is shut down prior to cleaning up 
> the existing segment files and downloading the new ones.
>
> 6.3.0 - Lines(407-409)
> if 
> (solrCore.getCoreDescriptor().getCoreContainer().isZooKeeperAware()) {
> solrCore.closeSearcher();
> }
>
> 

Re: Re-Indexing 143 million rows

2016-11-22 Thread subinalex
Thanks a lot Erick :-)

On 21 Nov 2016 20:09, "Erick Erickson [via Lucene]" <
ml-node+s472066n4306659...@n3.nabble.com> wrote:

> In a word, "no". Resending the same document will
>
> 1> delete the old version (based on ID)
> 2> index the document just sent.
>
> When a document comes in, Solr can't assume that
> "nothing's changed". What if you changed your schema?
>
> So I'd expect the second run to take at least as long as the first.
>
> Best,
> Erick
>
> On Mon, Nov 21, 2016 at 1:16 AM, subinalex <[hidden email]
> > wrote:
>
> > Hi Team,
> >
> > I have indexed data with 143 million rows (docs) into solr. It takes
> > around 3 hours to index. I used csvUpdateHandler and index the csv file
> > by remote streaming.
> > Now, when I re-index the same csv data, it is still taking 3+ hours.
> >
> > Ideally, since there are no changes in the _id values, it should have
> > finished quickly, right?
> >
> > Please provide some insights on this.
> >
> > Regards,
> > Subin
> >
> >
> >
>
>





Partial replication blocks subsequent requests when using solrcloud and master/slave replication

2016-11-22 Thread Jeremy Hoy
Hi All,

We're running a fairly non-standard solr configuration.  We ingest into named 
shards in master cores and then replicate out to slaves running solr cloud.  So 
in effect we are using solrcloud only to manage the config files and, more 
importantly, to look after the cluster state.  Our corpus and search workload 
is such that this makes sense: it reduces the need to query every shard for each 
search, since the majority of queries contain values that allow us to target 
the search towards the shards holding the appropriate documents; also, this 
isolates the searching slaves from the costs of indexing (we index fairly 
infrequently, but in fairly large volumes).  I'm happy to expand on this if 
anyone is interested, or take suggestions as to how we might better be doing things.

We've been running 4.6.0 for the past 3 years or so, but have recently upgraded 
to 5.5.2 - we'll likely be upgrading to 6.3.0 shortly.   However we hit a 
problem when running 5.5.2, which we also replicated in 6.2.1 and 6.3.0.  When 
a partial replication starts this will usually block all subsequent requests to 
solr, whilst replication continues in the background.  Whilst in this blocked 
state we took thread dumps using VisualVM; we see this when running 6.3.0:

"explicit-fetchindex-cmd" - Thread t@71
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
..
at 
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1463)
at 
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1429)
at 
org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:855)
at 
org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:434)
at 
org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:251)
at 
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:397)
at 
org.apache.solr.handler.ReplicationHandler.lambda$handleRequestBody$0(ReplicationHandler.java:279)
at 
org.apache.solr.handler.ReplicationHandler$$Lambda$82/776974667.run(Unknown 
Source)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- locked <4c18799d> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)

- locked <64a00f> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

and

"qtp1873653341-61" - Thread t@61
   java.lang.Thread.State: TIMED_WAITING
at sun.misc.Unsafe.park(Native Method)
- waiting to lock <64a00f> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) owned by 
"explicit-fetchindex-cmd" t@71
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
at 
org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:159)
at 
org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:104)
at 
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1781)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1577)
.


The cause of the problem seems to be that in IndexFetcher.fetchLatestIndex, 
when running as solrcloud, the searcher is shut down prior to cleaning up 
the existing segment files and downloading the new ones.

6.3.0 - Lines(407-409)
if 
(solrCore.getCoreDescriptor().getCoreContainer().isZooKeeperAware()) {
solrCore.closeSearcher();
}

Subsequently solrCore.getUpdateHandler().newIndexWriter(true); takes a write 
lock on the indexwriter, which is not released until the openIndexWriter call 
after the new files have been copied.  So because openNewSearcher needs to take 
a read lock on the index writer, and it can't take that whilst the write lock 
is in place, all subsequent requests are blocked.

To test this we queued up a load of search requests, then manually triggered 
replication, reasoning that a new searcher might be created before the write 
lock is taken.  On a test instance manually triggering replication would almost 
always result in all subsequent requests being blocked, but when we queued up 
search requests and ran these whilst triggering replication 

A good way to extract "facets" from json.facet response via solrj?

2016-11-22 Thread Michael Joyner

Hello all,

It seems I can't find a "getFacets" method for SolrJ when handling a 
query response from a json.facet call.


I see that I can get a top level opaque object via "Object obj = 
response.getResponse().get("facets");"


Is there any code in SolrJ to parse this out as an easy to use navigable 
object?
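
So far the best I can do is cast and walk it by hand. A minimal sketch of
what that looks like (the facet name "categories" is just an assumption):

    // response is an org.apache.solr.client.solrj.response.QueryResponse;
    // NamedList is org.apache.solr.common.util.NamedList, List is java.util.List
    NamedList<Object> facets =
        (NamedList<Object>) response.getResponse().get("facets");
    long total = ((Number) facets.get("count")).longValue();
    NamedList<Object> categories =
        (NamedList<Object>) facets.get("categories");
    List<NamedList<Object>> buckets =
        (List<NamedList<Object>>) categories.get("buckets");
    for (NamedList<Object> bucket : buckets) {
        System.out.println(bucket.get("val") + " -> " + bucket.get("count"));
    }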


-Mike



Re: Frequent mismatch in the numDocs between replicas

2016-11-22 Thread Erick Erickson
The autocommit settings on leaders and replicas
can be slightly offset in terms of wall clock time so
docs that have been committed on one node may
not have been committed on the other. Your comment
that you can optimize and fix this is evidence that this
is what you're seeing.

to test this:
1> stop indexing
2> issue a "commit" to the collection.

If that shows all replicas with the same count, then
the above is the explanation.
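
For step 2, something like this against any node should do (assuming the
default port, with your collection name in place of "test"):

curl "http://localhost:8983/solr/test/update?commit=true"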

Best,
Erick

On Mon, Nov 21, 2016 at 6:52 PM, Lewin Joy (TMS)  wrote:
> ** PROTECTED (Confidential; for internal parties only)
> Hi,
>
> I am having a strange issue working with solr 6.1 cloud setup on zookeeper 
> 3.4.8
>
> Intermittently, after I run indexing, the replicas have different 
> record counts.
> And even though there is this mismatch, they are still marked healthy and 
> are being used for queries.
> So now I get inconsistent results depending on the replica used for the query.
>
> This gets resolved after restarting solr servers. Or if I just do an optimize 
> on the collection.
>
> Any idea what could be wrong? Have any of you faced something similar?
> Is there some configuration or setting I should be checking?
>
>
> Thanks,
> Lewin


Re: Solr6 CDCR - indexing doc to source replicates to target, doc not searchable in target

2016-11-22 Thread Erick Erickson
With those settings, you never commit the documents on the target,
thus they are never searchable on the target.

No, I do _not_ recommend you issue a manual commit on the target,
that was just a manual test to see if your issue was about commits. I'd
just either set openSearcher to true in the <autoCommit> section or
perhaps set autoSoftCommit to something other than -1, which means
"never". Set it to as long as you can stand, soft commits are not as
expensive as hard commits, but they aren't free.

See: 
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
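
For example, a one-minute soft commit on the target could look like this in
solrconfig.xml (the 60-second interval is just an assumption; pick what your
freshness requirements allow):

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:60000}</maxTime>
</autoSoftCommit>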

Best,
Erick

On Mon, Nov 21, 2016 at 7:11 PM, gayatri.umesh
 wrote:
> Thank you Erick for pointing out. I missed that!!
>
> Below are the commit settings in solrconfig.xml in both source and target.
> <autoCommit>
>   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> </autoSoftCommit>
>
> Is it recommended to issue a commit on the target when indexing the
> document, as replication does not autocommit? Or should I enable autocommit
> in the target solrconfig?
>
> Thanks,
> Gayatri
>
>
>


Re: compile solr6.2.1

2016-11-22 Thread Erick Erickson
You said "I can't compile" but then showed us the line for
_running_ Solr so I'm
really confused what your problem is. So:

1> can you compile? IOW, if you execute "ant server dest" from the
/solr
directory, does it run to completion? You should see "BUILD SUCCESSFUL" after a
few minutes indicating you have successfully compiled.

2> If <1> works, then is your problem trying to run Solr
afterwards? This part of your command line
"suspend=y" means that Solr will hang until you connect to it with
your debugger. What happens
when you try to attach with the IDE? How did you set your remote
session up? What IDE do
you use? Did you follow the instructions here:
https://wiki.apache.org/solr/HowToContribute
in "Development Environment Tips"? What was the result?

If you just use "suspend=n" Solr will start up and you can attach with
the debugger any time.
"suspend=y" is, of course, very useful when debugging initialization
and the like.
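
For what it's worth, a sketch of passing those flags through the start
script (the foreground flag and debug port are just one way to do it):

bin/solr start -f -a "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8080"

Then attach your IDE's remote debugger to port 8080.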

Best,
Erick

On Mon, Nov 21, 2016 at 9:35 PM, Wunna Lwin  wrote:
> Hi,
>
> I am trying to implement solr cloud with version 6.2.1, but I have a problem: 
> I can't compile the solr source code to write a custom plugin for solr. I just 
> use this command for remote debug: "java -Xdebug 
> -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8080 -jar " but it 
> doesn't work. Help me out please.


Re: Multilevel sorting in JSON-facet

2016-11-22 Thread Aman Tandon
Any help here?

With Regards
Aman Tandon

On Thu, Nov 17, 2016 at 7:16 PM, Wonderful Little Things <
amantandon...@gmail.com> wrote:

> Hi,
>
> I want to do the sorting on multiple fields using the JSON-facet API, so
> is this available? And if it is, then what would be the syntax?
>
> Thanks,
> Aman Tandon
>


Re: Solr as am html cache

2016-11-22 Thread Erick Erickson
bq: This seems like it might even be a good approach for creating
additional cores primarily for the purpose of caching

I think you're making it too complex, especially for such a small data set ;)

1> All the data is memory mapped anyway, so what's not in the JVM will
be in the OS's
memory eventually (assuming you have enough physical memory). See:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
If you don't have enough physical memory for that to happen, adding
another core won't help.

2> You can set your documentCache in solrconfig.xml high enough that
it'll cache all your documents _uncompressed_ (memory permitting) within
2 minutes of changing your solrconfig.xml file; see the sketch after
this list.

3> My challenge is always to measure before you code. My intuition is
that if you quantify
the potential gains of going to more complex caching they'll be
insignificant; not worth
the development time. Can't argue with measurements though.
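
For reference, a sketch of such a documentCache entry (the size is an
assumption based on the ~50,000 item corpus mentioned below, so everything
fits):

<documentCache class="solr.LRUCache"
               size="60000"
               initialSize="60000"
               autowarmCount="0"/>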

FWIW,
Erick

On Mon, Nov 21, 2016 at 11:56 PM, Aristedes Maniatis  wrote:
> Thanks Erick
>
> Very helpful indeed.
>
> Your guesses on data size are about right. There might only be 50,000 items 
> in the whole index. And typically we'd fetch a batch of 10. Disk is cheap and 
> this really isn't taking much room anyway. For such a tiny data set, it seems 
> like this approach will work well.
>
>
> This seems like it might even be a good approach for creating additional 
> cores primarily for the purpose of caching: that is, a core full of records 
> that are only ever queried by some unique key. I wouldn't want to abuse Solr 
> for a purpose it wasn't designed for, but since it is already there it appears 
> to be a useful approach. Rather than getting some data from the db, we fetch 
> it from Solr pre-assembled.
>
> Thanks
> Ari
>
>
>
> On 22/11/16 3:28am, Erick Erickson wrote:
>> Searching isn't really going to be impacted much, if at all. You're
>> essentially talking about setting some field with store="true" and
>> stuffing the HTML into that, right? It will probably have indexed="false"
>> and docValues="false".
>>
>> So.. what that means is that very early in the indexing process, the
>> raw data is dumped to the *.fdt and *.fdx extensions for the segment. These
>> are totally irrelevant for querying, they aren't even read from disk to score
>> the docs. So let's say your numFound = 10,000 and rows=10. Those 10,000
>> docs are scored without having to look at the stored data at all. Now, when
>> the 10 docs are assembled for return, the stored data is read off disk
>> decompressed and returned.
>>
>> So the additional cost will be
>> 1> your index is larger on disk
>> 2> merging etc. will be a bit more costly. This doesn't
>>  seem like a problem if your index doesn't change all
>>  that often.
>> 3> there will be some additional load to decompress the data
>>  and return it.
>>
>> This is a perfectly reasonable approach, my guess is that any difference
>> in search speed will be lost in the noise of measuring and that the
>> additional load of decompressing will be more than offset by not having
>> to make a separate service call to actually get the doc, but as always
>> measuring the performance is the proof you need.
>>
>> You haven't indicated how _many_ docs you have in your corpus, but a
>> rough indication of the additional disk space is about half the raw HTML 
>> size,
>> we've usually seen about a 2:1 compression ratio. With a zillion docs
>> that could be sizeable, but disk space is cheap.
>>
>>
>> Best,
>> Erick
>>
>> On Mon, Nov 21, 2016 at 8:08 AM, Aristedes Maniatis
>>  wrote:
>>> I'm familiar enough with 7-8 years of Solr usage in how it performs as a 
>>> full text search index, including spatial coordinates and much more. But 
>>> for the most part, we've been returning database ids from Solr rather than 
>>> a full record ready to display. We then grab the data and related records 
>>> from the database in the usual way and display it.
>>>
>>> We are thinking now about improving the performance of our app. One option is 
>>> Redis to store html pieces for reuse, rather than assembling the html from 
>>> dozens of queries to the database. We've done what we can with caching in 
>>> the ORM level, and we can't do too much with varnish because of differences 
>>> in page rendering per user (eg shopping baskets).
>>>
>>> But we are thinking about storing the rendered html directly in Solr. The 
>>> downsides appear to be:
>>>
>>> * adding 2-10kB of html to each record and the performance hit this might 
>>> have on searching and retrieving
>>> * additional load of ensuring we rebuild Solr's data every time some part 
>>> of that html changes (but this is minimal in our use case)
>>> * additional cores that we'll want to add to cache other data that isn't 
>>> yet in Solr
>>>
>>> Is this a reasonable approach to avoid running yet another cluster of 
>>> services? Are there downsides to this I haven't thought 

Re: Query parser behavior with AND and negative clause

2016-11-22 Thread Erick Erickson
_How_ does it "not work"? You haven't told us what you expect vs.
what you get back.

Plus a sample doc that violates your expectations (just the
dateRange field) would also help.

Best,
Erick

On Tue, Nov 22, 2016 at 4:23 AM, Sandeep Khanzode
 wrote:
> Hi,
> I have a simple query that should intersect with dateRange1 and NOT be 
> contained within dateRange2. I have tried the following options:
>
> WORKS:
> +{!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO 
> 2016-11-22T13:59:00Z]'} +(*:* -{!field f=dateRange2 op=Contains 
> v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'})
>
>
> DOES NOT WORK :
> {!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO 
> 2016-11-22T13:59:00Z]'} AND (*:* -{!field f=dateRange2 op=Contains 
> v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'})
>
> Why?
>
> WILL NOT WORK (because of the negative clause at the top level?):
> {!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO 
> 2016-11-22T13:59:00Z]'} AND -{!field f=dateRange2 op=Contains 
> v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'}
>
>
> SRK


Re: Solr 6 Performance Suggestions

2016-11-22 Thread Yonik Seeley
It depends highly on what your requests look like, and which ones are slower.
If your request mix is heterogeneous, find the types of requests
that seem to have the largest slowdown and let us know what they look
like.

-Yonik


On Tue, Nov 22, 2016 at 8:54 AM, Max Bridgewater
 wrote:
> I migrated an application from Solr 4 to Solr 6.  solrconfig.xml  and
> schema.xml are essentially the same. The JVM params are also pretty much
> similar.  The indices each have about 2 million documents. No particular
> tuning was done to Solr 6 beyond the default settings. Solr 4 is running in
> Tomcat 7.
>
> Early results seem to show Solr 4 outperforming Solr 6. The first shows an
> average response time of 280 ms while the second averages at 430 ms. The
> test cases were exactly the same, the machines were exactly the same and
> heap settings exactly the same (Xms24g, Xmx24g). Requests were sent with
> Jmeter with 50 concurrent threads for 2h.
>
> I know that this is not enough information to claim that Solr 4 generally
> outperforms Solr 6. I also know that this pretty much depends on what the
> application does. So I am not claiming anything general. All I want to do
> is get some input before I start digging.
>
> What are some things I could tune to improve the numbers for Solr 6? Have
> you guys experienced such discrepancies?
>
> Thanks,
> Max.


RE: Solr 6 Performance Suggestions

2016-11-22 Thread Prateek Jain J

I am not sure, but I heard this in one of the discussions: you can't migrate 
directly from solr 4 to solr 6. It has to be incremental, like solr 4 to solr 5 
and then to solr 6. I might be wrong, but it is worth trying.


Regards,
Prateek Jain

-Original Message-
From: Max Bridgewater [mailto:max.bridgewa...@gmail.com] 
Sent: 22 November 2016 01:54 PM
To: solr-user@lucene.apache.org
Subject: Solr 6 Performance Suggestions

I migrated an application from Solr 4 to Solr 6.  solrconfig.xml  and 
schema.xml are essentially the same. The JVM params are also pretty much similar.  
The indices each have about 2 million documents. No particular tuning was done 
to Solr 6 beyond the default settings. Solr 4 is running in Tomcat 7.

Early results seem to show Solr 4 outperforming Solr 6. The first shows an 
average response time of 280 ms while the second averages at 430 ms. The test 
cases were exactly the same, the machines were exactly the same and heap 
settings exactly the same (Xms24g, Xmx24g). Requests were sent with Jmeter with 
50 concurrent threads for 2h.

I know that this is not enough information to claim that Solr 4 generally 
outperforms Solr 6. I also know that this pretty much depends on what the 
application does. So I am not claiming anything general. All I want to do is 
get some input before I start digging.

What are some things I could tune to improve the numbers for Solr 6? Have you 
guys experienced such discrepancies?

Thanks,
Max.


Solr 6 Performance Suggestions

2016-11-22 Thread Max Bridgewater
I migrated an application from Solr 4 to Solr 6.  solrconfig.xml  and
schema.xml are essentially the same. The JVM params are also pretty much
similar.  The indices each have about 2 million documents. No particular
tuning was done to Solr 6 beyond the default settings. Solr 4 is running in
Tomcat 7.

Early results seem to show Solr 4 outperforming Solr 6. The first shows an
average response time of 280 ms while the second averages at 430 ms. The
test cases were exactly the same, the machines were exactly the same and
heap settings exactly the same (Xms24g, Xmx24g). Requests were sent with
Jmeter with 50 concurrent threads for 2h.

I know that this is not enough information to claim that Solr 4 generally
outperforms Solr 6. I also know that this pretty much depends on what the
application does. So I am not claiming anything general. All I want to do
is get some input before I start digging.

What are some things I could tune to improve the numbers for Solr 6? Have
you guys experienced such discrepancies?

Thanks,
Max.


Query parser behavior with AND and negative clause

2016-11-22 Thread Sandeep Khanzode
Hi,
I have a simple query that should intersect with dateRange1 and NOT be 
contained within dateRange2. I have tried the following options:

WORKS:
+{!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO 
2016-11-22T13:59:00Z]'} +(*:* -{!field f=dateRange2 op=Contains 
v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'}) 


DOES NOT WORK :
{!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO 
2016-11-22T13:59:00Z]'} AND (*:* -{!field f=dateRange2 op=Contains 
v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'}) 

Why?

WILL NOT WORK (because of the negative clause at the top level?):
{!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO 
2016-11-22T13:59:00Z]'} AND -{!field f=dateRange2 op=Contains 
v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'} 


SRK

Re: How many versions do you stay behind in production for better stability?

2016-11-22 Thread Naresh Yadav
Please let me know about your conclusion?
I am also in the same confusion.

On Thu, Nov 17, 2016 at 12:55 PM, Dorian Hoxha 
wrote:

> Hi,
>
> I see that there is a new release on every lucene release. Do you always
> use the latest version, given that it may have bugs (e.g. most cassandra
> production deployments are old compared to the latest `stable` version
> because newer ones aren't trusted yet)? How far behind do you usually
> stay? (e.g. 6.3 just came out, and you need to be in production after 1
> month; will you upgrade on dev if you don't need any new feature?)
>
> Thank You
>


Re: Wildcard searches with space in TextField/StrField

2016-11-22 Thread Sandeep Khanzode
Hi Erick,
I gave this a try. 
These are my results. There is a record with "John D. Smith", and another named 
"John Doe".

1.] {!complexphrase inOrder=true}name:"John D.*" ... does not fetch any 
results. 

2.] {!complexphrase inOrder=true}name:"John D*" ... fetches both results. 



Second observation: There is a record with "John D Smith"
1.] {!complexphrase inOrder=true}name:"John*" ... does not fetch any results. 

2.] {!complexphrase inOrder=true}name:"John D*" ... fetches that record. 

3.] {!complexphrase inOrder=true}name:"John D S*" ... fetches that record. 

SRK 

On Sunday, November 13, 2016 7:43 AM, Erick Erickson 
 wrote:
 

 Right, for that kind of use case you want complexPhraseQueryParser,
see: 
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

Best,
Erick

On Sat, Nov 12, 2016 at 9:39 AM, Sandeep Khanzode
 wrote:
> Thanks, Erick.
>
> I am actually not trying to use the String field (prefer a TextField here).
> But, in my comparisons with TextField, it seems that something like phrase
> matching with whitespace and wildcard (like, 'my do*' or say, 'my dog*', or
> say, 'my dog has*') can only be accomplished with a string type field,
> especially because, with a WhitespaceTokenizer in TextField, the space will
> be lost, and all tokens will be individually considered. Am I missing
> something?
>
> SRK
>
>
> On Friday, November 11, 2016 10:05 PM, Erick Erickson
>  wrote:
>
>
> You have to query text and string fields differently, that's just the
> way it works. The problem is getting the query string through the
> parser as a _single_ token or as multiple tokens.
>
> Let's say you have a string field with the "a b" example. You have a
> single token
> a b that starts at offset 0.
>
> But with a text field, you have two tokens,
> a at position 0
> b at position 1
>
> But when the query parser sees "a b" (without quotes) it splits it
> into two tokens, and only the text field has both tokens so the string
> field won't match.
>
> OTOH, when the query parser sees "a\ b" it passes this through as a
> single token, which only matches the string field as there's no
> _single_ token "a b" in the text field.
>
> But a more interesting question is why you want to search this way.
> String fields are intended for keywords, machine-generated IDs and the
> like. They're pretty useless for searching anything except
> 1> exact tokens
> 2> prefixes
>
> While if you have "my dog has fleas" in a string field, you _can_
> search "*dog*" and get a hit but the performance is poor when you get
> a large corpus. Performance for "my*" will be pretty good though.
>
> In all this sounds like an XY problem, what's the use-case you're
> trying to solve?
>
> Best,
> Erick
>
>
>
> On Thu, Nov 10, 2016 at 10:11 PM, Sandeep Khanzode
>  wrote:
>> Hi Erick, Reth,
>>
>> The 'a\ b*' as well as the q.op=AND approach worked (successfully) only
>> for StrField for me.
>>
>> Any attempt at creating a 'a\ b*' for a TextField does not match any
>> documents. The parsedQuery in debug mode does show 'field:a b*'. I am sure
>> there are documents that should match.
>> Another (maybe unrelated) observation is if I have 'field:a\ b', then the
>> parsedQuery is field:a field:b. Which does not match as expected (matches
>> individually).
>>
>> Can you please provide an example that I can use in Solr Query dashboard?
>> That will be helpful.
>>
>> I have also seen that wildcard queries work irrespective of field type
>> i.e. StrField as well as TextField. That makes sense because a
>> WhitespaceTokenizer only creates word boundaries when we do not use an
>> EdgeNGramFilter. If I am not wrong, that is. SRK
>>
>>    On Friday, November 11, 2016 5:00 AM, Erick Erickson
>>  wrote:
>>
>>
>>  You can escape the space with a backslash as  'a\ b*'
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 10, 2016 at 2:37 PM, Reth RM  wrote:
>>> I don't think you can do wildcard on StrField. For text field, if your
>>> query is "category:(test m*)"  the parsed query will be  "category:test
>>> OR
>>> category:m*"
>>> You can add q.op=AND to make an AND between those terms.
>>>
>>> For phrase type wild card query support, as per docs, it
>>> is ComplexPhraseQueryParser that supports it. (I haven't tested it
>>> myself)
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
>>>
>>> On Thu, Nov 10, 2016 at 11:40 AM, Sandeep Khanzode <
>>> sandeep_khanz...@yahoo.com.invalid> wrote:
>>>
 Hi,
 How does a search like abc* work in StrField. Since the entire thing is
 stored as a single token, is it a type of a trie structure that allows
 such
 wildcard matching?
 How can searches with space like 'a b*' be executed for text fields
 (tokenized on whitespace)? If we specify 

Re: Solr/lucene "planet" + recommendations for blogs to follow

2016-11-22 Thread Alexandre Rafalovitch
I tried weekly. I did not have personal bandwidth for that. It
actually takes quite a lot of time to do the newsletter, especially
since I also try to update the website (a separate messy/hacky story).
And since English is not my first language and writing short copy is
harder than a long one :-)

The curation project would obviously help once I get to it, as the
same material would contribute to both sources, just in different
volumes.

Thanks for the bug report. The screenshot does not make it through to the
public mailing list (as this thread is), but I'll figure it out. I
have enough info.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 22 November 2016 at 22:45, Dorian Hoxha  wrote:
> Thanks Alex, some kind of weekly newsletter would be great (examples I
> subscribe to are db weekly, postgresql weekly, redis weekly).
>
> If it makes sense, to make it weekly, add some sponsor(targeted) to it, and
> it should be nicer. Maybe even include es,lucene if there's not enough
> content or there's interest.
>
> A small bug on your site, the twitter widget is on top of the sign-up form
> (maybe only happens on small resolutions, happened on fullscreen for me).
> See attached screenshot.
>
> On Tue, Nov 22, 2016 at 12:22 PM, Alexandre Rafalovitch 
> wrote:
>>
>> I am not aware of any aggregator like that. And I looked, hard.
>>
>> I, myself, publish a newsletter (Solr Start, in signature, every 2
>> weeks) that usually has a couple of links to cool Solr stuff I found.
>> Subscribing to newsletter also gives access to full archives...
>>
>> To find the links, I have a bunch of ad-hoc keyword trackers installed
>> for that. Just basic hacks for now.
>>
>> I am also _thinking_ of creating an aggregator. But not so much the
>> planet style as a Yahoo-directory/open-directory style. For which
>> (Yahoo style directory curation and generation), I cannot seem to find
>> a good software package either. So, I may build one from scratch.
>> Probably just as hacky, just because my skills are not universal. A
>> hacky version will probably look like Twitter keyword scanner with URL
>> deduplication, fully manual curation and Wordpress as a publishing
>> platform.
>>
>> But if anybody is interested in helping with building a proper
>> open-source one as a small big-data pipeline (in Java), give me a
>> yell. The non-hacky system will probably need to put together a
>> crawler (twitter, websites, etc), a graph database, possibly some
>> analyzer/reducer/aggregator, manual/ML curator/tagger, and (in my
>> mind) static site builder with Solr (duh!) as a search backend. I have
>> a lot more design thoughts of course, but the list is not the right
>> place for multi-page idea dump :-) And I am happy to read anybody
>> else's idea dumps on this concept, sent off-the-list.
>>
>> As to "what's happening" - subscribing to JIRA list and filtering out
>> issue notifications is probably a reasonable way to see what work is
>> going on. I have filters that specifically catch CREATE issue emails.
>> I also review release notes in details. That keeps me up to date with
>> new stuff. Older stuff or in-depth explanations of new stuff is -
>> unfortunately - all over the place, so it is hard to give a short list
>> of things to follow. Of course, Lucidworks blog seems to be pretty
>> active: https://lucidworks.com/blog/
>>
>> Regards,
>>Alex.
>>
>> 
>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>
>>
>> On 22 November 2016 at 21:56, Dorian Hoxha  wrote:
>> > Hello searchers,
>> >
>> > Is there a solr/lucene "planet" like planet.postgresql.org? If not, what
>> > are some blogs/rss/feeds that I should follow to learn what's happening in
>> > the solr/lucene worlds?
>> >
>> > Thank You
>
>


Re: Solr/lucene "planet" + recommendations for blogs to follow

2016-11-22 Thread Dorian Hoxha
Thanks Alex, some kind of weekly newsletter would be great (examples I
subscribe to are db weekly, postgresql weekly, redis weekly).

If it makes sense, you could make it weekly and add a (targeted) sponsor to
it; that should make it nicer still. Maybe even include ES and Lucene if
there's not enough content, or if there's interest.

A small bug on your site: the Twitter widget sits on top of the sign-up form
(it may only happen at small resolutions, but it happened at fullscreen for me).
See attached screenshot.

On Tue, Nov 22, 2016 at 12:22 PM, Alexandre Rafalovitch wrote:

> I am not aware of any aggregator like that. And I looked, hard.
>
> I, myself, publish a newsletter (Solr Start, in signature, every 2
> weeks) that usually has a couple of links to cool Solr stuff I found.
> Subscribing to newsletter also gives access to full archives...
>
> To find the links, I have a bunch of ad-hoc keyword trackers installed
> for that. Just basic hacks for now.
>
> I am also _thinking_ of creating an aggregator. But not so much the
> planet style as a Yahoo-directory/open-directory style. For that
> (Yahoo-style directory curation and generation), I cannot seem to find
> a good software package either. So, I may build one from scratch.
> Probably just as hacky, because my skills are not universal. A
> hacky version will probably look like a Twitter keyword scanner with URL
> deduplication, fully manual curation, and Wordpress as a publishing
> platform.
>
> But if anybody is interested in helping with building a proper
> open-source one as a small big-data pipeline (in Java), give me a
> yell. The non-hacky system will probably need to put together a
> crawler (twitter, websites, etc), a graph database, possibly some
> analyzer/reducer/aggregator, manual/ML curator/tagger, and (in my
> mind) static site builder with Solr (duh!) as a search backend. I have
> a lot more design thoughts of course, but the list is not the right
> place for multi-page idea dump :-) And I am happy to read anybody
> else's idea dumps on this concept, sent off-the-list.
>
> As to "what's happening" - subscribing to JIRA list and filtering out
> issue notifications is probably a reasonable way to see what work is
> going on. I have filters that specifically catch CREATE issue emails.
> I also review release notes in detail. That keeps me up to date with
> new stuff. Older stuff and in-depth explanations of new stuff are -
> unfortunately - all over the place, so it is hard to give a short list
> of things to follow. Of course, the Lucidworks blog seems to be pretty
> active: https://lucidworks.com/blog/
>
> Regards,
>Alex.
>
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 22 November 2016 at 21:56, Dorian Hoxha  wrote:
> > Hello searchers,
> >
> > Is there a solr/lucene "planet" like planet.postgresql.org? If not, what
> > are some blogs/rss/feeds that I should follow to learn what's happening in
> > the solr/lucene worlds?
> >
> > Thank You
>


Re: Solr/lucene "planet" + recommendations for blogs to follow

2016-11-22 Thread Alexandre Rafalovitch
I am not aware of any aggregator like that. And I looked, hard.

I, myself, publish a newsletter (Solr Start, in signature, every 2
weeks) that usually has a couple of links to cool Solr stuff I found.
Subscribing to newsletter also gives access to full archives...

To find the links, I have a bunch of ad-hoc keyword trackers installed
for that. Just basic hacks for now.

I am also _thinking_ of creating an aggregator. But not so much the
planet style as a Yahoo-directory/open-directory style. For that
(Yahoo-style directory curation and generation), I cannot seem to find
a good software package either. So, I may build one from scratch.
Probably just as hacky, because my skills are not universal. A
hacky version will probably look like a Twitter keyword scanner with URL
deduplication, fully manual curation, and Wordpress as a publishing
platform.
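
To make that concrete, here is a rough Java sketch of just the
URL-deduplication stage. The class name and normalization rules are made
up on the spot - a toy under my own assumptions, not a design:

import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy pipeline stage: keep only links we have not seen before.
public class UrlDeduplicator {

    private final Set<String> seen = new HashSet<>();

    // Normalize just enough to catch trivial duplicates
    // (scheme/host case, trailing slash). Real rules would be richer.
    private static String normalize(String url) {
        URI u = URI.create(url.trim());
        String path = (u.getPath() == null || u.getPath().isEmpty())
                ? "/" : u.getPath();
        if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return (u.getScheme() + "://" + u.getHost()).toLowerCase() + path;
    }

    /** Returns true only the first time a (normalized) URL is offered. */
    public boolean offer(String url) {
        return seen.add(normalize(url));
    }

    public static void main(String[] args) {
        UrlDeduplicator dedup = new UrlDeduplicator();
        for (String link : Arrays.asList(
                "https://lucidworks.com/blog/",
                "HTTPS://lucidworks.com/blog",
                "http://www.solr-start.com/")) {
            System.out.println((dedup.offer(link) ? "NEW " : "DUP ") + link);
        }
    }
}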

But if anybody is interested in helping with building a proper
open-source one as a small big-data pipeline (in Java), give me a
yell. The non-hacky system will probably need to put together a
crawler (twitter, websites, etc), a graph database, possibly some
analyzer/reducer/aggregator, manual/ML curator/tagger, and (in my
mind) static site builder with Solr (duh!) as a search backend. I have
a lot more design thoughts of course, but the list is not the right
place for multi-page idea dump :-) And I am happy to read anybody
else's idea dumps on this concept, sent off-the-list.

As to "what's happening" - subscribing to JIRA list and filtering out
issue notifications is probably a reasonable way to see what work is
going on. I have filters that specifically catch CREATE issue emails.
I also review release notes in detail. That keeps me up to date with
new stuff. Older stuff and in-depth explanations of new stuff are -
unfortunately - all over the place, so it is hard to give a short list
of things to follow. Of course, the Lucidworks blog seems to be pretty
active: https://lucidworks.com/blog/

Regards,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 22 November 2016 at 21:56, Dorian Hoxha  wrote:
> Hello searchers,
>
> Is there a solr/lucene "planet" like planet.postgresql.org? If not, what
> are some blogs/rss/feeds that I should follow to learn what's happening in
> the solr/lucene worlds?
>
> Thank You


Re: Multiple search-queries in 1 http request ?

2016-11-22 Thread Dorian Hoxha
@Alex
Yes, that should also support more efficient (binary) serialization like
msgpack etc.
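
For what it's worth, SolrJ already talks Solr's compact binary codec
(javabin) on the wire: responses come back as javabin by default, and
updates can be sent as javabin via BinaryRequestWriter. A minimal sketch,
with a made-up base URL and collection name:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BinaryWireFormat {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/techproducts").build()) {
            // Send updates in javabin too (responses already default to it).
            solr.setRequestWriter(new BinaryRequestWriter());
            QueryResponse rsp = solr.query(new SolrQuery("*:*"));
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}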

On Tue, Nov 22, 2016 at 1:33 AM, Alexandre Rafalovitch wrote:

> HTTP/2 and whatever Google's new protocol is called are both built around
> pipelining requests over the same connection (HTTP 1.1 too, but not as
> well). So I feel the right approach would instead be to check whether
> SolrJ/Jetty can handle those, and not worry about it within Solr
> itself.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 22 November 2016 at 04:58, Walter Underwood wrote:
> > I agree that dispatching multiple queries is better.
> >
> > With multiple queries, we need to deal with multiple result codes,
> > multiple timeouts, and so on. Then write tests for all that stuff.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> >> On Nov 21, 2016, at 9:55 AM, Christian Ortner wrote:
> >>
> >> Hi,
> >>
> >> there has been a JIRA issue[0] for a long time that contains some patches
> >> for multiple releases of Solr that implement this functionality. Whether
> >> those patches still work in recent versions is a different topic, and the
> >> issue has been resolved as won't-fix.
> >>
> >> Personally, I think starting multiple queries asynchronously right after
> >> each other has few disadvantages compared to a batching mechanism.
> >>
> >> Best regards,
> >> Chris
> >>
> >>
> >> [0] https://issues.apache.org/jira/browse/SOLR-1093
> >>
> >> On Thu, Nov 17, 2016 at 7:50 PM, Mikhail Khludnev 
> wrote:
> >>
> >>> Hello,
> >>> There is nothing like that in Solr.
> >>>
> >>> On Thursday, November 17, 2016, Dorian Hoxha wrote:
> >>>
>  Hi,
> 
>  I couldn't find anything in core for "multiple separate queries in 1 http
>  request" like elasticsearch
>  <https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html>
>  ? I found this  multiple-queries/>
>  blog-post though I thought there is/should/would be something in core?
> 
>  Thank You
> 
> >>>
> >>>
> >>> --
> >>> Sincerely yours
> >>> Mikhail Khludnev
> >>>
> >
>
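
To illustrate the "dispatch them asynchronously" approach discussed above,
here is a minimal SolrJ sketch. The base URL, collection, and query terms
are made up; each future carries its own failure/timeout handling, which is
the bookkeeping Walter mentions:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ParallelQueries {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/techproducts").build()) {

            // One HTTP request per query, all in flight at the same time.
            List<CompletableFuture<QueryResponse>> futures =
                Arrays.asList("memory", "ssd", "monitor").stream()
                    .map(term -> CompletableFuture.supplyAsync(() -> {
                        try {
                            return solr.query(new SolrQuery(term));
                        } catch (Exception e) {
                            // Per-query failures surface on the future itself.
                            throw new RuntimeException("query failed: " + term, e);
                        }
                    }))
                    .collect(Collectors.toList());

            for (CompletableFuture<QueryResponse> f : futures) {
                System.out.println(f.join().getResults().getNumFound() + " hits");
            }
        }
    }
}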


Solr/lucene "planet" + recommendations for blogs to follow

2016-11-22 Thread Dorian Hoxha
Hello searchers,

Is there a solr/lucene "planet" like planet.postgresql.org? If not, what
are some blogs/rss/feeds that I should follow to learn what's happening in
the solr/lucene worlds?

Thank You


Re: Using solr(cloud) as source-of-truth for data (with no backing external db)

2016-11-22 Thread Dorian Hoxha
Yeah, that looks like the _source field that elasticsearch has.

On Mon, Nov 21, 2016 at 9:20 PM, Michael Joyner  wrote:

> Have a "store only" text field that contains a serialized (json?) of the
> master object for deserilization as part of the results parsing if you are
> wanting to save a DB lookup.
>
> I would still store everything in a DB though to have a "master" copy of
> everything.
>
>
>
> On 11/18/2016 04:45 AM, Dorian Hoxha wrote:
>
>> @alex
>> That makes sense, but it can be ~fixed by just storing every field that
>> you need.
>>
>> @Walter
>> Many of those things are missing from many NoSQL DBs, yet they're used as a
>> source of data.
>> As long as the backup is "point in time", meaning a consistent timestamp
>> across all shards, it ~should be OK for many use cases.
>>
>> The one-line curl may need a patch so it can be disabled from config.
>>
>> On Thu, Nov 17, 2016 at 6:29 PM, Walter Underwood wrote:
>>
>>> I agree, it is a bad idea.
>>>
>>> Solr is missing nearly everything you want in a repository, because it is
>>> not designed to be a repository.
>>>
>>> Does not have:
>>>
>>> * access control
>>> * transactions
>>> * transactional backup
>>> * dump and load
>>> * schema migration
>>> * versioning
>>>
>>> And so on.
>>>
>>> Also, I’m glad to share a one-line curl command that will delete all the
>>> documents
>>> in your collection.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>
>>> On Nov 17, 2016, at 1:20 AM, Alexandre Rafalovitch wrote:
>>>
 I've heard of people doing it but it is not recommended.

 One of the biggest implementation breakthroughs is that - after the
 initial learning curve - you will start mapping your input data to
 signals. Those signals will not look very much like your original data
 and therefore are not terribly suitable to be the source of it.

 We are talking copyFields, UpdateRequestProcessor pre-processing,
 fields that are not stored, nested documents flattening,
 denormalization, etc. Getting back from that to the original shape of the
 data is painful.

 Regards,
Alex.
 
 Solr Example reading group is starting November 2016, join us at
 http://j.mp/SolrERG
 Newsletter and resources for Solr beginners and intermediates:
 http://www.solr-start.com/


 On 17 November 2016 at 18:46, Dorian Hoxha wrote:
>>>
> Hi,
>
> Anyone use solr for source-of-data with no `normal` db (of course with
> normal backups/replication)?
>
> Are there any drawbacks?
>
> Thank You
>

>>>
>
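
A minimal SolrJ sketch of that "store only" pattern. The collection and
field names are made up, and it assumes the schema declares the field as
stored but not indexed, e.g.
<field name="raw_json" type="string" indexed="false" stored="true"/>:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class StoredMasterCopy {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            doc.addField("title", "Example product");  // searchable signal
            doc.addField("raw_json",                   // master copy, store-only
                "{\"id\":42,\"title\":\"Example product\",\"price\":9.99}");
            solr.add(doc);
            solr.commit();

            // At query time, deserialize raw_json instead of doing a DB lookup.
            SolrDocument hit = solr.query(new SolrQuery("id:42"))
                                   .getResults().get(0);
            System.out.println(hit.getFieldValue("raw_json"));
        }
    }
}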