Re: Solr vs Lucene

2015-10-02 Thread Mark Fenbers
Thanks for the suggestion, but I've looked at aspell and hunspell and 
neither provides a native Java API.  Further, I already use Solr as a 
search engine, so why not stick with that infrastructure for spelling, 
too?  I think it will work well for me once I figure out the right 
configuration.
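
For reference, here is a minimal SolrJ sketch of the kind of call I have in 
mind (the core name and the /spell handler name are placeholders for my setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SpellDemo {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient("http://localhost:8983/solr/EventLog2");
    SolrQuery q = new SolrQuery("wether");  // misspelled input
    q.setRequestHandler("/spell");          // handler wired to the spellchecker
    q.set("spellcheck", "true");
    QueryResponse rsp = client.query(q);
    SpellCheckResponse sc = rsp.getSpellCheckResponse();
    if (sc != null) {
      sc.getSuggestions().forEach(s ->
          System.out.println(s.getToken() + " -> " + s.getAlternatives()));
    }
    client.close();
  }
}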


Mark

On 10/1/2015 4:16 PM, Walter Underwood wrote:

If you want a spell checker, don’t use a search engine. Use a spell checker. 
Something like aspell (http://aspell.net/) will be faster 
and better than Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)






Re: Zk and Solr Cloud

2015-10-02 Thread Rallavagu

Thanks Shawn.

Right. That is a great insight into the issue. We ended up clearing the 
overseer queue, and then the cloud became normal.


We were running a Solr indexing process and wondered whether that caused 
the queue to grow. Will Solr (the leader) add a work entry to ZooKeeper for 
every update? If not, what are those work entries?


Thanks

On 10/1/15 10:58 PM, Shawn Heisey wrote:

On 10/1/2015 1:26 PM, Rallavagu wrote:

Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.

See following errors in ZK and Solr and they are connected.

When I see the following error in Zookeeper,

unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len11823809 is out of range!


This is usually caused by the overseer queue (stored in zookeeper)
becoming extraordinarily huge, because it's being flooded with work
entries far faster than the overseer can process them.  This causes the
znode where the queue is stored to become larger than the maximum size
for a znode, which defaults to about 1MB.  In this case (reading your
log message that says len11823809), something in zookeeper has gotten to
be 11MB in size, so the zookeeper client cannot read it.

I think the zookeeper server code must be handling the addition of
children to the queue znode through a code path that doesn't pay
attention to the maximum buffer size, just goes ahead and adds it,
probably by simply appending data.  I'm unfamiliar with how the ZK
database works, so I'm guessing here.

If I'm right about where the problem is, there are two workarounds to
your immediate issue.

1) Delete all the entries in your overseer queue using a zookeeper
client that lets you edit the DB directly.  If you haven't changed the
cloud structure and all your servers are working, this should be safe.

2) Set the jute.maxbuffer system property on the startup commandline for
all ZK servers and all ZK clients (Solr instances) to a size that's
large enough to accommodate the huge znode.  In order to do the deletion
mentioned in option 1 above, you might need to increase jute.maxbuffer on
the servers and the client you use for the deletion.

These are just workarounds.  Whatever caused the huge queue in the first
place must be addressed.  It is frequently a performance issue.  If you
go to the following link, you will see that jute.maxbuffer is considered
an unsafe option:

http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#Unsafe+Options

In Jira issue SOLR-7191, I wrote the following in one of my comments:

"The giant queue I encountered was about 850,000 entries, and resulted in
a packet length of a little over 14 megabytes. If I divide 850,000 by 14,
I know that I can have about 60,000 overseer queue entries in one znode
before jute.maxbuffer needs to be increased."

https://issues.apache.org/jira/browse/SOLR-7191?focusedCommentId=14347834
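
For example (sizes are only illustrative), the queue can be cleared with the
stock ZooKeeper CLI:

    zkCli.sh -server localhost:2181
    rmr /overseer/queue

and the buffer can be raised by adding something like -Djute.maxbuffer=52428800
(50 MB) to the startup commandline of every ZK server and every Solr JVM.
Remember that it must be set on both sides, or the client still won't be able
to read the oversized znode.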

Thanks,
Shawn



Re: Reverse query?

2015-10-02 Thread Andrea Roggerone
Hi Remy,
The question is not really clear, could you explain a little bit better
what you need? Reading your email I understand that you want to get
documents containing all the search terms typed. For instance, if you search
for "Mad Max", you want to get documents containing both Mad and Max. If
that's your need, you can use a phrase query like:

"Mad Max"~2

where enclosing your keywords in double quotes means that you want to
get both Mad and Max, and the optional parameter ~2 is an example of *slop*.
If you need more info you can look for *Phrase Query* in
https://wiki.apache.org/solr/SolrRelevancyFAQ
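
With SolrJ the same query would look roughly like this (the client URL and
default field are just placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/films");
SolrQuery q = new SolrQuery("\"Mad Max\"~2"); // phrase query with a slop of 2
q.set("df", "text");                          // placeholder default search field
System.out.println(client.query(q).getResults().getNumFound());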

On Fri, Oct 2, 2015 at 2:33 PM, remi tassing  wrote:

> Hi,
> I have medium-low experience on Solr and I have a question I couldn't quite
> solve yet.
>
> Typically we have quite short query strings (a couple of words) and the
> search is done through a set of bigger documents. What if the logic is
> turned a little bit around. I have a document and I need to find out what
> strings appear in the document. A string here could be a person name
> (including space for example) or a location...which are indexed in Solr.
>
> A concrete example, we take this text from wikipedia (Mad Max):
> "Mad Max is a 1979 Australian dystopian action film directed by George
> Miller. Written by Miller and James McCausland from a story by Miller and
> producer Byron Kennedy, it tells a story of societal breakdown, murder,
> and vengeance. The film, starring the then-little-known Mel Gibson, was
> released internationally in 1980. It became a top-grossing Australian
> film, while holding the record in the Guinness Book of Records for
> decades as the most profitable film ever created,[1] and has been
> credited for further opening the global market to Australian New Wave
> films."
>
> I would like it to match "Mad Max" but not "Mad" or "Max" separately, and
> "George Miller", "global market" ...
>
> I've tried the keywordTokenizer but it didn't work. I suppose it's ok for
> the index time but not query time (in this specific case)
>
> I had a look at Luwak but it's not what I'm looking for (
>
> http://www.flax.co.uk/blog/2013/12/06/introducing-luwak-a-library-for-high-performance-stored-queries/
> )
>
> The typical name search doesn't seem to work either,
> https://dzone.com/articles/tips-name-search-solr
>
> I was thinking this problem must have already be solved...or?
>
> Remi
>


Reverse query?

2015-10-02 Thread remi tassing
Hi,
I have medium-low experience with Solr and I have a question I couldn't
quite solve yet.

Typically we have quite short query strings (a couple of words) and the
search is done through a set of bigger documents. What if the logic is
turned a little bit around. I have a document and I need to find out what
strings appear in the document. A string here could be a person name
(including space for example) or a location...which are indexed in Solr.

A concrete example, we take this text from wikipedia (Mad Max):
"Mad Max is a 1979 Australian dystopian action film directed by George
Miller. Written by Miller and James McCausland from a story by Miller and
producer Byron Kennedy, it tells a story of societal breakdown, murder,
and vengeance. The film, starring the then-little-known Mel Gibson, was
released internationally in 1980. It became a top-grossing Australian
film, while holding the record in the Guinness Book of Records for
decades as the most profitable film ever created,[1] and has been
credited for further opening the global market to Australian New Wave
films."

I would like it to match "Mad Max" but not "Mad" or "Max" separately, and
"George Miller", "global market" ...

I've tried the KeywordTokenizer but it didn't work. I suppose it's OK at
index time but not at query time (in this specific case).

I had a look at Luwak but it's not what I'm looking for (
http://www.flax.co.uk/blog/2013/12/06/introducing-luwak-a-library-for-high-performance-stored-queries/
)

The typical name search doesn't seem to work either,
https://dzone.com/articles/tips-name-search-solr

I was thinking this problem must have already been solved... or?

Remi


RE: Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd

2015-10-02 Thread Adrian Liew
Hi Edwin,

I have followed the standards recommended by the Zookeeper article. It seems to 
be working.

Incidentally, I am facing intermittent issues whereby I am unable to connect to 
the Zookeeper service via Solr's zkCli.bat command, even after setting my 
ZooKeeper service to start up automatically. I have configured nssm (the 
non-sucking service manager) to auto-start Solr with a dependency on 
Zookeeper, to ensure both services are running on startup for each Solr VM. 

Here is an example of what I tried to run to connect to the ZK service:

E:\solr-5.3.0\server\scripts\cloud-scripts>zkcli.bat -z 10.0.0.6:2183 -cmd list
Exception in thread "main" org.apache.solr.common.SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 10.0.0.6:2183 within 30000 ms
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:181)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:115)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:105)
        at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:181)
Caused by: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 10.0.0.6:2183 within 30000 ms
        at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:208)
        at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:173)
        ... 3 more


Further to this, I inspected the output shown in the console window by zkServer.cmd:

2015-10-02 08:24:09,305 [myid:3] - WARN  [WorkerSender[myid=3]:QuorumCnxManager@382] - Cannot open channel to 2 at election address /10.0.0.5:3888
java.net.SocketTimeoutException: connect timed out
        at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
        at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
        at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
        at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
        at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
        at java.net.PlainSocketImpl.connect(Unknown Source)
        at java.net.SocksSocketImpl.connect(Unknown Source)
        at java.net.Socket.connect(Unknown Source)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
        at java.lang.Thread.run(Unknown Source)
2015-10-02 08:24:09,305 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@597] - Notification: 1 (message format version), 3 (n.leader), 0x70011 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x7 (n.peerEpoch) LOOKING (my state)

I noticed the error message from zkServer.cmd: "Cannot open channel to 2 at 
election address /10.0.0.5:3888".

Can firewall settings be the issue here? I feel this may be a network issue 
between the individual Solr VMs. I am using a Windows Server 2012 R2 64 bit 
environment to run Zookeeper 3.4.6 and Solr 5.3.0.

Currently, I have configured the firewall rules below for each Azure VM 
(Phoenix-Solr-0, Phoenix-Solr-1, Phoenix-Solr-2) in the Firewall Advanced 
Security Settings:

For allowed inbound connections:

Solr port 8983
ZK1 port 2181
ZK2 port 2888
ZK3 port 3888

Regards,
Adrian

-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] 
Sent: Friday, October 2, 2015 11:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd

Hi Adrian,

What is the setup of your system like? By right it shouldn't be an issue if we 
use different ports.

In fact, if the various zookeeper instances are running on a single machine, 
they have to be on different ports in order for it to work.


Regards,
Edwin



On 1 October 2015 at 18:19, Adrian Liew  wrote:

> Hi all,
>
> The problem below was resolved by appropriately setting my server ip 
> addresses to have the following for each zoo.cfg:
>
> server.1=10.0.0.4:2888:3888
> server.2=10.0.0.5:2888:3888
> server.3=10.0.0.6:2888:3888
>
> as opposed to the following:
>
> server.1=10.0.0.4:2888:3888
> server.2=10.0.0.5:2889:3889
> server.3=10.0.0.6:2890:3890
>
> I am not sure why the above can be an issue (by right it should not), 
> however I followed the recommendations provided by Zookeeper 
> administration guide under RunningReplicatedZookeeper ( 
> https://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html#sc_RunningReplicatedZooKeeper
> )
>
> Given that I am testing multiple servers in a multiserver environment, 
> it will be safe to use 

Re: Facet queries blow out the filterCache

2015-10-02 Thread Charlie Hull

On 01/10/2015 23:31, Jeff Wartes wrote:

It still inserts if I address the core directly and use distrib=false.

I’ve got a few collections sharing the same config, so it’s surprisingly
annoying to change solrconfig.xml right now, but it seemed pretty clear
the query is the thing being cached, since the cache size only changes
when the query does.


Hi Jeff,

I think you may be hitting the same issue we found:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201409.mbox/%3ccage-mlj+6y1at+ounk3sgacff6zgtjq_nin9_3shn0kfuqx...@mail.gmail.com%3E

Distributed faceting uses the filter cache, where you wouldn't expect it 
to. The solution was to set facet.limit to -1.


Best

Charlie




On 10/1/15, 3:01 PM, "Mikhail Khludnev"  wrote:


hm..
This option was useful for introspecting cache content
https://wiki.apache.org/solr/SolrCaching#showItems It might help you to
find out a cause.
I'm still blaming distributed requests; it's explained here:
https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Over-RequestParameters
E.g., does it happen if you run with distrib=false?

On Fri, Oct 2, 2015 at 12:27 AM, Jeff Wartes 
wrote:



No change, still shows an insert per-request. As does a simplified request
with only the facet params "&facet.field=city&facet=true".


by default it's 100:
https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Thefacet.limitParameter
and it can cause filtering by values; it can be seen in the logs, btw.



It’s definitely facet related though, facet=false eliminates the insert.



On 10/1/15, 1:50 PM, "Mikhail Khludnev" 
wrote:


what if you set f.city.facet.limit=-1 ?

On Thu, Oct 1, 2015 at 7:43 PM, Jeff Wartes 
wrote:



I’m doing some fairly simple facet queries in a two-shard 5.3 SolrCloud
index on fields like this:




q=...&fl=id,score&facet.field=city&facet=true&facet.mincount=1&f.city.facet.limit=50&rows=0&start=0&facet.method=fc

(no, NOT facet.method=enum - the usage of the filterCache there is
pretty
well documented)

Watching the filterCache stats, it appears that every one of these queries
causes the "inserts" counter to be incremented by one. Distinct "q="
queries also increase the "size", and eviction happens as normal. If I
repeat the same query a few times, "lookups" is not incremented, so these
entries generally appear to be completely wasted. (Although when running a
lot of these queries, it appears as though a very small set also increment
the "lookups" counter, but only a small set, and I haven’t figured out why
some are special.)

So the question is, why does this facet query have anything to do with
the filterCache? This causes a huge amount of filterCache churn with no
apparent benefit.





--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics









--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics








--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Facet queries blow out the filterCache

2015-10-02 Thread Toke Eskildsen
On Thu, 2015-10-01 at 22:31 +, Jeff Wartes wrote:
> It still inserts if I address the core directly and use distrib=false.

It is quite strange that it is triggered with direct access. If that
can be reproduced in a test, it looks like there is a performance
optimization to be done.

Anyway, operating under the assumption that the single-core facet
request for some reason acts as a distributed call, the key to avoiding
the fine-counting is to ensure that _all_ possibly relevant term counts
have been returned in the first facet phase.

Try setting both facet.mincount=0 and facet.limit=-1.
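
As a sketch in SolrJ (field name taken from the thread; imports and client
setup as usual):

SolrQuery q = new SolrQuery("*:*");
q.setRows(0);
q.setFacet(true);
q.addFacetField("city");
q.setFacetMinCount(0);  // facet.mincount=0
q.setFacetLimit(-1);    // facet.limit=-1: all term counts in the first phase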

- Toke Eskildsen, State and University Library, Denmark




Re: Solr 4.7.2 Vs 5.3.0 Docs different for same query

2015-10-02 Thread Ravi Solr
Mr. Uchida,
Thank you for responding. It was my fault: I had an update processor
which takes specific text and string fields and concatenates them into a
single field, and I search on that single field. Recently I used an atomic
update to fix a specific field's value and forgot to disable the
UpdateProcessor chain... Since I was only updating one field, the aggregate
field got overwritten with just that field's value, and hence I had issues
searching. I reindexed the data again last night and now it is all good.

I do have a small question: when we update the zookeeper ensemble with new
configs via the 'upconfig' and 'linkconfig' commands, do we have to "reload" the
collections on all the nodes to see the updated config?? Is there a single
call which can update all nodes connected to the ensemble?? I just went to
the admin UI and hit the "Reload" button manually on each of the nodes... Is
that the correct way to do it?

Thanks

Ravi Kiran Bhaskar

On Fri, Oct 2, 2015 at 12:04 AM, Tomoko Uchida  wrote:

> Are you sure that you've indexed same data to Solr 4.7.2 and 5.3.0 ?
> If so, I suspect that you have multiple shards and request to one shard.
> (In that case, you might get partial results)
>
> Can you share HTTP request url and the schema and default search field ?
>
>
> 2015-10-02 6:09 GMT+09:00 Ravi Solr :
>
> > We migrated from 4.7.2 to 5.3.0. I sourced the docs from the 4.7.2 core and
> > indexed them into the 5.3.0 collection (data directories are different) via
> > SolrEntityProcessor. Currently my production is all whack because of this
> > issue. Do I have to go back and reindex all again ?? Is there a quick fix
> > for this ?
> >
> > Here are the results for the query 'obama'...please note the numFound.
> > 4.7.2 finds almost 148519 docs while 5.3.0 reports far fewer. Any
> > pointers on how to correct this ?
> >
> >
> > Solr 4.7.2
> >
> > responseHeader: status=0, QTime=2; params: q=obama, start=0
> > result: numFound=148519
> >
> > SolrCloud 5.3.0
> >
> > responseHeader: status=0, QTime=2; params: q=obama, start=0
> > result: numFound=...
> >
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
>


Re: Zk and Solr Cloud

2015-10-02 Thread Ravi Solr
Awesome nugget, Shawn. I also faced a similar issue a while ago while I was
doing a full re-index. It would be great if such tips were added to FAQ-type
documentation on cwiki. I love the SOLR forum; every day I learn
something new :-)

Thanks

Ravi Kiran Bhaskar

On Fri, Oct 2, 2015 at 1:58 AM, Shawn Heisey  wrote:

> On 10/1/2015 1:26 PM, Rallavagu wrote:
> > Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.
> >
> > See following errors in ZK and Solr and they are connected.
> >
> > When I see the following error in Zookeeper,
> >
> > unexpected error, closing socket connection and attempting reconnect
> > java.io.IOException: Packet len11823809 is out of range!
>
> This is usually caused by the overseer queue (stored in zookeeper)
> becoming extraordinarily huge, because it's being flooded with work
> entries far faster than the overseer can process them.  This causes the
> znode where the queue is stored to become larger than the maximum size
> for a znode, which defaults to about 1MB.  In this case (reading your
> log message that says len11823809), something in zookeeper has gotten to
> be 11MB in size, so the zookeeper client cannot read it.
>
> I think the zookeeper server code must be handling the addition of
> children to the queue znode through a code path that doesn't pay
> attention to the maximum buffer size, just goes ahead and adds it,
> probably by simply appending data.  I'm unfamiliar with how the ZK
> database works, so I'm guessing here.
>
> If I'm right about where the problem is, there are two workarounds to
> your immediate issue.
>
> 1) Delete all the entries in your overseer queue using a zookeeper
> client that lets you edit the DB directly.  If you haven't changed the
> cloud structure and all your servers are working, this should be safe.
>
> 2) Set the jute.maxbuffer system property on the startup commandline for
> all ZK servers and all ZK clients (Solr instances) to a size that's
> large enough to accommodate the huge znode.  In order to do the deletion
> mentioned in option 1 above, you might need to increase jute.maxbuffer on
> the servers and the client you use for the deletion.
>
> These are just workarounds.  Whatever caused the huge queue in the first
> place must be addressed.  It is frequently a performance issue.  If you
> go to the following link, you will see that jute.maxbuffer is considered
> an unsafe option:
>
> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#Unsafe+Options
>
> In Jira issue SOLR-7191, I wrote the following in one of my comments:
>
> "The giant queue I encountered was about 850,000 entries, and resulted in
> a packet length of a little over 14 megabytes. If I divide 850,000 by 14,
> I know that I can have about 60,000 overseer queue entries in one znode
> before jute.maxbuffer needs to be increased."
>
> https://issues.apache.org/jira/browse/SOLR-7191?focusedCommentId=14347834
>
> Thanks,
> Shawn
>
>


Re: Reverse query?

2015-10-02 Thread Ravi Solr
Hello Remi,
I am assuming the field where you store the data is analyzed.
The field definition might help us answer your question better. If you are
using the edismax handler for your search requests, I believe you can achieve
your goal by setting your "mm" to 100% and the phrase slop "ps" and query slop
"qs" parameters to zero. I think that will force exact matches.

Thanks

Ravi Kiran Bhaskar

On Fri, Oct 2, 2015 at 9:48 AM, Andrea Roggerone <
andrearoggerone.o...@gmail.com> wrote:

> Hi Remy,
> The question is not really clear, could you explain a little bit better
> what you need? Reading your email I understand that you want to get
> documents containing all the search terms typed. For instance if you search
> for "Mad Max", you wanna get documents containing both Mad and Max. If
> that's your need, you can use a phrase query like:
>
> "Mad Max"~2
>
> where enclosing your keywords between double quotes means that you want to
> get both Mad and Max and the optional parameter ~2 is an example of *slop*.
> If you need more info you can look for *Phrase Query* in
> https://wiki.apache.org/solr/SolrRelevancyFAQ
>
> On Fri, Oct 2, 2015 at 2:33 PM, remi tassing 
> wrote:
>
> > Hi,
> > I have medium-low experience on Solr and I have a question I couldn't
> quite
> > solve yet.
> >
> > Typically we have quite short query strings (a couple of words) and the
> > search is done through a set of bigger documents. What if the logic is
> > turned a little bit around. I have a document and I need to find out what
> > strings appear in the document. A string here could be a person name
> > (including space for example) or a location...which are indexed in Solr.
> >
> > A concrete example, we take this text from wikipedia (Mad Max):
> > "Mad Max is a 1979 Australian dystopian action film directed by George
> > Miller. Written by Miller and James McCausland from a story by Miller
> > and producer Byron Kennedy, it tells a story of societal breakdown,
> > murder, and vengeance. The film, starring the then-little-known Mel
> > Gibson, was released internationally in 1980. It became a top-grossing
> > Australian film, while holding the record in the Guinness Book of
> > Records for decades as the most profitable film ever created,[1] and
> > has been credited for further opening the global market to Australian
> > New Wave films."
> >
> > I would like it to match "Mad Max" but not "Mad" or "Max" separately, and
> > "George Miller", "global market" ...
> >
> > I've tried the keywordTokenizer but it didn't work. I suppose it's ok for
> > the index time but not query time (in this specific case)
> >
> > I had a look at Luwak but it's not what I'm looking for (
> >
> >
> http://www.flax.co.uk/blog/2013/12/06/introducing-luwak-a-library-for-high-performance-stored-queries/
> > )
> >
> > The typical name search doesn't seem to work either,
> > https://dzone.com/articles/tips-name-search-solr
> >
> > I was thinking this problem must have already be solved...or?
> >
> > Remi
> >
>


NullPointerException

2015-10-02 Thread Mark Fenbers

Greetings!

Attached is a snippet from solrconfig.xml pertaining to my spellcheck 
efforts.  When I use the Admin UI (v5.3.0), and check the 
spellcheck.build box, I get a NullPointerException stacktrace.  The 
actual stacktrace is at the bottom of the attachment.  The 
FileBasedSpellChecker.build is clearly the problem, but I cannot figure 
out why.  /usr/share/dict/words exists and has global read permissions.  
I displayed the file and see no issues (i.e., one word per line), 
although some "words" are strings of digits, but that shouldn't matter.


Does my snippet give any clues about why I would get this error? Is my 
stripped down configuration missing something, perhaps?


Mark
  
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_en</str>
  <lst name="spellchecker">
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="field">logtext</str>
    <str name="name">FileDict</str>
    <str name="sourceLocation">/usr/share/dict/words</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">/localapps/dev/EventLog/solr/EventLog2/data/spFile</str>
  </lst>
</searchComponent>

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="spellcheck.dictionary">FileDict</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>


"trace": "java.lang.NullPointerException\n\tat 
org.apache.lucene.search.spell.SpellChecker.indexDictionary(SpellChecker.java:509)\n\tat 
org.apache.solr.spelling.FileBasedSpellChecker.build(FileBasedSpellChecker.java:74)\n\tat 
org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:124)\n\tat 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:251)\n\tat 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat 
org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)\n\tat 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)\n\tat 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)\n\tat 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)\n\tat 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)\n\tat 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)\n\tat 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)\n\tat 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)\n\tat 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)\n\tat 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)\n\tat 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)\n\tat 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)\n\tat 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)\n\tat 
org.eclipse.jetty.server.Server.handle(Server.java:499)\n\tat 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)\n\tat 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)\n\tat 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)\n\tat 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)\n\tat 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)\n\tat 
java.lang.Thread.run(Thread.java:745)\n",


Re: Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd

2015-10-02 Thread Erick Erickson
Hmmm, there are usually a couple of ports that each ZK instance needs,
is it possible that
you've got more than one process using one of those ports?

By default (I think), zookeeper uses "peer port + 1000" for its leader
election process, see:
https://zookeeper.apache.org/doc/r3.3.3/zookeeperStarted.html
the "Running Replicated Zookeeper" section.

I'm not quite clear whether the above ZK2 port and ZK3 port are just
meant to indicate a single Zookeeper instance on a node or not, so I
thought I'd check.

Firewalls should always fail, not intermittently, so I'm puzzled about that.

Best,
Erick

On Fri, Oct 2, 2015 at 1:33 AM, Adrian Liew  wrote:
> Hi Edwin,
>
> I have followed the standards recommended by the Zookeeper article. It seems 
> to be working.
>
> Incidentally, I am facing intermittent issues whereby I am unable to connect 
> to Zookeeper service via Solr's zkCli.bat command, even after having setting 
> automatic startup of my ZooKeeper service. I have basically configured 
> (non-sucking-service-manager) nssm to auto start Solr with a dependency of 
> Zookeeper to ensure both services are running on startup for each Solr VM.
>
> Here is an example of what I tried to run to connect to the ZK service:
>
> E:\solr-5.3.0\server\scripts\cloud-scripts>zkcli.bat -z 10.0.0.6:2183 -cmd list
> Exception in thread "main" org.apache.solr.common.SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 10.0.0.6:2183 within 30000 ms
>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:181)
>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:115)
>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:105)
>         at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:181)
> Caused by: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper 10.0.0.6:2183 within 30000 ms
>         at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:208)
>         at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:173)
> ... 3 more
>
>
> Further to this, I inspected the output shown in the console window by 
> zkServer.cmd:
>
> 2015-10-02 08:24:09,305 [myid:3] - WARN  [WorkerSender[myid=3]:QuorumCnxManager@382] - Cannot open channel to 2 at election address /10.0.0.5:3888
> java.net.SocketTimeoutException: connect timed out
>         at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
>         at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
>         at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
>         at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
>         at java.net.PlainSocketImpl.connect(Unknown Source)
>         at java.net.SocksSocketImpl.connect(Unknown Source)
>         at java.net.Socket.connect(Unknown Source)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
>         at java.lang.Thread.run(Unknown Source)
> 2015-10-02 08:24:09,305 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@597] - Notification: 1 (message format version), 3 (n.leader), 0x70011 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x7 (n.peerEpoch) LOOKING (my state)
>
> I noticed the error message by zkServer.cmd as Cannot open channel to 2 at 
> election address /10.0.0.5:3888
>
> Can firewall settings be the issue here? I feel this may be a network issue 
> between the individual Solr VMs. I am using a Windows Server 2012 R2 64 bit 
> environment to run Zookeeper 3.4.6 and Solr 5.3.0.
>
> Currently, I have setup my firewalls in the Advanced Configuration Firewall 
> Settings as below:
>
> As for the Firewall settings I have configured the below for each Azure VM 
> (Phoenix-Solr-0, Phoenix-Solr-1, Phoenix-Solr-2) in the Firewall Advanced 
> Security Settings:
>
> For allowed inbound connections:
>
> Solr port 8983
> ZK1 port 2181
> ZK2 port 2888
> ZK3 port 3888
>
> Regards,
> Adrian
>
> -Original Message-
> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
> Sent: Friday, October 2, 2015 11:03 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd
>
> Hi Adrian,
>
> How is your setup of your system like? By right it shouldn't be an issue if 
> we use different ports.
>
> in fact, if the various zookeeper instance are running on a single machine, 
> 

Re: Zk and Solr Cloud

2015-10-02 Thread Erick Erickson
Rallavagu:

Absent nodes going up and down or otherwise changing state, Zookeeper
isn't involved in the normal operations of Solr (adding docs,
querying, all that). That said, things that change the state of the
Solr nodes _do_ involve Zookeeper and the Overseer. The Overseer is
used to serialize and control changing information in the
clusterstate.json (or state.json) and others. If the nodes all tried
to write to Zk directly, it's hard to coordinate. That's a little
simplistic and counterintuitive, but maybe this will help.

When a Solr instance starts up it
1> registers itself as live with ZK
2> creates a listener that ZK pings when there's a state change (some
node goes up or down, goes into recovery, gets added, whatever).
3> gets the current cluster state from ZK.

Thereafter, this particular node doesn't need to ask ZK for anything.
It knows the current topology of cluster and can route requests (index
or query) to the correct Solr replica etc.

Now, let's claim that "something changes". Solr stops on one of the
nodes. Or someone adds a collection. Or... The overseer usually gets
involved in changing the state on ZK for this new action. Part of that
is that ZK sends an event to all the Solr nodes that have registered
themselves as listeners that causes them to ask ZK for the current
state of the cluster, and each Solr node adjusts its actions based on
this information. Note the kind of thing here that changes and
triggers this is that a whole replica becomes able or unable to carry
out its functions, NOT that some collection gets another doc added
or answers a query.

Zk also periodically pings each Solr instance that's registered itself
and, if the node fails to respond, may force it into recovery, etc.
Again, though, that has nothing to do with standard Solr operations.

So a massive overseer queue tends to indicate that there's a LOT of
state changes, lots of nodes going up and down etc. One implication of
the above is that if you turn on all your nodes in a large cluster at
the same time, there'll be a LOT of activity; they'll all register
themselves, try to elect leaders for shards, to into/out of recovery,
become active, all these are things that trigger overseer activity.

Or there are simply bugs in how the overseer works in the version
you're using, I know there's been a lot of effort to harden that area
over the various versions.

Two things that are "interesting".
1> Only one of your Solr instances hosts the overseer. If you're doing
a restart of _all_ your boxes, it's advisable to bounce the node
that's the overseer _last_. Otherwise you risk an odd situation: the
overseer is elected and starts to work, that node restarts, which causes
the overseer role to switch to another node, which is immediately
bounced, and a new overseer is elected, and...

2> As of 5.x, there are two ZK formats
a> the "old" format where the entire clusterstate for all collections
is kept in a single ZK node (/clusterstate.json)
b> the "new" format where each collection has its own state.json that
only contains the state for that collection.

This is very helpful when you have many collections. In the a> case, any
time _any_ node changes, _all_ nodes have to get a new state. In b>,
only the nodes involved in a single collection need to get new
information when any node in _that_ collection changes.
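
To make the client side concrete, here's a rough SolrJ sketch (ZK addresses
and collection name are hypothetical):

CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
client.setDefaultCollection("collection1");
client.connect();  // registers watches and pulls the current cluster state from ZK
// From here on, queries and updates are routed from the cached state;
// ZK is only consulted again when a watch fires on a state change.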

FWIW,
Erick



On Fri, Oct 2, 2015 at 8:03 AM, Ravi Solr  wrote:
> Awesome nugget Shawn, I also faced similar issue a while ago while i was
> doing a full re-index. It would be great if such tips are added into FAQ
> type documentation on cwiki. I love the SOLR forum everyday I learn
> something new :-)
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Fri, Oct 2, 2015 at 1:58 AM, Shawn Heisey  wrote:
>
>> On 10/1/2015 1:26 PM, Rallavagu wrote:
>> > Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.
>> >
>> > See following errors in ZK and Solr and they are connected.
>> >
>> > When I see the following error in Zookeeper,
>> >
>> > unexpected error, closing socket connection and attempting reconnect
>> > java.io.IOException: Packet len11823809 is out of range!
>>
>> This is usually caused by the overseer queue (stored in zookeeper)
>> becoming extraordinarily huge, because it's being flooded with work
>> entries far faster than the overseer can process them.  This causes the
>> znode where the queue is stored to become larger than the maximum size
>> for a znode, which defaults to about 1MB.  In this case (reading your
>> log message that says len11823809), something in zookeeper has gotten to
>> be 11MB in size, so the zookeeper client cannot read it.
>>
>> I think the zookeeper server code must be handling the addition of
>> children to the queue znode through a code path that doesn't pay
>> attention to the maximum buffer size, just goes ahead and adds it,
>> probably by simply appending data.  I'm unfamiliar with how the ZK
>> database works, so I'm guessing here.
>>
>> 

Re: Solr 4.7.2 Vs 5.3.0 Docs different for same query

2015-10-02 Thread Erick Erickson
do we have to "reload" the collections on all the nodes to see the
updated config ??
YES

Is there a single call which can update all nodes connected to the ensemble ??

NO. I'll be a little pedantic here. When you say "ensemble", I'm not quite sure
what that means and am interpreting it as "all collections registered with ZK".
But see below.

I just went to the admin UI and hit "Reload" button manually on each
of the node...Is that
the correct way to do it ?

NO. The admin UI, "core admin" is a remnant from the old days (like
3.x) where there was
no concept of distributed collection as a distinct entity, you had to
do all the things you now
do automatically in SolrCloud "by hand". PLEASE DO NOT USE THIS
EXCEPT TO VIEW A REPLICA WHEN USING SOLRCLOUD! In particular, don't try to
take any action that manipulates the core (reload, add, unload and the like).
It'll work, but you have to know _exactly_ what you are doing. Go
ahead and use it for
viewing the current state of a replica/core, but unless you need to do
something that
you cannot do with the Collections API it's very easy to go astray.


Instead, use the "collections API". In this case, there's a call like

http://localhost:8983/solr/admin/collections?action=RELOAD&name=CollectionName

that will cause all the replicas associated with the collection to be
reloaded. Given you
mentioned linkconfig, I'm guessing that you have more than one
collection looking at a
particular configset, so the pedantic bit is you'd have to issue the
above for each
collection that references that configset.
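
If you'd rather script it, the same reload is available through SolrJ (a
sketch against the 5.x API; the collection name is a placeholder):

CollectionAdminRequest.Reload reload = new CollectionAdminRequest.Reload();
reload.setCollectionName("CollectionName");
reload.process(client);  // client is any SolrClient pointed at the cluster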

Best,
Erick

P.S. Two bits:
1> actually the collections API uses the core admin calls to
accomplish its tasks, but lots of effort went into doing exactly the
right thing
2> Upayavira has been creating an updated admin UI that will treat
collections as
first-class citizens (a work in progress). You can access it in 5.x by hitting

solr_host:solr_port/solr/index.html

Give it a whirl if you can and please provide any feedback you can, it'd be much
appreciated.

On Fri, Oct 2, 2015 at 7:47 AM, Ravi Solr  wrote:
> Mr. Uchida,
> Thank you for responding. It was my fault, I had a update processor
> which takes specific text and string fields and concatenates them into a
> single field, and I search on that single field. Recently I used Atomic
> update to fix a specific field's value and forgot to disable the
> UpdateProcessor chain...Since I was only updating one field the aggregate
> field got messed up with just that field value and hence I had issues
> searching. I reindexed the data again yesterday night and now it is all
> good.
>
> I do have a small question, when we update the zookeeper ensemble with new
> configs via 'upconfig' and 'linkconfig' commands do we have to "reload" the
> collections on all the nodes to see the updated config ?? Is there a single
> call which can update all nodes connected to the ensemble ?? I just went to
> the admin UI and hit "Reload" button manually on each of the node...Is that
> the correct way to do it ?
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Fri, Oct 2, 2015 at 12:04 AM, Tomoko Uchida  wrote:
>
>> Are you sure that you've indexed same data to Solr 4.7.2 and 5.3.0 ?
>> If so, I suspect that you have multiple shards and request to one shard.
>> (In that case, you might get partial results)
>>
>> Can you share HTTP request url and the schema and default search field ?
>>
>>
>> 2015-10-02 6:09 GMT+09:00 Ravi Solr :
>>
>> > We migrated from 4.7.2 to 5.3.0. I sourced the docs from the 4.7.2 core and
>> > indexed into 5.3.0 collection (data directories are different) via
>> > SolrEntityProcessor. Currently my production is all whack because of this
>> > issue. Do I have to go back and reindex all again ?? Is there a quick fix
>> > for this ?
>> >
>> > Here are the results for the query 'obama'...please note the numFound.
>> > 4.7.2 finds almost 148519 docs while 5.3.0 reports far fewer. Any
>> > pointers on how to correct this ?
>> >
>> >
>> > Solr 4.7.2
>> >
>> > responseHeader: status=0, QTime=2; params: q=obama, start=0
>> > result: numFound=148519
>> >
>> > SolrCloud 5.3.0
>> >
>> > responseHeader: status=0, QTime=2; params: q=obama, start=0
>> > result: numFound=...
>> >
>> >
>> > Thanks
>> >
>> > Ravi Kiran Bhaskar
>> >
>>


Re: Reverse query?

2015-10-02 Thread Erick Erickson
The admin/analysis page is your friend here, find it and use it ;)
Note you have to select a core on the admin UI screen before you can
see the choice.

Because apart from the other comments, KeywordTokenizer is a red flag.
It does NOT break anything up into tokens, so if your doc contains:
Mad Max is a 1979 Australian
as the whole field, the _only_ match you'll ever get is if you search exactly
"Mad Max is a 1979 Australian"
Not Mad, not mad, not Max, exactly all 6 words separated by exactly one space.

Andrea's suggestion is the one you want, but be sure you use one of
the tokenizing analysis chains, perhaps start with text_en (in the
stock distro). Be sure to completely remove your node/data directory
(as in rm -rf data) after you make the change.

And really, explore the admin/analysis page; it's where a LOT of these
kinds of problems find solutions ;)

Best,
Erick

On Fri, Oct 2, 2015 at 7:57 AM, Ravi Solr  wrote:
> Hello Remi,
> Iam assuming the field where you store the data is analyzed.
> The field definition might help us answer your question better. If you are
> using edismax handler for your search requests, I believe you can achieve
> you goal by setting set your "mm" to 100%, phrase slop "ps" and query slop
> "qs" parameters to zero. I think that will force exact matches.
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Fri, Oct 2, 2015 at 9:48 AM, Andrea Roggerone <
> andrearoggerone.o...@gmail.com> wrote:
>
>> Hi Remy,
>> The question is not really clear, could you explain a little bit better
>> what you need? Reading your email I understand that you want to get
>> documents containing all the search terms typed. For instance if you search
>> for "Mad Max", you wanna get documents containing both Mad and Max. If
>> that's your need, you can use a phrase query like:
>>
>> "Mad Max"~2
>>
>> where enclosing your keywords between double quotes means that you want to
>> get both Mad and Max and the optional parameter ~2 is an example of *slop*.
>> If you need more info you can look for *Phrase Query* in
>> https://wiki.apache.org/solr/SolrRelevancyFAQ
>>
>> On Fri, Oct 2, 2015 at 2:33 PM, remi tassing 
>> wrote:
>>
>> > Hi,
>> > I have medium-low experience on Solr and I have a question I couldn't
>> quite
>> > solve yet.
>> >
>> > Typically we have quite short query strings (a couple of words) and the
>> > search is done through a set of bigger documents. What if the logic is
>> > turned a little bit around. I have a document and I need to find out what
>> > strings appear in the document. A string here could be a person name
>> > (including space for example) or a location...which are indexed in Solr.
>> >
>> > A concrete example, we take this text from wikipedia (Mad Max):
>> > "Mad Max is a 1979 Australian dystopian action film directed by George
>> > Miller. Written by Miller and James McCausland from a story by Miller
>> > and producer Byron Kennedy, it tells a story of societal breakdown,
>> > murder, and vengeance. The film, starring the then-little-known Mel
>> > Gibson, was released internationally in 1980. It became a top-grossing
>> > Australian film, while holding the record in the Guinness Book of
>> > Records for decades as the most profitable film ever created,[1] and
>> > has been credited for further opening the global market to Australian
>> > New Wave films."
>> >
>> > I would like it to match "Mad Max" but not "Mad" or "Max" separately, and
>> > "George Miller", "global market" ...
>> >
>> > I've tried the keywordTokenizer but it didn't work. I suppose it's ok for
>> > the index time but not query time (in this specific case)
>> >
>> > I had a look at Luwak but it's not what I'm looking for (
>> >
>> >
>> http://www.flax.co.uk/blog/2013/12/06/introducing-luwak-a-library-for-high-performance-stored-queries/
>> > )
>> >
>> > The typical name search doesn't seem to work either,
>> > https://dzone.com/articles/tips-name-search-solr
>> >
>> > I was thinking this problem must have already be solved...or?
>> >
>> > Remi
>> >
>>


Re: Reverse query?

2015-10-02 Thread Roman Chyla
I'd like to offer another option:

you say you want to match a long query against a document - but maybe you
won't know whether to pick "Mad Max" or "Max is" (not to mention the
performance hit of a "*mad max*" search - or is that not the case
anymore?). Take a look at the NGram tokenizer (say, a size of 2, or
bigger). What it does is split the input into overlapping segments
of 'X' words (words, not characters - however, characters work too -
just pick a bigger N):

mad max
max 1979
1979 australian

I'd recommend placing a stopfilter before the ngram - then for the long
query string of "Hey Mad Max is 1979" you would search "hey mad" OR
"mad max" OR "max 1979"... (perhaps the query tokenizer could be
convinced to do the search for you automatically). And voila, the more
overlapping segments there are, the higher the search result.
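
In Lucene/Solr terms these word n-grams are shingles - in schema.xml you'd
combine solr.StopFilterFactory with solr.ShingleFilterFactory. A rough
Lucene sketch of the chain, just to show the idea:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer shingles = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer src = new StandardTokenizer();
    TokenStream ts = new LowerCaseFilter(src);
    ts = new StopFilter(ts, StandardAnalyzer.STOP_WORDS_SET);
    ShingleFilter shingle = new ShingleFilter(ts, 2, 2);
    shingle.setOutputUnigrams(false);  // emit only the overlapping 2-word segments
    return new TokenStreamComponents(src, shingle);
  }
};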

hth,

roman



On Fri, Oct 2, 2015 at 12:03 PM, Erick Erickson  wrote:
> The admin/analysis page is your friend here, find it and use it ;)
> Note you have to select a core on the admin UI screen before you can
> see the choice.
>
> Because apart from the other comments, KeywordTokenizer is a red flag.
> It does NOT break anything up into tokens, so if your doc contains:
> Mad Max is a 1979 Australian
> as the whole field, the _only_ match you'll ever get is if you search exactly
> "Mad Max is a 1979 Australian"
> Not Mad, not mad, not Max, exactly all 6 words separated by exactly one space.
>
> Andrea's suggestion is the one you want, but be sure you use one of
> the tokenizing analysis chains, perhaps start with text_en (in the
> stock distro). Be sure to completely remove your node/data directory
> (as in rm -rf data) after you make the change.
>
> And really, explore the admin/analysis page; it's where a LOT of these
> kinds of problems find solutions ;)
>
> Best,
> Erick
>
> On Fri, Oct 2, 2015 at 7:57 AM, Ravi Solr  wrote:
>> Hello Remi,
>> Iam assuming the field where you store the data is analyzed.
>> The field definition might help us answer your question better. If you are
>> using edismax handler for your search requests, I believe you can achieve
>> you goal by setting set your "mm" to 100%, phrase slop "ps" and query slop
>> "qs" parameters to zero. I think that will force exact matches.
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Fri, Oct 2, 2015 at 9:48 AM, Andrea Roggerone <
>> andrearoggerone.o...@gmail.com> wrote:
>>
>>> Hi Remy,
>>> The question is not really clear, could you explain a little bit better
>>> what you need? Reading your email I understand that you want to get
>>> documents containing all the search terms typed. For instance if you search
>>> for "Mad Max", you wanna get documents containing both Mad and Max. If
>>> that's your need, you can use a phrase query like:
>>>
>>> "Mad Max"~2
>>>
>>> where enclosing your keywords between double quotes means that you want to
>>> get both Mad and Max and the optional parameter ~2 is an example of *slop*.
>>> If you need more info you can look for *Phrase Query* in
>>> https://wiki.apache.org/solr/SolrRelevancyFAQ
>>>
>>> On Fri, Oct 2, 2015 at 2:33 PM, remi tassing 
>>> wrote:
>>>
>>> > Hi,
>>> > I have medium-low experience on Solr and I have a question I couldn't
>>> quite
>>> > solve yet.
>>> >
>>> > Typically we have quite short query strings (a couple of words) and the
>>> > search is done through a set of bigger documents. What if the logic is
>>> > turned a little bit around. I have a document and I need to find out what
>>> > strings appear in the document. A string here could be a person name
>>> > (including space for example) or a location...which are indexed in Solr.
>>> >
>>> > A concrete example, we take this text from wikipedia (Mad Max):
>>> > "Mad Max is a 1979 Australian dystopian action film directed by George
>>> > Miller. Written by Miller and James McCausland from a story by Miller
>>> > and producer Byron Kennedy, it tells a story of societal breakdown,
>>> > murder, and vengeance. The film, starring the then-little-known Mel
>>> > Gibson, was released internationally in 1980. It became a top-grossing
>>> > Australian film, while holding the record in the Guinness Book of
>>> > Records for decades as the most profitable film ever created,[1] and
>>> > has been credited for further opening the global market to Australian
>>> > New Wave films."
>>> > 

Empty string in field used for grouping causes NPE in 4.x

2015-10-02 Thread Shawn Heisey
Let's say I'm using group.field=ip in a query.  If the index contains
documents where the ip field is the empty string, grouping fails with
NullPointerException in the response writer.  From our perspective, this
is a bad document that we have to fix, but I don't think it should have
failed the query (HTTP response code 500).  The empty string might be a
perfectly valid value for some use cases.

This is in Solr 4.7.2 and 4.9.1.  Is this a bug, or expected behavior?

FYI, I cannot check if the same problem exists in 5.x, because of
something else I stumbled across while working on a monitoring script
which uses these queries:

https://issues.apache.org/jira/browse/SOLR-8088

Here's the full exception from Solr 4.9.1:

java.lang.NullPointerException
        at org.apache.solr.response.JSONWriter.writeSolrDocument(JSONResponseWriter.java:330)
        at org.apache.solr.response.TextResponseWriter.writeSolrDocumentList(TextResponseWriter.java:220)
        at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:182)
        at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:184)
        at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:300)
        at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:186)
        at org.apache.solr.response.JSONWriter.writeArray(JSONResponseWriter.java:542)
        at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)
        at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:184)
        at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:300)
        at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:186)
        at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:184)
        at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:300)
        at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:186)
        at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:184)
        at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:300)
        at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:96)
        at org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:61)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:765)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1476)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:370)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:745)

Thanks,
Shawn



Drill down facet for multi valued groups of fields

2015-10-02 Thread Douglas McGilvray
Hi everyone, my first post to the list! I tried and failed to explain this on 
IRC, I hope I can do a better job here.   

My document has a group of text fields: company, location, year. The group can 
have multiple values and I would like to facet (drill down) beginning with any 
of the three fields. The order of the groups is not important.

Example Doc1: 
{company1: Bolts, location1: NY, year1: 2002}
{company2: Nuts,  location2: SF, year2: 2010}

If I select two filters: fq=company:Bolts && fq=location:SF, Doc1 should not be 
in the results, because although the two individual values occur in the 
document, they are not within the same group.  

Following the instructions for facet.prefix based drill down (the link will 
explain this far better than I can)
https://wiki.apache.org/solr/HierarchicalFaceting#A.27facet.prefix.27__Based_Drill_Down
I can create a custom field, let's call it cly, which represents a drill-down 
hierarchy company > location > year.
So for the document above it would contain the following:

0:Bolts
1:Bolts>NY
2:Bolts>NY>2002
0:Nuts
1:Nuts>SF
2:Nuts>SF>2010

I can retrieve the facets for the company using: facet.field={!key=company 
facet.prefix="0:"}cly

If the user selects the company Bolts, I can filter the values using: 
fq=cly:"0:Bolts"
And I can retrieve the facets for the location using: facet.field={!key=location 
facet.prefix="1:Bolts"}cly
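
Putting those together, a single request that applies the company filter and 
asks for the location facets might look like this (the collection name 
"products" is an assumption, and URL-encoding of the local params is omitted 
for readability):

http://localhost:8983/solr/products/select?q=*:*
    &fq=cly:"0:Bolts"
    &facet=true
    &facet.field={!key=location facet.prefix="1:Bolts"}cly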

This is fine if I want to drill down company > location > year, but what if, 
after selecting company, I now want to select year? I would have to make a 
field for each ordering of the fields: cly, cyl, lyc, ...

If the user selects Bolts, I can now retrieve the facets for year using 
facet.field={!key=year facet.prefix="1:Bolts"}cyl (NB the order of the letters 
here)

I hope the above makes sense, even if the idea itself is completely crazy. 
Obviously the number of extra fields grows factorially with the number of 
fields. I can't believe I am the first person to want to do this type of 
search, which makes me think there is probably another (better) way to do 
this. Is there?

Kind regards and many thanks in advance,
Douglas

Re: Solr 4.7.2 Vs 5.3.0 Docs different for same query

2015-10-02 Thread Tomoko Uchida
Hi Ravi,

And for minor additional information,
you may want to look through Collections API reference guide to handle
collections properly in SolrCloud environment. (I bookmark this page.)
https://cwiki.apache.org/confluence/display/solr/Collections+API


Regards,
Tomoko

2015-10-03 1:15 GMT+09:00 Erick Erickson :

> do we have to "reload" the collections on all the nodes to see the
> updated config ??
> YES
>
> Is there a single call which can update all nodes connected to the
> ensemble ??
>
> NO. I'll be a little pedantic here. When you say "ensemble", I'm not quite
> sure
> what that means and am interpreting it as "all collections registered with
> ZK".
> But see below.
>
> I just went to the admin UI and hit "Reload" button manually on each
> of the node...Is that
> the correct way to do it ?
>
> NO. The admin UI "core admin" screen is a remnant from the old days (like
> 3.x) when there was no concept of a distributed collection as a distinct
> entity; you had to do "by hand" all the things that SolrCloud now does
> automatically. PLEASE DO NOT USE THIS
> EXCEPT TO VIEW A REPLICA WHEN USING SOLRCLOUD! In particular, don't try to
> take any action that manipulates the core (reload, add, unload and the
> like).
> It'll work, but you have to know _exactly_ what you are doing. Go ahead
> and use it for viewing the current state of a replica/core, but unless
> you need to do something that you cannot do with the Collections API,
> it's very easy to go astray.
>
>
> Instead, use the "collections API". In this case, there's a call like
>
>
> http://localhost:8983/solr/admin/collections?action=RELOAD&name=CollectionName
>
> that will cause all the replicas associated with the collection to be
> reloaded. Given you
> mentioned linkconfig, I'm guessing that you have more than one
> collection looking at a
> particular configset, so the pedantic bit is you'd have to issue the
> above for each
> collection that references that configset.
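>
> For example (hypothetical collection names and host), assuming coll1 and
> coll2 both reference that configset, a quick shell loop does it:
>
> for c in coll1 coll2; do
>   curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=$c"
> done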
>
> Best,
> Erick
>
> P.S. Two bits:
> 1> actually the collections API uses the core admin calls to
> accomplish its tasks, but
> lots of effort went in to doing exactly the right thing
> 2> Upayavira has been creating an updated admin UI that will treat
> collections as
> first-class citizens (a work in progress). You can access it in 5.x by
> hitting
>
> solr_host:solr_port/solr/index.html
>
> Give it a whirl if you can and please provide any feedback you can, it'd
> be much
> appreciated.
>
> On Fri, Oct 2, 2015 at 7:47 AM, Ravi Solr  wrote:
> > Mr. Uchida,
> > Thank you for responding. It was my fault, I had an update
> > processor
> > which takes specific text and string fields and concatenates them into a
> > single field, and I search on that single field. Recently I used Atomic
> > update to fix a specific field's value and forgot to disable the
> > UpdateProcessor chain...Since I was only updating one field the aggregate
> > field got messed up with just that field value and hence I had issues
> > searching. I reindexed the data again yesterday night and now it is all
> > good.
> >
> > I do have a small question, when we update the zookeeper ensemble with
> new
> > configs via 'upconfig' and 'linkconfig' commands do we have to "reload"
> the
> > collections on all the nodes to see the updated config ?? Is there a
> single
> > call which can update all nodes connected to the ensemble ?? I just went
> to
> > the admin UI and hit "Reload" button manually on each of the node...Is
> that
> > the correct way to do it ?
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
> > On Fri, Oct 2, 2015 at 12:04 AM, Tomoko Uchida <
> tomoko.uchida.1...@gmail.com
> >> wrote:
> >
> >> Are you sure that you've indexed same data to Solr 4.7.2 and 5.3.0 ?
> >> If so, I suspect that you have multiple shards and request to one shard.
> >> (In that case, you might get partial results)
> >>
> >> Can you share HTTP request url and the schema and default search field ?
> >>
> >>
> >> 2015-10-02 6:09 GMT+09:00 Ravi Solr :
> >>
> >> > We migrated from 4.7.2 to 5.3.0. I sourced the docs from the 4.7.2 core
> and
> >> > indexed into 5.3.0 collection (data directories are different) via
> >> > SolrEntityProcessor. Currently my production is all whack because of
> this
> >> > issue. Do I have to go back and reindex all again ?? Is there a quick
> fix
> >> > for this ?
> >> >
> >> > Here are the results for the query 'obama'... please note the numFound.
> >> > 4.7.2 has almost 148519 docs while 5.3.0 reports far fewer. Any
> >> > pointers on how to correct this ?
> >> >
> >> >
> >> > Solr 4.7.2
> >> >
> >> > [response XML stripped by the list archive; the responseHeader showed
> >> > status=0, QTime=2, and params q=obama, start=0; the numFound value
> >> > itself did not survive]
> >> >
> >> > SolrCloud 5.3.0

Re: Zk and Solr Cloud

2015-10-02 Thread Rallavagu

Thanks for the insight into this, Erick.

On 10/2/15 8:58 AM, Erick Erickson wrote:

Rallavagu:

Absent nodes going up and down or otherwise changing state, Zookeeper
isn't involved in the normal operations of Solr (adding docs,
querying, all that). That said, things that change the state of the
Solr nodes _do_ involve Zookeeper and the Overseer. The Overseer is
used to serialize and control changing information in the
clusterstate.json (or state.json) and others. If the nodes all tried
to write to ZK directly, it would be hard to coordinate. That's a little
simplistic, and perhaps counterintuitive, but maybe this will help.

When a Solr instance starts up, it
1> registers itself as live with ZK
2> creates a listener that ZK pings when there's a state change (some
node goes up or down, goes into recovery, gets added, whatever).
3> gets the current cluster state from ZK.

Thereafter, this particular node doesn't need to ask ZK for anything.
It knows the current topology of the cluster and can route requests (index
or query) to the correct Solr replica etc.
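
(As a tiny illustration of that listener mechanism, and emphatically not
Solr's actual code, a raw ZooKeeper watch looks roughly like this; the
host and znode path are made up:

import org.apache.zookeeper.ZooKeeper;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, null);
        // One-shot watch: ZK notifies us on the next change to this znode;
        // the client must then re-read the data and re-register the watch.
        zk.getData("/clusterstate.json", event ->
                System.out.println("cluster state changed: " + event.getType()),
                null);
        Thread.sleep(60_000); // keep the session alive to receive the event
    }
}

The watch fires once and must be re-registered, which is exactly the
re-fetch cycle described above.)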

Now, let's claim that "something changes". Solr stops on one of the
nodes. Or someone adds a collection. Or whatever. The overseer usually gets
involved in changing the state on ZK for this new action. Part of that
is that ZK sends an event to all the Solr nodes that have registered
themselves as listeners that causes them to ask ZK for the current
state of the cluster, and each Solr node adjusts its actions based on
this information. Note the kind of thing here that changes and
triggers this is that a whole replica becomes able or unable to carry
out its functions, NOT that some collection gets another doc added
or answers a query.

ZK also periodically pings each Solr instance that's registered itself
and, if the node fails to respond, may force it into recovery, etc.
Again, though, that has nothing to do with standard Solr operations.

So a massive overseer queue tends to indicate that there's a LOT of
state changes, lots of nodes going up and down etc. One implication of
the above is that if you turn on all your nodes in a large cluster at
the same time, there'll be a LOT of activity; they'll all register
themselves, try to elect leaders for shards, go into/out of recovery,
become active, all these are things that trigger overseer activity.

Or there are simply bugs in how the overseer works in the version
you're using; I know there's been a lot of effort to harden that area
over the various versions.

Two things that are "interesting".
1> Only one of your Solr instances hosts the overseer. If you're doing
a restart of _all_ your boxes, it's advisable to bounce the node
that's the overseer _last_. Otherwise you risk an odd situation: the
overseer is elected and starts to work, that node restarts which
causes the overseer role to switch to another node which immediately
is bounced, and a new overseer is elected, and so on.

2> As of 5.x, there are two ZK formats:
a> the "old" format, where the entire clusterstate for all collections
is kept in a single ZK node (/clusterstate.json)
b> the "new" format, where each collection has its own state.json that
only contains the state for that collection.

The new format is very helpful when you have many collections. In the a>
case, any time _any_ node changes, _all_ nodes have to get a new state.
In the b> case, only the nodes involved in a single collection need to get
new information when any node in _that_ collection changes.

FWIW,
Erick



On Fri, Oct 2, 2015 at 8:03 AM, Ravi Solr  wrote:

Awesome nugget, Shawn. I also faced a similar issue a while ago while I was
doing a full re-index. It would be great if such tips were added into FAQ-type
documentation on the cwiki. I love the Solr forum; every day I learn
something new :-)

Thanks

Ravi Kiran Bhaskar

On Fri, Oct 2, 2015 at 1:58 AM, Shawn Heisey  wrote:


On 10/1/2015 1:26 PM, Rallavagu wrote:

Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.

See following errors in ZK and Solr and they are connected.

When I see the following error in Zookeeper,

unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len11823809 is out of range!


This is usually caused by the overseer queue (stored in zookeeper)
becoming extraordinarily huge, because it's being flooded with work
entries far faster than the overseer can process them.  This causes the
znode where the queue is stored to become larger than the maximum size
for a znode, which defaults to about 1MB.  In this case (reading your
log message that says len11823809), something in zookeeper has gotten to
be 11MB in size, so the zookeeper client cannot read it.

I think the zookeeper server code must be handling the addition of
children to the queue znode through a code path that doesn't pay
attention to the maximum buffer size, just goes ahead and adds it,
probably by simply appending data.  I'm unfamiliar with how the ZK
database works, so I'm guessing here.

If 

Re: Solr vs Lucene

2015-10-02 Thread Jack Krupansky
Did you have a specific reason why you didn't want to send an HTTP request
to Solr to perform the spellcheck operation? I mean, that is probably
easier than diving into raw Lucene code. Also, Solr lets you do a
spellcheck from a remote client whereas the Lucene spellcheck needs to be
on the same machine as the Lucene/Solr index directory.
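
For example, with the spellcheck component wired to a /spell request handler
(the handler name and core name here are assumptions matching the stock
sample configs), the whole remote operation is one HTTP request:

http://localhost:8983/solr/mycore/spell?q=hurrycane&spellcheck=true

and the client just parses the suggestions out of the response.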

-- Jack Krupansky

On Fri, Oct 2, 2015 at 7:42 AM, Mark Fenbers  wrote:

> Thanks for the suggestion, but I've looked at aspell and hunspell and
> neither provide a native Java API.  Further, I already use Solr for a
> search engine, too, so why not stick with this infrastructure for spelling,
> too?  I think it will work well for me once I figure out the right
> configuration to get it to do what I want it to.
>
> Mark
>
>
> On 10/1/2015 4:16 PM, Walter Underwood wrote:
>
>> If you want a spell checker, don’t use a search engine. Use a spell
>> checker. Something like aspell (http://aspell.net/ )
>> will be faster and better than Solr.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>
>


Re: Solr 4.7.2 Vs 5.3.0 Docs different for same query

2015-10-02 Thread Ravi Solr
Thank you very much Erick and Uchida. I will take a look at the URL you gave,
Erick.

Thanks

Ravi Kiran Bhaskar

On Fri, Oct 2, 2015 at 12:41 PM, Tomoko Uchida  wrote:

> Hi Ravi,
>
> And for minor additional information,
> you may want to look through Collections API reference guide to handle
> collections properly in SolrCloud environment. (I bookmark this page.)
> https://cwiki.apache.org/confluence/display/solr/Collections+API
> 
>
> Regards,
> Tomoko
>
> 2015-10-03 1:15 GMT+09:00 Erick Erickson :
>
> > do we have to "reload" the collections on all the nodes to see the
> > updated config ??
> > YES
> >
> > Is there a single call which can update all nodes connected to the
> > ensemble ??
> >
> > NO. I'll be a little pedantic here. When you say "ensemble", I'm not
> quite
> > sure
> > what that means and am interpreting it as "all collections registered
> with
> > ZK".
> > But see below.
> >
> > I just went to the admin UI and hit "Reload" button manually on each
> > of the node...Is that
> > the correct way to do it ?
> >
> > NO. The admin UI, "core admin" is a remnant from the old days (like
> > 3.x) where there was
> > no concept of distributed collection as a distinct entity, you had to
> > do all the things you now
> > do automatically in SolrCloud "by hand". PLEASE DO NOT USE THIS
> > EXCEPT TO VIEW A REPLICA WHEN USING SOLRCLOUD! In particular, don't try
> to
> > take any action that manipulates the core (reload, add, unload and the
> > like).
> > It'll work, but you have to know _exactly_ what you are doing. Go
> > ahead and use it for
> > viewing the current state of a replica/core, but unless you need to do
> > something that
> > you cannot do with the Collections API it's very easy to go astray.
> >
> >
> > Instead, use the "collections API". In this case, there's a call like
> >
> >
> >
> http://localhost:8983/solr/admin/collections?action=RELOAD&name=CollectionName
> >
> > that will cause all the replicas associated with the collection to be
> > reloaded. Given you
> > mentioned linkconfig, I'm guessing that you have more than one
> > collection looking at a
> > particular configset, so the pedantic bit is you'd have to issue the
> > above for each
> > collection that references that configset.
> >
> > Best,
> > Erick
> >
> > P.S. Two bits:
> > 1> actually the collections API uses the core admin calls to
> > accomplish its tasks, but
> > lots of effort went in to doing exactly the right thing
> > 2> Upayavira has been creating an updated admin UI that will treat
> > collections as
> > first-class citizens (a work in progress). You can access it in 5.x by
> > hitting
> >
> > solr_host:solr_port/solr/index.html
> >
> > Give it a whirl if you can and please provide any feedback you can, it'd
> > be much
> > appreciated.
> >
> > On Fri, Oct 2, 2015 at 7:47 AM, Ravi Solr  wrote:
> > > Mr. Uchida,
> > > Thank you for responding. It was my fault, I had an update
> > processor
> > > which takes specific text and string fields and concatenates them into
> a
> > > single field, and I search on that single field. Recently I used Atomic
> > > update to fix a specific field's value and forgot to disable the
> > > UpdateProcessor chain...Since I was only updating one field the
> aggregate
> > > field got messed up with just that field value and hence I had issues
> > > searching. I reindexed the data again yesterday night and now it is all
> > > good.
> > >
> > > I do have a small question, when we update the zookeeper ensemble with
> > new
> > > configs via 'upconfig' and 'linkconfig' commands do we have to "reload"
> > the
> > > collections on all the nodes to see the updated config ?? Is there a
> > single
> > > call which can update all nodes connected to the ensemble ?? I just
> went
> > to
> > > the admin UI and hit "Reload" button manually on each of the node...Is
> > that
> > > the correct way to do it ?
> > >
> > > Thanks
> > >
> > > Ravi Kiran Bhaskar
> > >
> > > On Fri, Oct 2, 2015 at 12:04 AM, Tomoko Uchida <
> > tomoko.uchida.1...@gmail.com
> > >> wrote:
> > >
> > >> Are you sure that you've indexed same data to Solr 4.7.2 and 5.3.0 ?
> > >> If so, I suspect that you have multiple shards and request to one
> shard.
> > >> (In that case, you might get partial results)
> > >>
> > >> Can you share HTTP request url and the schema and default search
> field ?
> > >>
> > >>
> > >> 2015-10-02 6:09 GMT+09:00 Ravi Solr :
> > >>
> > >> > We migrated from 4.7.2 to 5.3.0. I sourced the docs from the 4.7.2
> core
> > and
> > >> > indexed into 5.3.0 collection (data directories are different) via
> > >> > SolrEntityProcessor. Currently my production is all whack because of
> > this
> > >> > issue. Do I have to go back and reindex all again ?? Is there a
> quick
> > fix
> > >> > for this ?
> > >> >
> > >> > Here are the results for the query 

are there any SolrCloud supervisors?

2015-10-02 Thread r b
I've been working on something that just monitors ZooKeeper to add and
remove nodes from collections. The use case: I put SolrCloud in an
autoscaling group on EC2, and as instances go up and down, I need them
added to the collection. It's something I've built for work and could
clean up to share on GitHub if there is much interest.

I asked on IRC about a SolrCloud supervisor utility but wanted to
extend that question to this list. Are there any more "full featured"
supervisors out there?


-renning


Re: Reverse query?

2015-10-02 Thread Andrea Roggerone
Hi, the phrase query format would be:
"Mad Max"~2
The *s were added by the mail aggregator around the characters in bold for
some reason. They weren't wildcards.
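
(For anyone wanting to try the overlapping word-level segments Roman
suggests below: in Solr that is what ShingleFilterFactory produces. A
sketch of such a field type; the names are made up and it is untested:

<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on words, lowercase, drop stopwords, then emit 2-word shingles -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="2" outputUnigrams="false"/>
  </analyzer>
</fieldType>

Note the stop filter sits before the shingle filter, as Roman recommends.)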

On Friday, October 2, 2015, Roman Chyla  wrote:

> I'd like to offer another option:
>
> you say you want to match long query into a document - but maybe you
> won't know whether to pick "Mad Max" or "Max is" (not mentioning the
> performance hit of "*mad max*" search - or is it not the case
> anymore?). Take a look at the NGram tokenizer (say size of 2; or
> bigger). What it does, it splits the input into overlapping segments
> of 'X' words (words, not characters - however, characters work too -
> just pick bigger N)
>
> mad max
> max 1979
> 1979 australian
>
> i'd recommend placing stopfilter before the ngram
>
>  - then for the long query string of "Hey Mad Max is 1979" you
> would search "hey mad" OR "mad max" OR "max 1979"... (perhaps the query
> tokenizer could be convinced to the search for you automatically). And
> voila, the more overlapping segments there, the higher the search
> result.
>
> hth,
>
> roman
>
>
>
> On Fri, Oct 2, 2015 at 12:03 PM, Erick Erickson  > wrote:
> > The admin/analysis page is your friend here, find it and use it ;)
> > Note you have to select a core on the admin UI screen before you can
> > see the choice.
> >
> > Because apart from the other comments, KeywordTokenizer is a red flag.
> > It does NOT break anything up into tokens, so if your doc contains:
> > Mad Max is a 1979 Australian
> > as the whole field, the _only_ match you'll ever get is if you search
> exactly
> > "Mad Max is a 1979 Australian"
> > Not Mad, not mad, not Max, exactly all 6 words separated by exactly one
> space.
> >
> > Andrea's suggestion is the one you want, but be sure you use one of
> > the tokenizing analysis chains, perhaps start with text_en (in the
> > stock distro). Be sure to completely remove your node/data directory
> > (as in rm -rf data) after you make the change.
> >
> > And really, explore the admin/analysis page; it's where a LOT of these
> > kinds of problems find solutions ;)
> >
> > Best,
> > Erick
> >
> > On Fri, Oct 2, 2015 at 7:57 AM, Ravi Solr  > wrote:
> >> Hello Remi,
> >> I am assuming the field where you store the data is analyzed.
> >> The field definition might help us answer your question better. If you are
> >> using the edismax handler for your search requests, I believe you can achieve
> >> your goal by setting your "mm" to 100%, and the phrase slop "ps" and query slop
> >> "qs" parameters to zero. I think that will force exact matches.
> >>
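> >> For instance, a request along these lines (core and field names are
> >> made up; note the % in mm must be URL-encoded):
> >> http://localhost:8983/solr/core1/select?q=mad+max&defType=edismax&qf=text&mm=100%25&ps=0&qs=0
> >>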
> >> Thanks
> >>
> >> Ravi Kiran Bhaskar
> >>
> >> On Fri, Oct 2, 2015 at 9:48 AM, Andrea Roggerone <
> >> andrearoggerone.o...@gmail.com > wrote:
> >>
> >>> Hi Remy,
> >>> The question is not really clear, could you explain a little bit better
> >>> what you need? Reading your email I understand that you want to get
> >>> documents containing all the search terms typed. For instance if you
> search
> >>> for "Mad Max", you wanna get documents containing both Mad and Max. If
> >>> that's your need, you can use a phrase query like:
> >>>
> >>> *"*Mad Max*"~2*
> >>>
> >>> where enclosing your keywords between double quotes means that you
> want to
> >>> get both Mad and Max and the optional parameter ~2 is an example of
> *slop*.
> >>> If you need more info you can look for *Phrase Query* in
> >>> https://wiki.apache.org/solr/SolrRelevancyFAQ
> >>>
> >>> On Fri, Oct 2, 2015 at 2:33 PM, remi tassing  >
> >>> wrote:
> >>>
> >>> > Hi,
> >>> > I have medium-low experience on Solr and I have a question I couldn't
> >>> quite
> >>> > solve yet.
> >>> >
> >>> > Typically we have quite short query strings (a couple of words) and
> the
> >>> > search is done through a set of bigger documents. What if the logic
> is
> >>> > turned a little bit around. I have a document and I need to find out
> what
> >>> > strings appear in the document. A string here could be a person name
> >>> > (including space for example) or a location...which are indexed in
> Solr.
> >>> >
> >>> > A concrete example: we take this text from Wikipedia (Mad Max):
> >>> > "Mad Max is a 1979 Australian dystopian action film directed by George
> >>> > Miller. Written by Miller and James McCausland from a story by Miller and
> >>> > producer Byron Kennedy, it tells a story of societal breakdown, murder, and
> >>> > vengeance. The film, starring the then-little-known Mel Gibson
> >>> > <https://en.wikipedia.org/wiki/Mel_Gibson>, was released internationally
> >>> > in 1980. It became a top-grossing Australian

Re: Facet queries blow out the filterCache

2015-10-02 Thread Jeff Wartes

I backed up a bit. I took the stock Solr download and did this:

solr-5.3.1>$ bin/solr -e techproducts

So, no SolrCloud, default example config, about as basic as you get. I
didn’t even bother indexing any docs. Then I issued this query:

http://localhost:8983/solr/techproducts/select?q=name:foo&rows=1&facet=true&facet.field=popularity&facet.mincount=0&facet.limit=-1


This still causes an insert into the filterCache.

The only real difference I'm noticing vs my SolrCloud collection is that
repeating the query increments cache lookups and hits. It’s still odd
though, because issuing new distinct queries causes a reported insert, but
not a lookup, so the cache hit ratio is always exactly 1.



On 10/2/15, 4:18 AM, "Toke Eskildsen"  wrote:

>On Thu, 2015-10-01 at 22:31 +, Jeff Wartes wrote:
>> It still inserts if I address the core directly and use distrib=false.
>
>It is quite strange that is is triggered with the direct access. If that
>can be reproduced in test, it looks like a performance optimization to
>be done.
>
>Anyway, operating under the assumption that the single-core facet
>request for some reason acts as a distributed call, the key to avoid the
>fine-counting is to ensure that _all_ possibly relevant term counts has
>been returned in the first facet phase.
>
>Try setting both facet.mincount=0 and facet.limit=-1.
>
>- Toke Eskildsen, State and University Library, Denmark
>
>



Recovery Thread Blocked

2015-10-02 Thread Rallavagu

Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper

During updates, some nodes go to very high CPU and become unavailable.
The thread dump shows 870 threads blocked like the one below, which
explains the high CPU. Any clues on where to look?


"Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked, native_blocked, daemon
-- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
at jrockit/vm/RNI.c2java(J)V(Native Method)


Re: Recovery Thread Blocked

2015-10-02 Thread Rallavagu

Here is the stack trace of the thread that is holding the lock.


"Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting, native_blocked, daemon
-- Waiting for notification on: org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
at RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a
at jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
at java/lang/Object.wait(J)V(Native Method)
at java/lang/Thread.join(Thread.java:1206)
^-- Lock released while waiting: org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
at java/lang/Thread.join(Thread.java:1259)
at org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)
^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)
^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
at jrockit/vm/RNI.c2java(J)V(Native Method)


Stack trace of one of the 870 threads that is waiting for the lock to be 
released.


"Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked, native_blocked, daemon
-- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
at jrockit/vm/RNI.c2java(J)V(Native Method)

On 10/2/15 4:12 PM, Rallavagu wrote:

Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper

During updates, some nodes go to very high CPU and become unavailable.
The thread dump shows 870 threads blocked like the one below, which
explains the high CPU. Any clues on where to look?

"Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked, native_blocked, daemon
 -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
 at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
 at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
 at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
 at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
 at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
 at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
 at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
 at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
 at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
 at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
 at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
 at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
 at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
 at jrockit/vm/RNI.c2java(J)V(Native Method)


Re: Zk and Solr Cloud

2015-10-02 Thread Upayavira
Very interesting, Shawn.

What I'd say is paste more of the stacktraces, so we can see the context
in which the exception happened. It could be that you are flooding the
overseer, or it could be that you have a synonyms file (or such) that is
too large. I'd like to think the rest of the stacktrace could give us
clues.

Upayavira

On Fri, Oct 2, 2015, at 06:58 AM, Shawn Heisey wrote:
> On 10/1/2015 1:26 PM, Rallavagu wrote:
> > Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.
> >
> > See following errors in ZK and Solr and they are connected.
> >
> > When I see the following error in Zookeeper,
> >
> > unexpected error, closing socket connection and attempting reconnect
> > java.io.IOException: Packet len11823809 is out of range!
> 
> This is usually caused by the overseer queue (stored in zookeeper)
> becoming extraordinarily huge, because it's being flooded with work
> entries far faster than the overseer can process them.  This causes the
> znode where the queue is stored to become larger than the maximum size
> for a znode, which defaults to about 1MB.  In this case (reading your
> log message that says len11823809), something in zookeeper has gotten to
> be 11MB in size, so the zookeeper client cannot read it.
> 
> I think the zookeeper server code must be handling the addition of
> children to the queue znode through a code path that doesn't pay
> attention to the maximum buffer size, just goes ahead and adds it,
> probably by simply appending data.  I'm unfamiliar with how the ZK
> database works, so I'm guessing here.
> 
> If I'm right about where the problem is, there are two workarounds to
> your immediate issue.
> 
> 1) Delete all the entries in your overseer queue using a zookeeper
> client that lets you edit the DB directly.  If you haven't changed the
> cloud structure and all your servers are working, this should be safe.
> 
> 2) Set the jute.maxbuffer system property on the startup commandline for
> all ZK servers and all ZK clients (Solr instances) to a size that's
> large enough to accommodate the huge znode.  In order to do the deletion
> mentioned in option 1 above,you might need to increase jute.maxbuffer on
> the servers and the client you use for the deletion.
> 
> These are just workarounds.  Whatever caused the huge queue in the first
> place must be addressed.  It is frequently a performance issue.  If you
> go to the following link, you will see that jute.maxbuffer is considered
> an unsafe option:
> 
> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#Unsafe+Options
> 
> In Jira issue SOLR-7191, I wrote the following in one of my comments:
> 
> "The giant queue I encountered was about 85 entries, and resulted in
> a packet length of a little over 14 megabytes. If I divide 85 by 14,
> I know that I can have about 6 overseer queue entries in one znode
> before jute.maxbuffer needs to be increased."
> 
> https://issues.apache.org/jira/browse/SOLR-7191?focusedCommentId=14347834
> 
> Thanks,
> Shawn
> 
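
(For reference, workaround 1 above can be done with the stock ZooKeeper CLI;
the ensemble address below is made up, and note this discards all queued
overseer work, so only do it when the cluster is otherwise stable:

  zkCli.sh -server zk1:2181 rmr /overseer/queue

Workaround 2 amounts to adding something like -Djute.maxbuffer=0x2000000 to
the JVM arguments of every ZK server and every Solr node.)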


Re: highlighting

2015-10-02 Thread Upayavira
In the end, in most open source projects, people implement that which
they need themselves, and offer it back to the community in the hope
that it will help others too.

If you need this, then I'd encourage you to look at the source of the
highlighting component and see how it might be done.

It would then be great to put your thoughts and ideas into a JIRA
ticket.

Upayavira
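
(In the meantime, a client-side version of the marker approach Upayavira
suggested earlier can recover offsets without HTML: ask Solr for custom
markers via the standard hl.simple.pre/hl.simple.post parameters, then
walk the returned snippet once. A rough Java sketch; the marker strings
are arbitrary, just pick ones that cannot occur in your text:

import java.util.ArrayList;
import java.util.List;

public class HighlightOffsets {
    private static final String PRE = "@@HL@@", POST = "@@/HL@@";

    /** Each int[] is {offset, length} within the marker-free text,
     *  which is appended to plainOut as a side effect. */
    public static List<int[]> extract(String marked, StringBuilder plainOut) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < marked.length()) {
            int start = marked.indexOf(PRE, i);
            if (start < 0) { plainOut.append(marked, i, marked.length()); break; }
            plainOut.append(marked, i, start);           // text before the hit
            int end = marked.indexOf(POST, start + PRE.length());
            if (end < 0) { plainOut.append(marked, start, marked.length()); break; }
            String hit = marked.substring(start + PRE.length(), end);
            spans.add(new int[]{plainOut.length(), hit.length()});
            plainOut.append(hit);                        // the hit, sans markers
            i = end + POST.length();
        }
        return spans;
    }
}

The (offset, length) pairs index into the de-marked text, which is what
would go to the StyledText widget.)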

On Thu, Oct 1, 2015, at 11:31 PM, Teague James wrote:
> Hi everyone!
> 
> Pardon if it's not proper etiquette to chime in, but that feature would
> solve some issues I have with my app for the same reason. We are using
> markers now and it is very clunky - particularly with phrases and certain
> special characters. I would love to see this feature too Mark! For what
> it's worth - up vote. Thanks!
> 
> Cheers!
> 
> -Teague James
> 
> > On Oct 1, 2015, at 6:12 PM, Koji Sekiguchi  
> > wrote:
> > 
> > Hi Mark,
> > 
> > I think I saw similar requirement recently in mailing list. The feature 
> > sounds reasonable to me.
> > 
> > > If not, how do I go about posting this as a feature request?
> > 
> > JIRA can be used for the purpose, but there is no guarantee that the 
> > feature is implemented. :(
> > 
> > Koji
> > 
> >> On 2015/10/01 20:07, Mark Fenbers wrote:
> >> Yeah, I thought about using markers, but then I'd have to search the 
> >> text for the markers to
> >> determine the locations.  This is a clunky way of getting the results I 
> >> want, and it would save two
> >> steps if Solr merely had an option to return a start/length array (of what 
> >> should be highlighted) in
> >> the original string rather than returning an altered string with tags 
> >> inserted.
> >> 
> >> Mark
> >> 
> >>> On 9/29/2015 7:04 AM, Upayavira wrote:
> >>> You can change the strings that are inserted into the text, and could
> >>> place markers that you use to identify the start/end of highlighting
> >>> elements. Does that work?
> >>> 
> >>> Upayavira
> >>> 
>  On Mon, Sep 28, 2015, at 09:55 PM, Mark Fenbers wrote:
>  Greetings!
>  
>  I have highlighting turned on in my Solr searches, but what I get back
>  is  tags surrounding the found term.  Since I use a SWT StyledText
>  widget to display my search results, what I really want is the offset
>  and length of each found term, so that I can highlight it in my own way
>  without HTML.  Is there a way to configure Solr to do that?  I couldn't
>  find it.  If not, how do I go about posting this as a feature request?
>  
>  Thanks,
>  Mark
> >