Re: [poll] virtualization platform for SOLR

2015-10-01 Thread Upayavira
What are you trying to achieve by using virtualisation?

If it is just code separation, consider using containers and Docker
rather than fully fledged VMs.

CPU is shared, but each container sees its own view of its file system.
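
As a rough illustration (assuming the official Solr image is available; the
tag and the memory cap below are only placeholders), one container per Solr
instance can be started with something like:

  docker run -d --name solr1 --memory=8g -p 8983:8983 solr:5.3

Resource limits per container then take the place of sizing individual VMs.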

Upayavira

On Thu, Oct 1, 2015, at 07:47 AM, Bernd Fehling wrote:
> Hi Shawn,
> 
> unfortunately we have to run VMs, otherwise we would waste hardware.
> I thought other solr users are in the same situation but it seems that
> other users have tons of hardware available and we are the only ones
> having to use VMs.
> Right, bare metal is always better than any VM.
> As you mentioned we have the indexer (master) on one physical machine
> and two searchers (slaves) on other physical machines, all together with
> other little VMs which are not I/O and CPU heavy.
> 
> Regards
> Bernd
> 
> Am 30.09.2015 um 18:48 schrieb Shawn Heisey:
> > On 9/30/2015 3:12 AM, Bernd Fehling wrote:
> >> while setting up some new servers (virtual machines) using XEN I was
> >> thinking about an alternative like KVM. My last tests with KVM is
> >> a while ago and XEN performed much better in the area of I/O and
> >> CPU usage.
> >> This lead me to the idea to start a poll about virtualization platform and 
> >> your experiences.
> > 
> > I once had a virtualized Solr install with Xen where each VM housed one
> > Solr instance with one core.  The index was distributed, so it required
> > several VMs for one copy of the index.
> > 
> > I eliminated the virtualization, used the same hardware as bare metal
> > with Linux, still one Solr instance installed on the machine, but with
> > multiple Solr cores.  Performance is much better now.
> > 
> > General advice:  Don't run virtual machines.
> > 
> > If a virtual environment is the only significant hardware you have
> > access to and it's used for more than Solr, then you might need to.  If
> > you do run virtual, then minimize the number of VMs, don't put multiple
> > replicas of the same index data on the same physical VM host, give each
> > Solr VM lots of memory, and don't oversubscribe the memory/cpu on the
> > physical VM host.
> > 
> > Thanks,
> > Shawn
> > 


Re: [poll] virtualization platform for SOLR

2015-10-01 Thread Toke Eskildsen
Bernd Fehling  wrote:
> unfortunately we have to run VMs, otherwise we would waste hardware.
> I thought other solr users are in the same situation but seams that
> other users have tons of hardware available and we are the only one
> having to use VMs.

We have ~5 smaller (< 1M documents) Solr setups that run under VMware (chosen 
because that is what Operations use for all their virtualization). We have a 
single and quite large setup (terabytes of data, billions of documents) that 
runs alone on dedicated hardware. Then we have the third solution: Multiple 
independent Solr oriented projects that share the same bare metal. CentOS 
everywhere BTW.

We would probably get better hardware utilization by running the hardware 
sharing setups in a virtualization system, together with some random other 
projects. But I doubt we would gain much for the cost of rocking the 
high-performance boat.

We do have some other bare-metal setups than Solr at our organization (State 
and University Library, Denmark), but the default for most other projects is to 
use virtualizations. Going mostly bare metal with Solr was an explicit and 
performance-driven decision.

Except for the virtualized instances, we only use local SSDs to hold our index 
data. That might affect the trade-off, as even slight delays in IO become 
visible when storage access times are < 0.1ms instead of > 1ms. I suspect the 
relative impact of virtualization is less with spinning drives or networked 
storage.

- Toke Eskildsen


Re: [poll] virtualization platform for SOLR

2015-10-01 Thread Bernd Fehling
Hi Upayavira,

best would be to have 4 dedicated servers, 2 for indexing (masters) and
2 for searching (slaves). Always one is online and one is standby in
case of hardware failure or update of OS, JAVA or even SOLR.

But I only get 256GB RAM machines with many CPUs which I have to share
with other project partners. Such a machine as dedicated SOLR server
would be oversized for a single index SOLR system.
Currently 64GB RAM machines are sufficient.

You think docker could do this?

Regards
Bernd

Am 01.10.2015 um 09:29 schrieb Upayavira:
> What are you trying to achieve by using virtualisation?
> 
> If it is just code separation, consider using containers and Docker
> rather than fully fledged VMs.
> 
> CPU is shared, but each container sees its own view of its file system.
> 
> Upayavira
> 
> On Thu, Oct 1, 2015, at 07:47 AM, Bernd Fehling wrote:
>> Hi Shawn,
>>
>> unfortunately we have to run VMs, otherwise we would waste hardware.
>> I thought other solr users are in the same situation but it seems that
>> other users have tons of hardware available and we are the only ones
>> having to use VMs.
>> Right, bare metal is always better than any VM.
>> As you mentioned we have the indexer (master) on one physical machine
>> and two searchers (slaves) on other physical machines, all together with
>> other little VMs which are not I/O and CPU heavy.
>>
>> Regards
>> Bernd
>>
>> Am 30.09.2015 um 18:48 schrieb Shawn Heisey:
>>> On 9/30/2015 3:12 AM, Bernd Fehling wrote:
 while setting up some new servers (virtual machines) using XEN I was
 thinking about an alternative like KVM. My last tests with KVM is
 a while ago and XEN performed much better in the area of I/O and
 CPU usage.
 This lead me to the idea to start a poll about virtualization platform and 
 your experiences.
>>>
>>> I once had a virtualized Solr install with Xen where each VM housed one
>>> Solr instance with one core.  The index was distributed, so it required
>>> several VMs for one copy of the index.
>>>
>>> I eliminated the virtualization, used the same hardware as bare metal
>>> with Linux, still one Solr instance installed on the machine, but with
>>> multiple Solr cores.  Performance is much better now.
>>>
>>> General advice:  Don't run virtual machines.
>>>
>>> If a virtual environment is the only significant hardware you have
>>> access to and it's used for more than Solr, then you might need to.  If
>>> you do run virtual, then minimize the number of VMs, don't put multiple
>>> replicas of the same index data on the same physical VM host, give each
>>> Solr VM lots of memory, and don't oversubscribe the memory/cpu on the
>>> physical VM host.
>>>
>>> Thanks,
>>> Shawn
>>>

-- 
*
Bernd Fehling                 Bielefeld University Library
Dipl.-Inform. (FH)            LibTec - Library Technology
Universitätsstr. 25           and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060         bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*


RE: Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd

2015-10-01 Thread Adrian Liew
Hi all,

The problem below was resolved by appropriately setting my server ip addresses 
to have the following for each zoo.cfg:

server.1=10.0.0.4:2888:3888
server.2=10.0.0.5:2888:3888
server.3=10.0.0.6:2888:3888

as opposed to the following:

server.1=10.0.0.4:2888:3888
server.2=10.0.0.5:2889:3889
server.3=10.0.0.6:2890:3890

I am not sure why the above can be an issue (by right it should not); however, I 
followed the recommendations provided by the ZooKeeper administration guide under 
Running Replicated ZooKeeper 
(https://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html#sc_RunningReplicatedZooKeeper)

Given that I am testing multiple servers in a multiserver environment, it will 
be safe to use 2888:3888 on each server rather than have different ports.
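
For reference, the full zoo.cfg on each box ends up looking roughly like this
(dataDir and the timing values here are illustrative; clientPort is whatever
that particular box exposes, 2181/2182/2183 in my case):

  tickTime=2000
  initLimit=10
  syncLimit=5
  dataDir=C:\zookeeper\data
  clientPort=2181
  server.1=10.0.0.4:2888:3888
  server.2=10.0.0.5:2888:3888
  server.3=10.0.0.6:2888:3888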

Regards,
Adrian

From: Adrian Liew [mailto:adrian.l...@avanade.com]
Sent: Thursday, October 1, 2015 5:32 PM
To: solr-user@lucene.apache.org
Subject: Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd

Hi there,

Currently, I have setup an azure virtual network to connect my Zookeeper 
clusters together with three Azure VMs. Each VM has an internal IP of 10.0.0.4, 
10.0.0.5 and 10.0.0.6. I have also setup Solr 5.3.0 which runs in Solr Cloud 
mode connected to all three Zookeepers in an external ensemble manner.

I am able to connect to 10.0.0.4 and 10.0.0.6 via the zkCli.cmd after starting 
the Zookeeper services. However for 10.0.0.5, I keep getting the below error 
even if I started the zookeeper service.

[cid:image001.png@01D0FC6E.BDC2D990]

I have restarted 10.0.0.5 VM several times and still am unable to connect to 
Zookeeper via zkCli.cmd. I have checked zoo.cfg (making sure ports, data and 
logs are all set correctly) and myid to ensure they have the correct 
configurations.

The simple command line I used to connect to Zookeeper is zkCli.cmd -server 
10.0.0.5:2182 for example.

Any ideas?

Best regards,

Adrian Liew |  Consultant Application Developer
Avanade Malaysia Sdn. Bhd. | Consulting Services
Direct: +(603) 2382 5668
Mobile: +6010-2288030




Re: Join with faceting and filtering

2015-10-01 Thread Mikhail Khludnev
1. i'd say it's challenge.
2. can't you do the opposite filter active contracts, join them back to
items, and facet then?
q=(Description:colgate OR Categories:colgate OR
Sellers:colgate)&fq={!join from=ItemId to=ItemId
fromIndex=Contracts}Active:true&facet.field=SellersString
3. note: there is {!terms} QParser (which makes leg-shooting easier); see the sketch below.
4. what are number of documents you operate? what is update frequency? Is
there a chance to keep both types in the single index?
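
Regarding point 3, a {!terms} filter for your second query would look roughly
like this (field and ids taken from your example; the ids are placeholders):

  q=*:*&fq={!terms f=ItemId}Id1,Id2,Id3&facet.field=SellersString

it parses the id list directly instead of building boolean clauses, so
maxBooleanClauses stops being a concern.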

On Thu, Oct 1, 2015 at 5:58 AM, Troy Edwards 
wrote:

> I am working with the following indices
>
> *Item*
>
> ItemId - string
> Description - text (query on this)
> Categories - Multivalued text (query on this)
> Sellers - Multivalued text (query on this)
> SellersString - Multivalued string (Need to facet and filter on this)
>
> *ContractItem*
>
> ContractItemId - string
> ItemId - string
> ContractCode - string (facet and filter on this)
> Priority -  integer (order by priority descending)
> Active - boolean (filter on this)
>
> Say someone is searching for colgate
>
> I am doing two queries:
>
> First query: q={!join from=ItemId to=ItemId
> fromIndex=Item}(Description:colgate OR Categories:colgate OR
> Sellers:colgate)&facet.field=ContractCode
>
> From the first query I get all the ItemIds and do a second query on Item
> index using q=ItemId:(Id1 Id2 Id3) and generate facet on SellersString
>
> I have to do some custom coding to retain Priority (so that I can sort on
> it)
>
> Following are the issues I am running into:
>
> 1) Since there are a lot of Items and ContractItems, the number of Ids
> becomes large and I had to increase maxBooleanClauses (possible performance
> degradation?)
>
> 2) Since I have to return a lot of items from first query, the data size
> becomes very large (again a performance concern)
>
> 3) When a filter is applied on the second query, I have to adjust the facet
> results of the first query
>
> 4) Overall this seems complex
>
> Is it possible to do just one query and apply filters (if any) and get
> results along with facets?
>
> Any suggestions on simplifying this and improving performance?
>
> Thanks in advance
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Join with faceting and filtering

2015-10-01 Thread Troy Edwards
I had missed a field in ContractItem index (ClientId)

*ContractItem*

ContractItemId - string
ItemId - string
ClientId - string
ContractCode - string (facet and filter on this)
Priority -  integer (order by priority descending)
Active - boolean (filter on this)

2) It appears that I cannot have fromIndex=Contracts because it is very
large and has to be sharded. Per my understanding SolrCloud join does not
support multiple shards

4) The Item index contains approximately 2 million items. For ContractItem
there are about 10,000 clients with about 1.5 million records for each
client. So the total ContractItem records are close to 15 billion.

Several updates are made to Item during the day. Sometimes clients will
make large changes to ContractItem.

Any thoughts/suggestions?

On Thu, Oct 1, 2015 at 6:09 AM, Mikhail Khludnev  wrote:

> 1. i'd say it's challenge.
> 2. can't you do the opposite filter active contracts, join them back to
> items, and facet then?
> q=(Description:colgate OR Categories:colgate OR
> Sellers:colgate)&fq={!join from=ItemId to=ItemId
> fromIndex=Contracts}Active:true&facet.field=SellersString
> 3. note: there is {!terms} QParser (which makes leg-shooting easier).
> 4. what are number of documents you operate? what is update frequency? Is
> there a chance to keep both types in the single index?
>
> On Thu, Oct 1, 2015 at 5:58 AM, Troy Edwards 
> wrote:
>
> > I am working with the following indices
> >
> > *Item*
> >
> > ItemId - string
> > Description - text (query on this)
> > Categories - Multivalued text (query on this)
> > Sellers - Multivalued text (query on this)
> > SellersString - Multivalued string (Need to facet and filter on this)
> >
> > *ContractItem*
> >
> > ContractItemId - string
> > ItemId - string
> > ContractCode - string (facet and filter on this)
> > Priority -  integer (order by priority descending)
> > Active - boolean (filter on this)
> >
> > Say someone is searching for colgate
> >
> > I am doing two queries:
> >
> > First query: q={!join from=ItemId to=ItemId
> > fromIndex=Item}(Description:colgate OR Categories:colgate OR
> > Sellers:colgate)&facet.field=ContractCode
> >
> > From the first query I get all the ItemIds and do a second query on Item
> > index using q=ItemId:(Id1 Id2 Id3) and generate facet on SellersString
> >
> > I have to do some custom coding to retain Priority (so that I can sort on
> > it)
> >
> > Following are the issues I am running into:
> >
> > 1) Since there are a lot of Items and ContractItems, the number of Ids
> > becomes large and I had to increase maxBooleanClause (possible
> performance
> > degradation?)
> >
> > 2) Since I have to return a lot of items from first query, the data size
> > becomes very large (again a performance concern)
> >
> > 3) When a filter is applied on the second query, I have to adjust the
> facet
> > results of the first query
> >
> > 4) Overall this seems complex
> >
> > Is it possible to do just one query and apply filters (if any) and get
> > results along with facets?
> >
> > Any suggestions on simplifying this and improving performance?
> >
> > Thanks in advance
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


Solr vs Lucene

2015-10-01 Thread Mark Fenbers

Greetings!

Being a newbie, I'm still mostly in the dark regarding where the line is 
between Solr and Lucene.  The following code snippet is -- I think -- 
all Lucene and no Solr.  It is a significantly modified version of some 
example code I found on the net.


dir = FSDirectory.open(FileSystems.getDefault().getPath(
        "/localapps/dev/EventLog/solr/data", "SpellIndex"));
speller = new SpellChecker(dir);
fis = new FileInputStream("/usr/share/dict/words");
analyzer = new StandardAnalyzer();
speller.indexDictionary(new PlainTextDictionary(fis),
        new IndexWriterConfig(analyzer), false);

// now let's see speller in action...
System.out.println(speller.exist("beez"));  // returns false
System.out.println(speller.exist("bees"));  // returns true

String[] suggestions = speller.suggestSimilar("beez", 10);
for (String suggestion : suggestions)
    System.err.println(suggestion);

(Later in my code, I close what objects need to be...)  This code 
(above) does the following:


1. identifies whether a given word is misspelled or spelled correctly.
2. Gives alternate suggestions to a given word (whether spelled
   correctly or not).
3. I presume, but haven't tested this yet, that I can add a second or
   third word list to the index, say, a site dictionary containing
   names of people or places commonly found in the text.

But this code does not:

1. parse any given text into words, and testing each word.
2. provide markers showing where the misspelled/suspect words are
   within the text.

and so my code will have to provide the latter functionality.  Or does 
Solr provide this capability, such that it would be silly to write my own?
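
If I do end up writing that part myself, I picture something like the sketch
below (just my own rough idea, reusing the speller object from above and
java.text.BreakIterator for the word splitting; imports omitted; this is not
anything Solr provides):

static List<int[]> findMisspelled(SpellChecker speller, String text) throws IOException {
    // Walk the text word by word and record {start, length} for every word
    // the speller does not recognize.  Offsets are relative to the input text.
    List<int[]> hits = new ArrayList<>();
    BreakIterator words = BreakIterator.getWordInstance();
    words.setText(text);
    int start = words.first();
    for (int end = words.next(); end != BreakIterator.DONE;
            start = end, end = words.next()) {
        String w = text.substring(start, end);
        if (Character.isLetter(w.charAt(0)) && !speller.exist(w.toLowerCase())) {
            hits.add(new int[] { start, end - start });
        }
    }
    return hits;
}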


Thanks,

Mark



Re: highlighting

2015-10-01 Thread Mark Fenbers
Yeah, I thought about using markers, but then I'd have to search the 
text for the markers to determine the locations.  This is a clunky way 
of getting the results I want, and it would save two steps if Solr 
merely had an option to return a start/length array (of what should be 
highlighted) in the original string rather than returning an altered 
string with tags inserted.
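
If I do go the marker route, the extraction would presumably look something
like this sketch (my own code, not a Solr feature; the pre/post strings are
whatever I pass as hl.simple.pre/hl.simple.post, and hl.fragsize=0 so the
whole field comes back as a single fragment):

static List<int[]> markerOffsets(String snippet, String pre, String post) {
    // Convert a marker-delimited highlight snippet into {start, length} pairs
    // relative to the original (marker-free) text.
    List<int[]> spans = new ArrayList<>();
    int removed = 0;                    // marker characters seen so far
    int i = 0;
    while ((i = snippet.indexOf(pre, i)) >= 0) {
        int termStart = i - removed;
        int j = snippet.indexOf(post, i + pre.length());
        if (j < 0) break;
        spans.add(new int[] { termStart, j - (i + pre.length()) });
        removed += pre.length() + post.length();
        i = j + post.length();
    }
    return spans;
}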


Mark

On 9/29/2015 7:04 AM, Upayavira wrote:

You can change the strings that are inserted into the text, and could
place markers that you use to identify the start/end of highlighting
elements. Does that work?

Upayavira

On Mon, Sep 28, 2015, at 09:55 PM, Mark Fenbers wrote:

Greetings!

I have highlighting turned on in my Solr searches, but what I get back
is <em> tags surrounding the found term.  Since I use a SWT StyledText
widget to display my search results, what I really want is the offset
and length of each found term, so that I can highlight it in my own way
without HTML.  Is there a way to configure Solr to do that?  I couldn't
find it.  If not, how do I go about posting this as a feature request?

Thanks,
Mark




RE: [poll] virtualization platform for SOLR

2015-10-01 Thread Davis, Daniel (NIH/NLM) [C]
Shawn,

Same answer as Bernd.   We have a big VmWare vCenter setup and Netapp.
That's what we have to use.   Even in a VM world, some advice persists - 
"local" disk is faster than network disk even if the "local" disk is virtual.   
 Netapp disk is exported to VmWare vCenter over Fibre-Channel, and vCenter has 
its own battery-backed caching.   It is still far better to use "local" disk 
even on a VM rather than use NFS.   

I ran some not-very-scientific tests with fio, and then replayed 
search logs, to prove this... 

Hope this helps,

Dan

-Original Message-
From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] 
Sent: Thursday, October 01, 2015 2:48 AM
To: solr-user@lucene.apache.org
Subject: Re: [poll] virtualization platform for SOLR

Hi Shawn,

unfortunately we have to run VMs, otherwise we would waste hardware.
I thought other solr users are in the same situation but it seems that other users 
have tons of hardware available and we are the only ones having to use VMs.
Right, bare metal is always better than any VM.
As you mentioned we have the indexer (master) on one physical machine and two 
searchers (slaves) on other physical machines, all together with other little 
VMs which are not I/O and CPU heavy.

Regards
Bernd

Am 30.09.2015 um 18:48 schrieb Shawn Heisey:
> On 9/30/2015 3:12 AM, Bernd Fehling wrote:
>> while setting up some new servers (virtual machines) using XEN I was 
>> thinking about an alternative like KVM. My last tests with KVM is a 
>> while ago and XEN performed much better in the area of I/O and CPU 
>> usage.
>> This lead me to the idea to start a poll about virtualization platform and 
>> your experiences.
> 
> I once had a virtualized Solr install with Xen where each VM housed 
> one Solr instance with one core.  The index was distributed, so it 
> required several VMs for one copy of the index.
> 
> I eliminated the virtualization, used the same hardware as bare metal 
> with Linux, still one Solr instance installed on the machine, but with 
> multiple Solr cores.  Performance is much better now.
> 
> General advice:  Don't run virtual machines.
> 
> If a virtual environment is the only significant hardware you have 
> access to and it's used for more than Solr, then you might need to.  
> If you do run virtual, then minimize the number of VMs, don't put 
> multiple replicas of the same index data on the same physical VM host, 
> give each Solr VM lots of memory, and don't oversubscribe the 
> memory/cpu on the physical VM host.
> 
> Thanks,
> Shawn
> 


Re: Solr vs Lucene

2015-10-01 Thread Alexandre Rafalovitch
Hi Mark,

Have you gone through a Solr tutorial yet? If/when you do, you will
see you don't need to code any of this. It is configured as part of
the web-facing total offering which are tweaked by XML configuration
files (or REST API calls). And most of the standard pipelines are
already pre-configured, so you don't need to invent them from scratch.

On your specific question, it would be better to ask what _business_
level functionality you are trying to achieve and see if Solr can help
with that. Starting from Lucene code is less useful :-)

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 1 October 2015 at 07:48, Mark Fenbers  wrote:
> Greetings!
>
> Being a newbie, I'm still mostly in the dark regarding where the line is
> between Solr and Lucene.  The following code snippet is -- I think -- all
> Lucene and no Solr.  It is a significantly modified version of some example
> code I found on the net.
>
> dir =
> FSDirectory.open(FileSystems.getDefault().getPath("/localapps/dev/EventLog/solr/data",
> "SpellIndex"));
> speller = new SpellChecker(dir);
> fis = new FileInputStream("/usr/share/dict/words");
> analyzer = new StandardAnalyzer();
> speller.indexDictionary(new PlainTextDictionary(EventLog.fis), new
> IndexWriterConfig(analyzer), false);
>
> // now let's see speller in action...
> System.out.println(speller.exist("beez"));  // returns false
> System.out.println(speller.exist("bees"));  // returns true
>
> String[] suggestions = speller.suggestSimilar("beez", 10);
> for (String suggestion : suggestions)
> System.err.println(suggestion);
>
> (Later in my code, I close what objects need to be...)  This code (above)
> does the following:
>
> 1. identifies whether a given word is misspelled or spelled correctly.
> 2. Gives alternate suggestions to a given word (whether spelled
>correctly or not).
> 3. I presume, but haven't tested this yet, that I can add a second or
>third word list to the index, say, a site dictionary containing
>names of people or places commonly found in the text.
>
> But this code does not:
>
> 1. parse any given text into words, and testing each word.
> 2. provide markers showing where the misspelled/suspect words are
>within the text.
>
> and so my code will have to provide the latter functionality.  Or does Solr
> provide this capability, such that it would be silly to write my own?
>
> Thanks,
>
> Mark
>


Re: Create Collection in Solr Cloud using Solr 5.3.0 giving timeout issues

2015-10-01 Thread Shawn Heisey
On 10/1/2015 4:43 AM, Adrian Liew wrote:
> E:\solr-5.3.0\bin>solr.cmd create_collection -c sitecore_core_index -n 
> sitecore_common_configs -shards 1 -replicationFactor 3
> 
> Connecting to ZooKeeper at 10.0.0.4:2181,10.0.0.5:2182,10.0.0.6:2183 ...
> Re-using existing configuration directory sitecore_common_configs
> 
> Creating new collection 'sitecore_core_index' using command:
> http://localhost:8983/solr/admin/collections?action=CREATE&name=sitecore_core_index
> &numShards=1&replicationFactor=3&maxShardsPerNode=2&collection.configName=sitecore_common_configs
> 
> ERROR: Failed to create collection 'sitecore_core_index' due to: create the 
> collection time out:180s


The timeout, as it mentions, is 180 seconds, or three minutes.  This is
the default timeout for the Collections API, and it is a particularly
long timeout.  When it is exceeded, it is usually an indication of a
serious problem.  The collection create will likely succeed eventually,
after an unknown amount of time ... the collections API just gave up on
waiting for the response.

There are two things that I know of that can cause this:  A very large
number of collections, and general performance issues.

I did some testing a while back with thousands of empty collections on
the Solr 5.x cloud example.  It did not turn out well.  Many things
timed out, and a server restart would throw the whole cloud into chaos
for a very long time. If those collections were not empty, then I
suspect the problems would be even worse.

General performance issues (usually RAM-related) can cause big problems
with SolrCloud too.  The following wiki page is most of my accumulated
knowledge about what causes performance problems with Solr:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



RE: Create Collection in Solr Cloud using Solr 5.3.0 giving timeout issues

2015-10-01 Thread Adrian Liew
Hi Shawn,

Thanks for that. You did mention about starting out with empty collections and 
already I am experiencing timeout issues. Could this have to do with the 
hardware or server spec sizing itself. For example, lack of memory allocated, 
network issues etc. that can possibly cause this? Given Azure has a 99.95 
percent SLA, I don't think this could be a network issue.

I am currently using a 4-core, 7 GB RAM machine for an individual Solr 
Server.

I don't quite understand why this is happening as I am just trying to setup a 
bare bones Solr Cloud setup using Solr 5.3.0 and Zookeeper 3.4.6. 

Any tips will be much appreciated. 

Best regards,
Adrian


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Thursday, October 1, 2015 11:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Create Collection in Solr Cloud using Solr 5.3.0 giving timeout 
issues

On 10/1/2015 4:43 AM, Adrian Liew wrote:
> E:\solr-5.3.0\bin>solr.cmd create_collection -c sitecore_core_index -n 
> sitecore_common_configs -shards 1 -replicationFactor 3
> 
> Connecting to ZooKeeper at 10.0.0.4:2181,10.0.0.5:2182,10.0.0.6:2183 ...
> Re-using existing configuration directory sitecore_common_configs
> 
> Creating new collection 'sitecore_core_index' using command:
> http://localhost:8983/solr/admin/collections?action=CREATE&name=sitecore_core_index
> &numShards=1&replicationFactor=3&maxShardsPerNode=2&collection.configName=sitecore_common_configs
> 
> ERROR: Failed to create collection 'sitecore_core_index' due to: 
> create the collection time out:180s


The timeout, as it mentions, is 180 seconds, or three minutes.  This is the 
default timeout for the Collections API, and it is a particularly long timeout. 
 When it is exceeded, it is usually an indication of a serious problem.  The 
collection create will likely succeed eventually, after an unknown amount of 
time ... the collections API just gave up on waiting for the response.

There are two things that I know of that can cause this:  A very large number 
of collections, and general performance issues.

I did some testing a while back with thousands of empty collections on the Solr 
5.x cloud example.  It did not turn out well.  Many things timed out, and a 
server restart would throw the whole cloud into chaos for a very long time. If 
those collections were not empty, then I suspect the problems would be even 
worse.

General performance issues (usually RAM-related) can cause big problems with 
SolrCloud too.  The following wiki page is most of my accumulated knowledge 
about what causes performance problems with Solr:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Re: Create Collection in Solr Cloud using Solr 5.3.0 giving timeout issues

2015-10-01 Thread Shawn Heisey
On 10/1/2015 9:26 AM, Adrian Liew wrote:
> Thanks for that. You did mention about starting out with empty collections 
> and already I am experiencing timeout issues. Could this have to do with the 
> hardware or server spec sizing itself. For example, lack of memory allocated, 
> network issues etc. that can possibly cause this? Given Azure has a 99.95 
> percent SLA, I don't think this could be a network issue.
>
> I am currently using a 4 core 7 GB RAM memory machine for an individual Solr 
> Server.
>
> I don't quite understand why this is happening as I am just trying to setup a 
> bare bones Solr Cloud setup using Solr 5.3.0 and Zookeeper 3.4.6. 
>
> Any tips will be much appreciated. 

Since I don't know anything about your install except for the number of
CPU cores and RAM, I can only give you general information.

One problem that can plague new installs is that the default Java heap
size for a Solr 5.x install is 512MB.  This works great when you first
fire it up, but as you add data, quickly becomes very limiting, and
needs to be increased.  The GC pauses that occur when the heap is too
small can be extreme.

Since you are on Windows, there is no install script.  Adding something
like "-m 3g" to the startup commandline will allocate more memory (3GB
for my example) to the Java heap.  Note that if your index data gets
very big, your VM might need more memory in order for OS disk caching to
be effective.
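
For example, something along these lines (the ZooKeeper string is just taken
from your earlier messages, and 3g is only an illustration):

  bin\solr.cmd start -c -m 3g -z "10.0.0.4:2181,10.0.0.5:2182,10.0.0.6:2183"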

Thanks,
Shawn



Re: Solr vs Lucene

2015-10-01 Thread Mark Fenbers
Yes, and I've spent numerous hours configuring and reconfiguring, and 
eventually even starting over, but still have not gotten it to work 
right.  Even now, I'm getting bizarre results.  For example, I query   
"NOTE: This is purely as an example."  and I get back really bizarre 
suggestions, like "n ot e" and "n o te" and "n o t e" for the first word 
which isn't even misspelled!  The same goes for "purely" and "example" 
also!  Moreover, I get extended results showing the frequencies of these 
suggestions being over 2600 occurrences, when I'm not even using an 
indexed spell checker.  I'm only using a file-based spell checker 
(/usr/share/dict/words), and the wordbreak checker.


At this point, I can't even figure out how to narrow down my confusion 
so that I can post concise questions to the group.  But I'll get there 
eventually, starting with removing the wordbreak checker for the 
time-being.  Your response was encouraging, at least.


Mark


On 10/1/2015 9:45 AM, Alexandre Rafalovitch wrote:

Hi Mark,

Have you gone through a Solr tutorial yet? If/when you do, you will
see you don't need to code any of this. It is configured as part of
the web-facing total offering which are tweaked by XML configuration
files (or REST API calls). And most of the standard pipelines are
already pre-configured, so you don't need to invent them from scratch.

On your specific question, it would be better to ask what _business_
level functionality you are trying to achieve and see if Solr can help
with that. Starting from Lucene code is less useful :-)

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 1 October 2015 at 07:48, Mark Fenbers  wrote:


Re: Solr vs Lucene

2015-10-01 Thread Alexandre Rafalovitch
Is that with Lucene or with Solr? Because Solr has several different
spell-checker modules you can configure.  I would recommend trying
them first.

And, frankly, I still don't know what your business case is.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 1 October 2015 at 12:38, Mark Fenbers  wrote:
> Yes, and I've spend numerous hours configuring and reconfiguring, and
> eventually even starting over, but still have not getting it to work right.
> Even now, I'm getting bizarre results.  For example, I query   "NOTE: This
> is purely as an example."  and I get back really bizarre suggestions, like
> "n ot e" and "n o te" and "n o t e" for the first word which isn't even
> misspelled!  The same goes for "purely" and "example" also!  Moreover, I get
> extended results showing the frequencies of these suggestions being over
> 2600 occurrences, when I'm not even using an indexed spell checker.  I'm
> only using a file-based spell checker (/usr/shar/dict/words), and the
> wordbreak checker.
>
> At this point, I can't even figure out how to narrow down my confusion so
> that I can post concise questions to the group.  But I'll get there
> eventually, starting with removing the wordbreak checker for the time-being.
> Your response was encouraging, at least.
>
> Mark
>
>
>
> On 10/1/2015 9:45 AM, Alexandre Rafalovitch wrote:
>>
>> Hi Mark,
>>
>> Have you gone through a Solr tutorial yet? If/when you do, you will
>> see you don't need to code any of this. It is configured as part of
>> the web-facing total offering which are tweaked by XML configuration
>> files (or REST API calls). And most of the standard pipelines are
>> already pre-configured, so you don't need to invent them from scratch.
>>
>> On your specific question, it would be better to ask what _business_
>> level functionality you are trying to achieve and see if Solr can help
>> with that. Starting from Lucene code is less useful :-)
>>
>> Regards,
>> Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 1 October 2015 at 07:48, Mark Fenbers  wrote:


Class Loader issues

2015-10-01 Thread Firas Khasawneh
Hi all,

I am trying to load Jackson json library from the 
solr-5.3.1/contrib/clustering/lib directory.
In solrconfig.xml I have the following entry: 

When I start solr, I get the following warning:
SolrResourceLoader

No files added to classloader from lib: /dev/solr-5.3.1/contrib/clustering/lib




The path is correct and regex should load jackson-core-asl-1.9.13.jar  and 
jackson-mapper-asl-1.9.13.jar  but it does not. Any help is appreciated.
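
For reference, the directive I am describing has roughly this shape (the exact
regex here is an approximation, not a copy of my actual entry):

  <lib dir="/dev/solr-5.3.1/contrib/clustering/lib/" regex="jackson-.*\.jar" />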

Thanks,
Firas


Facet queries blow out the filterCache

2015-10-01 Thread Jeff Wartes

I’m doing some fairly simple facet queries in a two-shard 5.3 SolrCloud
index on fields like this:



PoolingClientConnectionManager

2015-10-01 Thread Rallavagu

Solr 4.6.1, single Shard, cloud with 4 nodes

Solr is running on Tomcat configured with 200 threads for thread pool. 
As Solr uses "org.apache.http.impl.conn.PoolingClientConnectionManager" 
for replication, my question is does Solr threads use connections from 
tomcat thread pool or they create their own thread pool? I am trying to 
find out if it would be 200 + Solr threads or not. Thanks.


Re: Using dynamically calculated value for sorting

2015-10-01 Thread bbarani
Thanks for your reply.

Overall design has changed a little bit.

Now I will be sending the SKU id (SKU id is in SOLR document) to an external
API and it will return a new price to me for that SKU based on some logic (I
won't be calculating the new price). 

Once I get that value I need to use that new price value for sorting. 






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-dynamically-calculated-value-for-sorting-tp4231950p4232320.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: PoolingClientConnectionManager

2015-10-01 Thread Rallavagu

Thanks for the response Andrea.

Assuming that Solr has its own thread pool, it appears that 
"PoolingClientConnectionManager" has a maximum of 20 connections per host by 
default. Is there a way to change (increase) this to handle heavy update 
traffic? Thanks.




On 10/1/15 11:05 AM, Andrea Gazzarini wrote:

Hi,
Maybe I could be wrong as your question is related with Solr internals (I
believe the dev list is a better candidate for such questions).

Anyway, my thoughts: unless you're within a JCA inbound component (and Solr
isn't), the JEE specs say you shouldn't start new threads. For this reason,
there's no (standard) way to directly connect to and use the servlet
container threads.

As far as I know Solr 4.x is a standard and JEE compliant web application
so the answer to your question *should* be: "yes, it is using its own
threads"

Best,
Andrea
Solr 4.6.1, single Shard, cloud with 4 nodes

Solr is running on Tomcat configured with 200 threads for thread pool. As
Solr uses "org.apache.http.impl.conn.PoolingClientConnectionManager" for
replication, my question is does Solr threads use connections from tomcat
thread pool or they create their own thread pool? I am trying to find out
if it would be 200 + Solr threads or not. Thanks.



Re: PoolingClientConnectionManager

2015-10-01 Thread Shawn Heisey
On 10/1/2015 11:50 AM, Rallavagu wrote:
> Solr 4.6.1, single Shard, cloud with 4 nodes
>
> Solr is running on Tomcat configured with 200 threads for thread pool.
> As Solr uses
> "org.apache.http.impl.conn.PoolingClientConnectionManager" for
> replication, my question is does Solr threads use connections from
> tomcat thread pool or they create their own thread pool? I am trying
> to find out if it would be 200 + Solr threads or not. Thanks.

I don't know the answer to the actual question you have asked ... but I
do know that keeping the container maxThreads at 200 can cause serious
problems for Solr.  It does not take a very big installation to exceed
200 threads, and users have had problems fixed by increasing
maxThreads.  This implies that the container is able to control the
threads in Solr to some degree.

The Jetty included with all versions of Solr that I have actually
checked (back to 3.2.0) has maxThreads set to 10000, which effectively
removes the thread limit for any typical install.  Very large installs
might need it bumped higher than 10000.

Thanks,
Shawn



Re: Cloud Deployment Strategy... In the Cloud

2015-10-01 Thread Mark Miller
On Wed, Sep 30, 2015 at 10:36 AM Steve Davids  wrote:

> Our project built a custom "admin" webapp that we use for various O&M
> activities so I went ahead and added the ability to upload a Zip
> distribution which then uses SolrJ to forward the extracted contents to ZK,
> this package is built and uploaded via a Gradle build task which makes life
> easy on us by allowing us to jam stuff into ZK which is sitting in a
> private network (local VPC) without necessarily needing to be on a ZK
> machine. We then moved on to creating collection (trivial), and
> adding/removing replicas. As for adding replicas I am rather confused as to
> why I would need specify a specific shard for replica placement, before
> when I threw down a core.properties file the machine would automatically
> come up and figure out which shard it should join based on reasonable
> assumptions - why wouldn't the same logic apply here?


I'd file a JIRA issue for the functionality.


> I then saw that a Rule-based Replica Placement feature was added
> (https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement)
> which I thought would be reasonable, but after looking at the tests it appears to
> still require a shard parameter for adding a replica, which seems to defeat
> the entire purpose.


I was not involved in the addReplica command, but the predefined stuff
worked that way just to make bootstrapping up a cluster really simple. I
don't see why addReplica couldn't follow the same logic if no shard was
specified.


> So after getting bummed out about that, I took a look
> at the delete replica request (since we have machines come and go, we need
> to start dropping them) and found that the delete replica requires a
> collection, shard, and replica name and if I have the name of the machine
> it appears the only way to figure out what to remove is by walking the
> clusterstate tree for all collections and determine which replicas are a
> candidate for removal which seems unnecessarily complicated.
>

You should not need the shard for this call. The collection and replica
core node name will be unique. Another JIRA issue?


>
> Hopefully I don't come off as complaining, but rather looking at it from a
> client perspective, the Collections API doesn't seem simple to use and
> really the only reason I am messing around with it now is because there are
> repeated threats to make "zk as truth" the default in the 5.x branch at
> some point in the future. I would personally advocate that something like
> the autoManageReplicas 
> be
> introduced to make life much simpler on clients as this appears to be the
> thing I am trying to implement externally.
>
> If anyone has happened to to build a system to orchestrate Solr for cloud
> infrastructure and have some pointers it would be greatly appreciated.
>
> Thanks,
>
> -Steve
>
>
> --
- Mark
about.me/markrmiller


Re: PoolingClientConnectionManager

2015-10-01 Thread Rallavagu

Thanks Shawn. This is good data.

On 10/1/15 11:43 AM, Shawn Heisey wrote:

On 10/1/2015 11:50 AM, Rallavagu wrote:

Solr 4.6.1, single Shard, cloud with 4 nodes

Solr is running on Tomcat configured with 200 threads for thread pool.
As Solr uses
"org.apache.http.impl.conn.PoolingClientConnectionManager" for
replication, my question is does Solr threads use connections from
tomcat thread pool or they create their own thread pool? I am trying
to find out if it would be 200 + Solr threads or not. Thanks.


I don't know the answer to the actual question you have asked ... but I
do know that keeping the container maxThreads at 200 can cause serious
problems for Solr.  It does not take a very big installation to exceed
200 threads, and users have had problems fixed by increasing
maxThreads.  This implies that the container is able to control the
threads in Solr to some degree.

The Jetty included with all versions of Solr that I have actually
checked (back to 3.2.0) has maxThreads set to 10000, which effectively
removes the thread limit for any typical install.  Very large installs
might need it bumped higher than 10000.

Thanks,
Shawn



Re: PoolingClientConnectionManager

2015-10-01 Thread Andrea Gazzarini
Hi,
Maybe I could be wrong as your question is related with Solr internals (I
believe the dev list is a better candidate for such questions).

Anyway, my thoughts: unless you're within a JCA inbound component (and Solr
isn't), the JEE specs say you shouldn't start new threads. For this reason,
there's no (standard) way to directly connect to and use the servlet
container threads.

As far as I know Solr 4.x is a standard and JEE compliant web application
so the answer to your question *should* be: "yes, it is using its own
threads"

Best,
Andrea
Solr 4.6.1, single Shard, cloud with 4 nodes

Solr is running on Tomcat configured with 200 threads for thread pool. As
Solr uses "org.apache.http.impl.conn.PoolingClientConnectionManager" for
replication, my question is does Solr threads use connections from tomcat
thread pool or they create their own thread pool? I am trying to find out
if it would be 200 + Solr threads or not. Thanks.


Re: Solr vs Lucene

2015-10-01 Thread Walter Underwood
If you want a spell checker, don’t use a search engine. Use a spell checker. 
Something like aspell (http://aspell.net/ ) will be faster 
and better than Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 1, 2015, at 1:06 PM, Mark Fenbers  wrote:
> 
> This is with Solr.  The Lucene approach (assuming that is what is in my Java 
> code, shared previously) works flawlessly, albeit with fewer options, AFAIK.
> 
> I'm not sure what you mean by "business case"...  I'm wanting to spell-check 
> user-supplied text in my Java app.  The end-user then activates the 
> spell-checker on the entire text (presumably, a few paragraphs or less).  I 
> can use StyledText's capabilities to highlight the misspelled words, and when 
> the user clicks the highlighted word, a menu will appear where he can select 
> a suggested spelling.
> 
> But so far, I've had trouble:
> 
> * determining which words are misspelled (because Solr often returns
>   suggestions for correctly spelled words).
> * getting coherent suggestions (regardless if the query word is
>   misspelled or not).
> 
> It's been a bit puzzling (and frustrating)!!  it only took me 10 minutes to 
> get the Lucene spell checker working, but I agree that Solr would be the 
> better way to go, if I can ever get it configured properly...
> 
> Mark
> 
> 
> On 10/1/2015 12:50 PM, Alexandre Rafalovitch wrote:
>> Is that with Lucene or with Solr? Because Solr has several different
>> spell-checker modules you can configure.  I would recommend trying
>> them first.
>> 
>> And, frankly, I still don't know what your business case is.
>> 
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>> 
>> 
>> On 1 October 2015 at 12:38, Mark Fenbers  wrote:
>>> Yes, and I've spend numerous hours configuring and reconfiguring, and
>>> eventually even starting over, but still have not getting it to work right.
>>> Even now, I'm getting bizarre results.  For example, I query   "NOTE: This
>>> is purely as an example."  and I get back really bizarre suggestions, like
>>> "n ot e" and "n o te" and "n o t e" for the first word which isn't even
>>> misspelled!  The same goes for "purely" and "example" also!  Moreover, I get
>>> extended results showing the frequencies of these suggestions being over
>>> 2600 occurrences, when I'm not even using an indexed spell checker.  I'm
>>> only using a file-based spell checker (/usr/shar/dict/words), and the
>>> wordbreak checker.
>>> 
>>> At this point, I can't even figure out how to narrow down my confusion so
>>> that I can post concise questions to the group.  But I'll get there
>>> eventually, starting with removing the wordbreak checker for the time-being.
>>> Your response was encouraging, at least.
>>> 
>>> Mark
>>> 
>>> 
>>> 
>>> On 10/1/2015 9:45 AM, Alexandre Rafalovitch wrote:
 Hi Mark,
 
 Have you gone through a Solr tutorial yet? If/when you do, you will
 see you don't need to code any of this. It is configured as part of
 the web-facing total offering which are tweaked by XML configuration
 files (or REST API calls). And most of the standard pipelines are
 already pre-configured, so you don't need to invent them from scratch.
 
 On your specific question, it would be better to ask what _business_
 level functionality you are trying to achieve and see if Solr can help
 with that. Starting from Lucene code is less useful :-)
 
 Regards,
 Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/
 
 
 On 1 October 2015 at 07:48, Mark Fenbers  wrote:
> 



Re: PoolingClientConnectionManager

2015-10-01 Thread Rallavagu

Awesome. This is what I was looking for. Will try these. Thanks.

On 10/1/15 1:31 PM, Shawn Heisey wrote:

On 10/1/2015 12:39 PM, Rallavagu wrote:

Thanks for the response Andrea.

Assuming that Solr has it's own thread pool, it appears that
"PoolingClientConnectionManager" has a maximum 20 threads per host as
default. Is there a way to changes this increase to handle heavy
update traffic? Thanks.


You can configure all ShardHandler instances with the solr.xml file.
The shard handler controls SolrJ (and HttpClient) within Solr.

https://cwiki.apache.org/confluence/display/solr/Moving+to+the+New+solr.xml+Format

That page does not go into all the shard handler options, though.  For
that, you need to look at the page for distributed requests ... but
don't configure it in solrconfig.xml as the following link shows,
configure it in solr.xml as shown by the earlier link.

https://cwiki.apache.org/confluence/display/solr/Distributed+Requests#DistributedRequests-ConfiguringtheShardHandlerFactory

Thanks,
Shawn



Re: Solr vs Lucene

2015-10-01 Thread Mark Fenbers
This is with Solr.  The Lucene approach (assuming that is what is in my 
Java code, shared previously) works flawlessly, albeit with fewer 
options, AFAIK.


I'm not sure what you mean by "business case"...  I'm wanting to 
spell-check user-supplied text in my Java app.  The end-user then 
activates the spell-checker on the entire text (presumably, a few 
paragraphs or less).  I can use StyledText's capabilities to highlight 
the misspelled words, and when the user clicks the highlighted word, a 
menu will appear where he can select a suggested spelling.


But so far, I've had trouble:

 * determining which words are misspelled (because Solr often returns
   suggestions for correctly spelled words).
 * getting coherent suggestions (regardless if the query word is
   misspelled or not).

It's been a bit puzzling (and frustrating)!!  it only took me 10 minutes 
to get the Lucene spell checker working, but I agree that Solr would be 
the better way to go, if I can ever get it configured properly...


Mark


On 10/1/2015 12:50 PM, Alexandre Rafalovitch wrote:

Is that with Lucene or with Solr? Because Solr has several different
spell-checker modules you can configure.  I would recommend trying
them first.

And, frankly, I still don't know what your business case is.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 1 October 2015 at 12:38, Mark Fenbers  wrote:

Yes, and I've spend numerous hours configuring and reconfiguring, and
eventually even starting over, but still have not getting it to work right.
Even now, I'm getting bizarre results.  For example, I query   "NOTE: This
is purely as an example."  and I get back really bizarre suggestions, like
"n ot e" and "n o te" and "n o t e" for the first word which isn't even
misspelled!  The same goes for "purely" and "example" also!  Moreover, I get
extended results showing the frequencies of these suggestions being over
2600 occurrences, when I'm not even using an indexed spell checker.  I'm
only using a file-based spell checker (/usr/shar/dict/words), and the
wordbreak checker.

At this point, I can't even figure out how to narrow down my confusion so
that I can post concise questions to the group.  But I'll get there
eventually, starting with removing the wordbreak checker for the time-being.
Your response was encouraging, at least.

Mark



On 10/1/2015 9:45 AM, Alexandre Rafalovitch wrote:

Hi Mark,

Have you gone through a Solr tutorial yet? If/when you do, you will
see you don't need to code any of this. It is configured as part of
the web-facing total offering which are tweaked by XML configuration
files (or REST API calls). And most of the standard pipelines are
already pre-configured, so you don't need to invent them from scratch.

On your specific question, it would be better to ask what _business_
level functionality you are trying to achieve and see if Solr can help
with that. Starting from Lucene code is less useful :-)

Regards,
 Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 1 October 2015 at 07:48, Mark Fenbers  wrote:




Solr 4.7.2 Vs 5.3.0 Docs different for same query

2015-10-01 Thread Ravi Solr
We migrated from 4.7.2 to 5.3.0. I sourced the docs from the 4.7.2 core and
indexed into 5.3.0 collection (data directories are different) via
SolrEntityProcessor. Currently my production is all whack because of this
issue. Do I have to go back and reindex all again ?? Is there a quick fix
for this ?

Here are the results for the query 'obama'...please note the numFound.
4.7.2 has almost 148519 docs while 5.3.0 reports far fewer for the same query. Any
pointers on how to correct this ?


Solr 4.7.2 response header: status=0, QTime=2, params q=obama, start=0

SolrCloud 5.3.0 response header: status=0, QTime=2, params q=obama, start=0


Thanks

Ravi Kiran Bhaskar


Re: Spam handling with ASF mailing lists

2015-10-01 Thread Gora Mohanty
> On 23 September 2015 at 21:10, Upayavira  wrote:
>
> > If you have specific questions about spam handling, then I'd suggest you
> > ask on the ASF infrastructure list, but generally, we can expect that
> > there will be occasions when something that seems obviously spam gets
> > through our systems.
>
> OK, will take this up on the ASF infrastructure list as you suggest.
> Thanks: I was not quite sure where to address this plaint to.
>

Sorry to bug all of you again, but I was annoyed enough once more by the
continuing spam that slips through the cracks to try and complain to ASF
infrastructure lists. Unfortunately, at the very beginning of
http://www.apache.org/dev/infra-mail it says "Participation in these lists
is only available to ASF committers.", and I am not a committer. Would it
be possible for someone to forward this to the infrastructure list?

Not to point fingers, but the spam that I have been responding to would
*not* slip through the open-source-based filters that we maintain for our
small clients. I would think that ASF, with all its resources, would be
able to handle this.

Regards,
Gora


Re: Find records with no values in solr.LatLongType fied type

2015-10-01 Thread Erick Erickson
BTW, there's a JIRA for this, it's a bit clumsy to have to
know about how the coordinate fields are split up

And I like Ishan's idea too!


On Wed, Sep 30, 2015 at 11:51 AM, Ishan Chattopadhyaya
 wrote:
> There's also a function, exists(), which might work here, and result in a
> neater query.
> e.g. something like: q=*:* -exists(usrlatlong_0_coordinate)
> Haven't tried it, though.
> https://cwiki.apache.org/confluence/display/solr/Function+Queries#FunctionQueries-AvailableFunctions
>
> On Wed, Sep 30, 2015 at 8:17 PM, Kamal Kishore Aggarwal <
> kkroyal@gmail.com> wrote:
>
>> Thanks Erick..it worked..
>>
>> On Wed, Sep 16, 2015 at 9:21 PM, Erick Erickson 
>> wrote:
>>
>> > Top level queries need a *:* in front, something like
>> > q=*:* -usrlatlong_0_coordinate:[* TO *]
>> >
>> > I just took a quick check and just using usrlatlong:[* TO *]
>> > encounters a parse error.
>> >
>> > P.S. It would help if you told us what you _did_ receive
>> > when you tried your options. Parse errors? All docs?
>> >
>> > Best,
>> > Erick
>> >
>> > On Mon, Sep 14, 2015 at 10:58 PM, Kamal Kishore Aggarwal
>> >  wrote:
>> > > Hi,
>> > >
>> > > I am working on solr 4.8,1. I am trying to find the docs where
>> > latlongtype
>> > > have null values.
>> > >
>> > > I have tried using these, but not getting the results :
>> > >
>> > > 1) http://localhost:8984/solr/IM-Search/select?q.alt=-usrlatlong:[' '
>> > TO *]
>> > >
>> > > 2) http://localhost:8984/solr/IM-Search/select?q.alt=-usrlatlong:[* TO
>> > *]
>> > >
>> > > Here's the configurations :
>> > > <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
>> > > <field name="usrlatlong" type="location" stored="true"
>> > >        required="false" multiValued="false" />
>> > >
>> > >
>> > > Please help.
>> >
>>


Re: Re-label terms from a shard?

2015-10-01 Thread Erick Erickson
Actually, I think there is an enum field type, see:
https://issues.apache.org/jira/browse/SOLR-5084.

Although the ability to retrofit the current setup is...er...fraught.

You could always write a custom update processor (maybe a
scriptupdateprocessor?) to transform synonyms into the "correct" form, but
then to find _current_ values you'd have to do a lot of other work. For
faceting you'd have to always return all values to get correct counts. Say
you have 100 well behaved clients and 1 ill-behaved one. The X facet counts
will probably be very few relative to x, so combining them would require that
both X and x be returned. With, say, less than a few hundred distinct values
that's certainly possible.
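
A bare-bones sketch of that processor idea, assuming a single-valued field
named "category" (the field and class names here are made up for the example;
this is an illustration, not a drop-in):

import java.io.IOException;
import java.util.Locale;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class NormalizeCategoryProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object v = doc.getFieldValue("category");
        if (v != null) {
          // map the remote site's synonym (e.g. X) to the canonical value (x)
          doc.setField("category", v.toString().toLowerCase(Locale.ROOT));
        }
        super.processAdd(cmd);
      }
    };
  }
}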

How to make the query work is probably as Upayavira suggests.

Best,
Erick

On Tue, Sep 29, 2015 at 8:47 AM, Upayavira  wrote:
>
>
> On Tue, Sep 29, 2015, at 03:38 PM, Dan Bolser wrote:
>> Hi,
>>
>> I'm using sharding 'off label' to integrate data from various remote
>> sites
>> running a common schema.
>>
>> One issue is that the remote sites sometimes use synonyms of the allowed
>> terms in a given field. i.e. we specify that a certain field may only
>> carry
>> the values x, y, and z, but the remote indexes decide to use X, Y, and Z
>> instead.
>>
>> In my 'hub' (the server configured to query over all shards), can I
>> configure a mapping such that the facet only shows x, y and z, instead of
>> x, X, y, Y, z, and Z?
>>
>> I'm not sure how a facet selection would 'magically' filter on the list
>> of
>> all synonyms defined in the mapping.
>>
>> I should have defined this field as an enumeration, but I think the cat's
>> out of the bag now!
>
> I'm not sure there's anything you can do here (without a substantial
> programming effort) other than add a layer in front of Solr that adds
> x+X, y+Y and z+Z.
>
> As such, Solr doesn't have an enumeration data type - you'd have to just
> use a string field and enforce it outside of Solr.
>
> Upayavira


Re: PoolingClientConnectionManager

2015-10-01 Thread Shawn Heisey
On 10/1/2015 12:39 PM, Rallavagu wrote:
> Thanks for the response Andrea.
>
> Assuming that Solr has it's own thread pool, it appears that
> "PoolingClientConnectionManager" has a maximum 20 threads per host as
> default. Is there a way to changes this increase to handle heavy
> update traffic? Thanks.

You can configure all ShardHandler instances with the solr.xml file. 
The shard handler controls SolrJ (and HttpClient) within Solr.

https://cwiki.apache.org/confluence/display/solr/Moving+to+the+New+solr.xml+Format

That page does not go into all the shard handler options, though.  For
that, you need to look at the page for distributed requests ... but
don't configure it in solrconfig.xml as the following link shows,
configure it in solr.xml as shown by the earlier link.

https://cwiki.apache.org/confluence/display/solr/Distributed+Requests#DistributedRequests-ConfiguringtheShardHandlerFactory
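
As a rough illustration, the solr.xml entry looks something like this (the
numbers are arbitrary examples, not recommendations):

  <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
    <int name="maxConnectionsPerHost">100</int>
    <int name="socketTimeout">600000</int>
    <int name="connTimeout">60000</int>
  </shardHandlerFactory>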

Thanks,
Shawn



Zk and Solr Cloud

2015-10-01 Thread Rallavagu

Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.

See the following errors in ZK and Solr; they appear to be connected.

When I see the following error in Zookeeper,

unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len11823809 is out of range!
at 
org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:112)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)



There is the following corresponding error in Solr

caught end of stream exception
EndOfStreamException: Unable to read additional data from client 
sessionid 0x25024c8ea0e, likely client has closed socket
at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)

at java.lang.Thread.run(Thread.java:744)

Any clues as to what is causing these errors? Thanks.


Re: error reporting during indexing

2015-10-01 Thread Erick Erickson
bq: If there is a problem writing the segment, a permission error,

Highly doubtful that this'll occur. When an IndexWriter is opened,
the first thing that's (usually) done is write to the lock file to keep
other Solr instances from writing. That should fail right off the bat, far before
any docs are actually indexed, perhaps with a lock obtain timeout
error message.

And, for that matter, when Solr first starts up it creates the ./data,
./data/index and (perhaps) the ./data/tlog directories and any
permissions errors should be hit then.

I suppose there's some "interesting" stuff possible if someone
out there is changing directory permissions while Solr is running, in
which case you should find them and then slap them silly ;)

IOW I've certainly seen Solr _fail_ to start when it can't access the
right directories, but not fail part way through.
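
For reference, the batch-then-retry pattern mentioned further down this thread
looks roughly like this in SolrJ (sketch only; the URL, batch size and error
handling are placeholders):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {

  public static void indexAll(List<SolrInputDocument> docs) {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    List<SolrInputDocument> batch = new ArrayList<>();
    for (SolrInputDocument doc : docs) {
      batch.add(doc);
      if (batch.size() >= 500) {        // batch size is arbitrary here
        sendBatch(server, batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      sendBatch(server, batch);
    }
  }

  private static void sendBatch(HttpSolrServer server, List<SolrInputDocument> batch) {
    try {
      server.add(batch);                // normal case: one request for the whole batch
    } catch (Exception e) {
      // The batch failed: retry one document at a time to find the offender.
      // Re-adding docs that went in the first time is harmless, they are overwritten.
      for (SolrInputDocument doc : batch) {
        try {
          server.add(doc);
        } catch (Exception perDoc) {
          System.err.println("Document failed: " + doc.getFieldValue("id") + " : " + perDoc);
        }
      }
    }
  }
}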

Best,
Erick

On Tue, Sep 29, 2015 at 1:55 AM, Alessandro Benedetti
 wrote:
> Hi Matteo, at this point I would suggest you this reading by Erick:
>
> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> If I am not wrong, when a document is indexed (simplifying):
> 1) The document is added to the current segment in memory
> 2) When a soft commit happens, we get visibility (no flush happens to
> the disk, but the document is searchable)
> 3) When the hard commit happens, we get durability: we truncate the
> segment in memory and flush it to disk, so if a problem happens
> here, you should see an error on the Solr side, but this does not imply that
> indexing the document failed; actually only the last flush has failed.
>
> Related to point 3, I am not sure what Solr's reaction to this failure is.
> I should investigate.
>
> Cheers
>
>
>
> 2015-09-29 8:53 GMT+01:00 Matteo Grolla :
>
>> Hi Erik,
>> it's a curiosity question. When I add a document it's buffered by Solr
>> and can be (and apparently is) parsed to verify it matches the schema. But it's
>> not written to a segment file until a commit is issued. If there is a
>> problem writing the segment, a permission error, isn't this a case where I
>> would report everything OK when in fact documents are not there?
>>
>> thanks
>>
>> 2015-09-29 2:12 GMT+02:00 Erick Erickson :
>>
>> > You shouldn't be losing errors with HttpSolrServer. Are you
>> > seeing evidence that you are or is this mostly a curiosity question?
>> >
>> > Do note that it's better to batch up docs; your throughput will increase
>> > a LOT. That said, when you do batch (e.g. send 500 docs per update
>> > or whatever) and you get an error back, you're not quite sure what
>> > doc failed. So what people do is retry a failed batch one document
>> > at a time when the batch has errors and rely on Solr overwriting
>> > any docs in the batch that were indexed the first time.
>> >
>> > Best,
>> > Erick
>> >
>> > On Mon, Sep 28, 2015 at 2:27 PM, Matteo Grolla 
>> > wrote:
>> > > Hi,
>> > > if I need fine grained error reporting I use Http Solr server and
>> > send
>> > > 1 doc per request using the add method.
>> > > I report errors on exceptions of the add method,
>> > > I'm using autocommit so I'm not seing errors related to commit.
>> > > Am I losing some errors? Is there a better way?
>> > >
>> > > Thanks
>> >
>>
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England


Re: Facet queries blow out the filterCache

2015-10-01 Thread Mikhail Khludnev
what if you set f.city.facet.limit=-1 ?

On Thu, Oct 1, 2015 at 7:43 PM, Jeff Wartes  wrote:

>
> I’m doing some fairly simple facet queries in a two-shard 5.3 SolrCloud
> index on fields like this:
>
> [field definition mangled by the archive: the city field is declared with docValues="true"]
>
> that look something like this:
> q=...&fl=id,score&facet.field=city&facet=true&facet.mincount=1
> &f.city.facet.limit=50&start=0&rows=0&facet.method=fc
>
> (no, NOT facet.method=enum - the usage of the filterCache there is pretty
> well documented)
>
> Watching the filterCache stats, it appears that every one of these queries
> causes the "inserts" counter to be incremented by one. Distinct "q="
> queries also increase the "size", and eviction happens as normal. If I
> repeat the same query a few times, "lookups" is not incremented, so these
> entries generally appear to be completely wasted. (Although when running a
> lot of these queries, it appears as though a very small set also increment
> the "lookups" counter, but only a small set, and I haven’t figured out why
> some are special.)
>
> So the question is, why does this facet query have anything to do with the
> filterCache? This causes a huge amount of filterCache churn with no
> apparent benefit.
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Facet queries blow out the filterCache

2015-10-01 Thread Jeff Wartes

No change, still shows an insert per-request. As does a simplified request
with only the facet params
"facet.field=city&facet=true"

It’s definitely facet related though, facet=false eliminates the insert.



On 10/1/15, 1:50 PM, "Mikhail Khludnev"  wrote:

>what if you set f.city.facet.limit=-1 ?
>
>On Thu, Oct 1, 2015 at 7:43 PM, Jeff Wartes 
>wrote:
>
>>
>> I’m doing some fairly simple facet queries in a two-shard 5.3 SolrCloud
>> index on fields like this:
>>
>> [field definition mangled by the archive: the city field is declared with docValues="true"]
>>
>> that look something like this:
>> 
>> q=...&fl=id,score&facet.field=city&facet=true&facet.mincount=1
>> &f.city.facet.limit=50&start=0&rows=0&facet.method=fc
>>
>> (no, NOT facet.method=enum - the usage of the filterCache there is
>>pretty
>> well documented)
>>
>> Watching the filterCache stats, it appears that every one of these
>>queries
>> causes the "inserts" counter to be incremented by one. Distinct "q="
>> queries also increase the "size", and eviction happens as normal. If I
>> repeat the same query a few times, "lookups" is not incremented, so
>>these
>> entries generally appear to be completely wasted. (Although when
>>running a
>> lot of these queries, it appears as though a very small set also
>>increment
>> the "lookups" counter, but only a small set, and I haven’t figured out
>>why
>> some are special.)
>>
>> So the question is, why does this facet query have anything to do with
>>the
>> filterCache? This causes a huge amount of filterCache churn with no
>> apparent benefit.
>>
>>
>
>
>-- 
>Sincerely yours
>Mikhail Khludnev
>Principal Engineer,
>Grid Dynamics
>
>
>



Re: highlighting

2015-10-01 Thread Koji Sekiguchi

Hi Mark,

I think I saw a similar requirement recently on the mailing list. The feature sounds
reasonable to me.

> If not, how do I go about posting this as a feature request?

JIRA can be used for the purpose, but there is no guarantee that the feature will be
implemented. :(
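
In the meantime, the existing marker workaround can be made a bit less clunky:
have Solr insert marker characters that cannot appear in the text (hl.simple.pre
and hl.simple.post let you choose them) and convert them to offsets on the client
side. A rough sketch (the marker characters are arbitrary placeholder choices):

import java.util.ArrayList;
import java.util.List;

public class HighlightOffsets {

  // Converts a snippet highlighted with single-character markers into
  // (start, length) pairs relative to the marker-free text.
  public static List<int[]> toOffsets(String highlighted, char pre, char post) {
    List<int[]> offsets = new ArrayList<>();
    StringBuilder clean = new StringBuilder();
    int start = -1;
    for (int i = 0; i < highlighted.length(); i++) {
      char c = highlighted.charAt(i);
      if (c == pre) {
        start = clean.length();                       // highlight begins here
      } else if (c == post) {
        offsets.add(new int[] { start, clean.length() - start });
        start = -1;
      } else {
        clean.append(c);
      }
    }
    return offsets;   // offsets refer to clean.toString(), the marker-free string
  }
}

The request would then set something like hl.simple.pre=%01 and hl.simple.post=%02
(URL-encoded control characters), and the resulting (start, length) pairs map
directly onto SWT StyleRange objects.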

Koji

On 2015/10/01 20:07, Mark Fenbers wrote:

Yeah, I thought about using markers, but then I'd have to search the text
for the markers to
determine the locations.  This is a clunky way of getting the results I want, 
and it would save two
steps if Solr merely had an option to return a start/length array (of what 
should be highlighted) in
the original string rather than returning an altered string with tags inserted.

Mark

On 9/29/2015 7:04 AM, Upayavira wrote:

You can change the strings that are inserted into the text, and could
place markers that you use to identify the start/end of highlighting
elements. Does that work?

Upayavira

On Mon, Sep 28, 2015, at 09:55 PM, Mark Fenbers wrote:

Greetings!

I have highlighting turned on in my Solr searches, but what I get back
is <em> tags surrounding the found term.  Since I use a SWT StyledText
widget to display my search results, what I really want is the offset
and length of each found term, so that I can highlight it in my own way
without HTML.  Is there a way to configure Solr to do that?  I couldn't
find it.  If not, how do I go about posting this as a feature request?

Thanks,
Mark






Re: Facet queries blow out the filterCache

2015-10-01 Thread Jeff Wartes
It still inserts if I address the core directly and use distrib=false.

I’ve got a few collections sharing the same config, so it’s surprisingly
annoying to
change solrconfig.xml right now, but it seemed pretty clear the query is
the thing being cached, since
the cache size only changes when the query does.



On 10/1/15, 3:01 PM, "Mikhail Khludnev"  wrote:

>hm..
>This option was useful for introspecting cache content
>https://wiki.apache.org/solr/SolrCaching#showItems It might help you to
>find out the cause.
>I'm still blaming distributed requests; it's explained here
>https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Over-Re
>questParameters
>eg does it happen if you run with distrib=false?
>
>On Fri, Oct 2, 2015 at 12:27 AM, Jeff Wartes 
>wrote:
>
>>
>> No change, still shows an insert per-request. As does a simplified
>>request
>> with only the facet params
>> "facet.field=city&facet=true"
>>
>by default it's 100
>https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Theface
>t.limitParameter
>and can cause filtering by values, it can be seen in logs, btw.
>
>>
>> It’s definitely facet related though, facet=false eliminates the insert.
>>
>>
>>
>> On 10/1/15, 1:50 PM, "Mikhail Khludnev" 
>> wrote:
>>
>> >what if you set f.city.facet.limit=-1 ?
>> >
>> >On Thu, Oct 1, 2015 at 7:43 PM, Jeff Wartes 
>> >wrote:
>> >
>> >>
>> >> I’m doing some fairly simple facet queries in a two-shard 5.3
>>SolrCloud
>> >> index on fields like this:
>> >>
>> >> [field definition mangled by the archive: the city field is declared with docValues="true"]
>> >>
>> >> that look something like this:
>> >>
>> 
>> >> q=...&fl=id,score&facet.field=city&facet=true&facet.mincount=1
>> >> &f.city.facet.limit=50&start=0&rows=0&facet.method=fc
>> >>
>> >> (no, NOT facet.method=enum - the usage of the filterCache there is
>> >>pretty
>> >> well documented)
>> >>
>> >> Watching the filterCache stats, it appears that every one of these
>> >>queries
>> >> causes the "inserts" counter to be incremented by one. Distinct "q="
>> >> queries also increase the "size", and eviction happens as normal. If
>>I
>> >> repeat the same query a few times, "lookups" is not incremented, so
>> >>these
>> >> entries generally appear to be completely wasted. (Although when
>> >>running a
>> >> lot of these queries, it appears as though a very small set also
>> >>increment
>> >> the "lookups" counter, but only a small set, and I haven’t figured
>>out
>> >>why
>> >> some are special.)
>> >>
>> >> So the question is, why does this facet query have anything to do
>>with
>> >>the
>> >> filterCache? This causes a huge amount of filterCache churn with no
>> >> apparent benefit.
>> >>
>> >>
>> >
>> >
>> >--
>> >Sincerely yours
>> >Mikhail Khludnev
>> >Principal Engineer,
>> >Grid Dynamics
>> >
>> >
>> >
>>
>>
>
>
>-- 
>Sincerely yours
>Mikhail Khludnev
>Principal Engineer,
>Grid Dynamics
>
>
>



Re: highlighting

2015-10-01 Thread Teague James
Hi everyone!

Pardon if it's not proper etiquette to chime in, but that feature would solve 
some issues I have with my app for the same reason. We are using markers now 
and it is very clunky - particularly with phrases and certain special 
characters. I would love to see this feature too Mark! For what it's worth - up 
vote. Thanks!

Cheers!

-Teague James

> On Oct 1, 2015, at 6:12 PM, Koji Sekiguchi  
> wrote:
> 
> Hi Mark,
> 
> I think I saw similar requirement recently in mailing list. The feature 
> sounds reasonable to me.
> 
> > If not, how do I go about posting this as a feature request?
> 
> JIRA can be used for the purpose, but there is no guarantee that the feature 
> is implemented. :(
> 
> Koji
> 
>> On 2015/10/01 20:07, Mark Fenbers wrote:
>> Yeah, I thought about using markers, but then I'd have to search the
>> text for the markers to
>> determine the locations.  This is a clunky way of getting the results I 
>> want, and it would save two
>> steps if Solr merely had an option to return a start/length array (of what 
>> should be highlighted) in
>> the original string rather than returning an altered string with tags 
>> inserted.
>> 
>> Mark
>> 
>>> On 9/29/2015 7:04 AM, Upayavira wrote:
>>> You can change the strings that are inserted into the text, and could
>>> place markers that you use to identify the start/end of highlighting
>>> elements. Does that work?
>>> 
>>> Upayavira
>>> 
 On Mon, Sep 28, 2015, at 09:55 PM, Mark Fenbers wrote:
 Greetings!
 
 I have highlighting turned on in my Solr searches, but what I get back
 is <em> tags surrounding the found term.  Since I use a SWT StyledText
 widget to display my search results, what I really want is the offset
 and length of each found term, so that I can highlight it in my own way
 without HTML.  Is there a way to configure Solr to do that?  I couldn't
 find it.  If not, how do I go about posting this as a feature request?
 
 Thanks,
 Mark
> 


Re: Facet queries blow out the filterCache

2015-10-01 Thread Mikhail Khludnev
hm..
This option was useful for introspecting cache content
https://wiki.apache.org/solr/SolrCaching#showItems It might help you to
find out the cause.
I'm still blaming distributed requests; it's explained here
https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Over-RequestParameters
eg does it happen if you run with distrib=false?
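
For reference, showItems is just an extra attribute on the cache declaration in
solrconfig.xml; a sketch, with the sizes as placeholders only:

<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"
             showItems="32"/>

With that in place, the cached entries should appear in the cache statistics, so
you can see what keys the faceting is actually inserting.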

On Fri, Oct 2, 2015 at 12:27 AM, Jeff Wartes  wrote:

>
> No change, still shows an insert per-request. As does a simplified request
> with only the facet params
> "facet.field=city&facet=true"
>
by default it's 100
https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Thefacet.limitParameter
and can cause filtering by values, it can be seen in logs, btw.

>
> It’s definitely facet related though, facet=false eliminates the insert.
>
>
>
> On 10/1/15, 1:50 PM, "Mikhail Khludnev" 
> wrote:
>
> >what if you set f.city.facet.limit=-1 ?
> >
> >On Thu, Oct 1, 2015 at 7:43 PM, Jeff Wartes 
> >wrote:
> >
> >>
> >> I’m doing some fairly simple facet queries in a two-shard 5.3 SolrCloud
> >> index on fields like this:
> >>
> >> [field definition mangled by the archive: the city field is declared with docValues="true"]
> >>
> >> that look something like this:
> >>
> >> q=...&fl=id,score&facet.field=city&facet=true&facet.mincount=1
> >> &f.city.facet.limit=50&start=0&rows=0&facet.method=fc
> >>
> >> (no, NOT facet.method=enum - the usage of the filterCache there is
> >>pretty
> >> well documented)
> >>
> >> Watching the filterCache stats, it appears that every one of these
> >>queries
> >> causes the "inserts" counter to be incremented by one. Distinct "q="
> >> queries also increase the "size", and eviction happens as normal. If I
> >> repeat the same query a few times, "lookups" is not incremented, so
> >>these
> >> entries generally appear to be completely wasted. (Although when
> >>running a
> >> lot of these queries, it appears as though a very small set also
> >>increment
> >> the "lookups" counter, but only a small set, and I haven’t figured out
> >>why
> >> some are special.)
> >>
> >> So the question is, why does this facet query have anything to do with
> >>the
> >> filterCache? This causes a huge amount of filterCache churn with no
> >> apparent benefit.
> >>
> >>
> >
> >
> >--
> >Sincerely yours
> >Mikhail Khludnev
> >Principal Engineer,
> >Grid Dynamics
> >
> >
> >
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: [poll] virtualization platform for SOLR

2015-10-01 Thread Bernd Fehling
Hi Shawn,

unfortunately we have to run VMs, otherwise we would waste hardware.
I thought other solr users are in the same situation but seems that
other users have tons of hardware available and we are the only one
having to use VMs.
Right, bare metal is always better than any VM.
As you mentioned we have the indexer (master) on one physical machine
and two searchers (slaves) on other physical machines, all together with
other little VMs which are not I/O and CPU heavy.

Regards
Bernd

Am 30.09.2015 um 18:48 schrieb Shawn Heisey:
> On 9/30/2015 3:12 AM, Bernd Fehling wrote:
>> while setting up some new servers (virtual machines) using XEN I was
>> thinking about an alternative like KVM. My last tests with KVM is
>> a while ago and XEN performed much better in the area of I/O and
>> CPU usage.
>> This lead me to the idea to start a poll about virtualization platform and 
>> your experiences.
> 
> I once had a virtualized Solr install with Xen where each VM housed one
> Solr instance with one core.  The index was distributed, so it required
> several VMs for one copy of the index.
> 
> I eliminated the virtualization, used the same hardware as bare metal
> with Linux, still one Solr instance installed on the machine, but with
> multiple Solr cores.  Performance is much better now.
> 
> General advice:  Don't run virtual machines.
> 
> If a virtual environment is the only significant hardware you have
> access to and it's used for more than Solr, then you might need to.  If
> you do run virtual, then minimize the number of VMs, don't put multiple
> replicas of the same index data on the same physical VM host, give each
> Solr VM lots of memory, and don't oversubscribe the memory/cpu on the
> physical VM host.
> 
> Thanks,
> Shawn
> 


Re: Class Loader issues

2015-10-01 Thread Tomoko Uchida
Hi,

Do you have (execute) permission for /dev/solr-5.3.1/contrib/clustering/lib?
I've seen the same warning when I did not have access permission to the library dir.
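
A quick way to check, using the path from your mail:

ls -ld /dev/solr-5.3.1/contrib/clustering/lib
ls -l  /dev/solr-5.3.1/contrib/clustering/lib/jackson-*.jar

The directory needs to be readable and executable (searchable), and the jars
readable, by the user the Solr process runs as.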

Regards,
Tomoko

2015-10-02 1:23 GMT+09:00 Firas Khasawneh :

> Hi all,
>
> I am trying to load the Jackson JSON library from the
> solr-5.3.1/contrib/clustering/lib directory.
> In solrconfig.xml I have the following entry: <lib dir="/dev/solr-5.3.1/contrib/clustering/lib" regex="jackson-.*\.jar"/>
>
> When I start solr, I get the following warning:
> SolrResourceLoader
>
> No files added to classloader from lib:
> /dev/solr-5.3.1/contrib/clustering/lib
>
>
>
>
> The path is correct and the regex should load jackson-core-asl-1.9.13.jar and
> jackson-mapper-asl-1.9.13.jar  but it does not. Any help is appreciated.
>
> Thanks,
> Firas
>


Re: Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd

2015-10-01 Thread Zheng Lin Edwin Yeo
Hi Adrian,

What is your system setup like? By right it shouldn't be an issue if
we use different ports.

In fact, if the various ZooKeeper instances are running on a single machine,
they have to be on different ports in order for it to work.
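
For example, three instances on one box would each need their own zoo.cfg along
these lines (ports, paths and ids are illustrative only; instances 2 and 3 would
use a different clientPort, dataDir and myid):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper/1
clientPort=2181
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

On separate machines, as in your setup, every server can simply use the same
2888:3888 pair, which is what you ended up with.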


Regards,
Edwin



On 1 October 2015 at 18:19, Adrian Liew  wrote:

> Hi all,
>
> The problem below was resolved by appropriately setting my server ip
> addresses to have the following for each zoo.cfg:
>
> server.1=10.0.0.4:2888:3888
> server.2=10.0.0.5:2888:3888
> server.3=10.0.0.6:2888:3888
>
> as opposed to the following:
>
> server.1=10.0.0.4:2888:3888
> server.2=10.0.0.5:2889:3889
> server.3=10.0.0.6:2890:3890
>
> I am not sure why the above can be an issue (by right it should not),
> however I followed the recommendations provided by Zookeeper administration
> guide under RunningReplicatedZookeeper (
> https://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html#sc_RunningReplicatedZooKeeper
> )
>
> Given that I am testing multiple servers in a multiserver environment, it
> will be safe to use 2888:3888 on each server rather than have different
> ports.
>
> Regards,
> Adrian
>
> From: Adrian Liew [mailto:adrian.l...@avanade.com]
> Sent: Thursday, October 1, 2015 5:32 PM
> To: solr-user@lucene.apache.org
> Subject: Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd
>
> Hi there,
>
> Currently, I have setup an azure virtual network to connect my Zookeeper
> clusters together with three Azure VMs. Each VM has an internal IP of
> 10.0.0.4, 10.0.0.5 and 10.0.0.6. I have also setup Solr 5.3.0 which runs in
> Solr Cloud mode connected to all three Zookeepers in an external ensemble
> manner.
>
> I am able to connect to 10.0.0.4 and 10.0.0.6 via the zkCli.cmd after
> starting the Zookeeper services. However for 10.0.0.5, I keep getting the
> below error even if I started the zookeeper service.
>
> [screenshot not preserved in the archive]
>
> I have restarted 10.0.0.5 VM several times and still am unable to connect
> to Zookeeper via zkCli.cmd. I have checked zoo.cfg (making sure ports, data
> and logs are all set correctly) and myid to ensure they have the correct
> configurations.
>
> The simple command line I used to connect to Zookeeper is zkCli.cmd
> -server 10.0.0.5:2182 for example.
>
> Any ideas?
>
> Best regards,
>
> Adrian Liew |  Consultant Application Developer
> Avanade Malaysia Sdn. Bhd. | Consulting Services
> Direct: +(603) 2382 5668
> +6010-2288030
>
>
>


Re: Keyword match distance rule issue

2015-10-01 Thread anil.vadhavane
Hello,

We have tried the Analysis tool. Below is a screenshot of the analysis tool.

[screenshot not preserved in the archive]





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Keyword-match-distance-rule-issue-tp4231624p4232246.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Zk and Solr Cloud

2015-10-01 Thread Shawn Heisey
On 10/1/2015 1:26 PM, Rallavagu wrote:
> Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.
>
> See following errors in ZK and Solr and they are connected.
>
> When I see the following error in Zookeeper,
>
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Packet len11823809 is out of range!

This is usually caused by the overseer queue (stored in zookeeper)
becoming extraordinarily huge, because it's being flooded with work
entries far faster than the overseer can process them.  This causes the
znode where the queue is stored to become larger than the maximum size
for a znode, which defaults to about 1MB.  In this case (reading your
log message that says len11823809), something in zookeeper has gotten to
be 11MB in size, so the zookeeper client cannot read it.

I think the zookeeper server code must be handling the addition of
children to the queue znode through a code path that doesn't pay
attention to the maximum buffer size, just goes ahead and adds it,
probably by simply appending data.  I'm unfamiliar with how the ZK
database works, so I'm guessing here.

If I'm right about where the problem is, there are two workarounds to
your immediate issue.

1) Delete all the entries in your overseer queue using a zookeeper
client that lets you edit the DB directly.  If you haven't changed the
cloud structure and all your servers are working, this should be safe.

2) Set the jute.maxbuffer system property on the startup commandline for
all ZK servers and all ZK clients (Solr instances) to a size that's
large enough to accommodate the huge znode.  In order to do the deletion
mentioned in option 1 above, you might need to increase jute.maxbuffer on
the servers and the client you use for the deletion.
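
Concretely, that means passing the same system property everywhere, for example
(the 15 MB value is only an example sized above the 11.8 MB packet in your log;
SERVER_JVMFLAGS is the usual hook on the ZooKeeper side):

# ZooKeeper servers, e.g. via zkEnv.sh or the environment:
SERVER_JVMFLAGS="-Djute.maxbuffer=15728640"

# Solr nodes, added to the JVM options used to start Solr:
-Djute.maxbuffer=15728640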

These are just workarounds.  Whatever caused the huge queue in the first
place must be addressed.  It is frequently a performance issue.  If you
go to the following link, you will see that jute.maxbuffer is considered
an unsafe option:

http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#Unsafe+Options

In Jira issue SOLR-7191, I wrote the following in one of my comments:

"The giant queue I encountered was about 85 entries, and resulted in
a packet length of a little over 14 megabytes. If I divide 85 by 14,
I know that I can have about 6 overseer queue entries in one znode
before jute.maxbuffer needs to be increased."

https://issues.apache.org/jira/browse/SOLR-7191?focusedCommentId=14347834

Thanks,
Shawn



Re: Solr 4.7.2 Vs 5.3.0 Docs different for same query

2015-10-01 Thread Tomoko Uchida
Are you sure that you've indexed the same data to Solr 4.7.2 and 5.3.0?
If so, I suspect that you have multiple shards and are requesting only one shard.
(In that case, you might get partial results.)

Can you share the HTTP request URL, the schema, and the default search field?
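
One quick way to check the partial-results theory is to ask each core directly
and compare counts (host and core name below are placeholders):

http://localhost:8983/solr/collection1_shard1_replica1/select?q=*:*&rows=0&distrib=false

If the per-core numFound values add up to what 4.7.2 reports, the documents are
all there and the problem is in how the distributed request is being handled.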


2015-10-02 6:09 GMT+09:00 Ravi Solr :

> We migrated from 4.7.2 to 5.3.0. I sourced the docs from the 4.7.2 core and
> indexed into 5.3.0 collection (data directories are different) via
> SolrEntityProcessor. Currently my production is all whack because of this
> issue. Do I have to go back and reindex all again ?? Is there a quick fix
> for this ?
>
> Here are the results for the query 'obama'...please note the numfound.
> 4.7.2 has almost 148519 docs while 5.3.0 reports far fewer. Any
> pointers on how to correct this ?
>
>
> Solr 4.7.2
>
> [response XML mangled by the archive; the surviving header shows status=0,
> QTime=2, q=obama, start=0, but the numFound value did not come through]
>
> SolrCloud 5.3.0
>
> [response XML mangled by the archive; the surviving header shows status=0,
> QTime=2, q=obama, start=0, but the numFound value did not come through]
>
>
> Thanks
>
> Ravi Kiran Bhaskar
>


Re: [poll] virtualization platform for SOLR

2015-10-01 Thread Bernd Fehling
Hi Toke,

I don't get SSDs, only spinning drives.
And as you mentioned, the impact of VMs is not that much if you use spinning 
drives.
It is more the VM software that matters, and that's why we use XEN and not KVM.
With some tuning of sysctl for the VMs it performs well, but bare metal is
still better and should be preferred.

Regards
Bernd


Am 01.10.2015 um 09:44 schrieb Toke Eskildsen:
> Bernd Fehling  wrote:
>> unfortunately we have to run VMs, otherwise we would waste hardware.
>> I thought other solr users are in the same situation but seams that
>> other users have tons of hardware available and we are the only one
>> having to use VMs.
> 
> We have ~5 smaller (< 1M documents) solr setups that runs under VMWare 
> (chosen because that is what Operations use for all their virtualization). We 
> have a single and quite large setup (terabytes of data, billions of 
> documents) that runs alone on dedicated hardware. Then we have the third 
> solution: Multiple independent Solr oriented projects that share the same 
> bare metal. CentOS everywhere BTW.
> 
> We would probably get better hardware utilization by running the hardware 
> sharing setups in a virtualization system, together with some random other 
> projects. But I doubt we would gain much for the cost of rocking the 
> high-performance boat.
> 
> We do have some other bare-metal setups than Solr at our organization (State 
> and University Library, Denmark), but the default for most other projects is 
> to use virtualizations. Going mostly bare metal with Solr was an explicit and 
> performance-driven decision.
> 
> Except for the virtualized instances, we only use local SSDs to hold our 
> index data. That might affect the trade-off as even slight delays in IO 
> becomes visible, when storage access times are < 0.1ms instead of > 1ms. I 
> suspect the relative impact of virtualization is less with spinning drives or 
> networked storage.
> 
> - Toke Eskildsen
> 


Cannot connect to a zookeeper 3.4.6 instance via zkCli.cmd

2015-10-01 Thread Adrian Liew
Hi there,

Currently, I have set up an Azure virtual network to connect my ZooKeeper
cluster together with three Azure VMs. Each VM has an internal IP of 10.0.0.4,
10.0.0.5 and 10.0.0.6. I have also set up Solr 5.3.0, which runs in SolrCloud
mode connected to all three ZooKeepers in an external ensemble.

I am able to connect to 10.0.0.4 and 10.0.0.6 via the zkCli.cmd after starting 
the Zookeeper services. However for 10.0.0.5, I keep getting the below error 
even if I started the zookeeper service.

[screenshot not preserved in the archive]

I have restarted 10.0.0.5 VM several times and still am unable to connect to 
Zookeeper via zkCli.cmd. I have checked zoo.cfg (making sure ports, data and 
logs are all set correctly) and myid to ensure they have the correct 
configurations.

The simple command line I used to connect to Zookeeper is zkCli.cmd -server 
10.0.0.5:2182 for example.

Any ideas?

Best regards,

Adrian Liew |  Consultant Application Developer
Avanade Malaysia Sdn. Bhd. | Consulting Services
Direct: +(603) 2382 5668
+6010-2288030