Re: Highlighting tag problem

2015-12-03 Thread Zheng Lin Edwin Yeo
Hi Scott,

No, what's described in SOLR-8334 is the tag appearing in the result, but at
the wrong position.

For this problem, the situation is that when I do a highlight query, some
of the results in the result set do not contain the search word in title,
content_type, last_modified or url, as specified in my solrconfig.xml
which I posted earlier, and there is no highlight tag in those results. So
I'm not sure why those results are returned.

Regards,
Edwin


On 4 December 2015 at 01:03, Scott Stults  wrote:

> Edwin,
>
> Is this related to what's described in SOLR-8334?
>
>
> k/r,
> Scott
>
> On Thu, Dec 3, 2015 at 5:07 AM, Zheng Lin Edwin Yeo 
> wrote:
>
> > Hi,
> >
> > I'm using Solr 5.3.0.
> > I would like to find out why, during a search, sometimes there is a match in
> > content, but it is not highlighted (the word is not in the stopword
> > list). Did I make any mistakes in my configuration?
> >
> > This is my highlighting request handler from solrconfig.xml.
> >
> > 
> > 
> > explicit
> > 10
> > json
> > true
> > text
> > id, title, content_type, last_modified, url, score 
> >
> > on
> > id, title, content, author, tag
> >true
> > true
> > html
> > 200
> >
> > true
> > signature
> > true
> > 100
> > 
> > 
> >
> >
> > This is my pipeline for the field.
> >
> >   > positionIncrementGap="100">
> >
> >
> >
> > > segMode="SEARCH"/>
> >
> >
> >
> >
> >
> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> >
> > > words="stopwords.txt" />
> >
> > > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >
> > > synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
> >
> >
> >
> > > maxGramSize="15"/>
> >
> >
> >
> >
> >
> > > segMode="SEARCH"/>
> >
> >
> >
> >
> >
> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> >
> > > words="stopwords.txt" />
> >
> > > generateWordParts="0" generateNumberParts="0" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
> >
> > > synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
> >
> >
> >
> > 
> >
> >  
> >
> >
> > Regards,
> > Edwin
> >
>
>
>
> --
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
> | 434.409.2780
> http://www.opensourceconnections.com
>


Use multiple instances simultaneously

2015-12-03 Thread Gian Maria Ricci - aka Alkampfer
Suppose that for some reason you are not able to use SolrCloud and you are
forced to use the old master-slave approach to guarantee high availability.

In such a scenario, if the master fails, applications are still able to
search against the slaves, but clearly no more data can be indexed until the
master is fully restored, unless you configure some more complex topology
(e.g.
http://stackoverflow.com/questions/6362484/apache-solr-failover-support-in-master-slave-setup).

In such a scenario, would it be feasible to simply configure 2 or 3 identical
instances of Solr, and configure the application that transfers data to Solr
to push to all the instances simultaneously (the approach would be an
incremental DIH for some cores, and an external application that pushes data
continuously for the other cores)? What would be the drawbacks of this
approach?

I've googled around, but did not find anything really useful.

 

Thanks for any answer you could give me.

 

--
Gian Maria Ricci
Cell: +39 320 0136949

Re: Protect against duplicates with the Migrate statement

2015-12-03 Thread Shalin Shekhar Mangar
Hi Philippa,

The migrate command actually splits the lucene index from the source
collection and merges it into the target collection, whereas de-duplication
is applied only to incoming updates. Migrate therefore operates at a lower
level than de-duplication, so the two cannot work together. If you want
de-duplication, you have no option but to index documents instead of using
the migrate command.

On Wed, Dec 2, 2015 at 4:29 PM, philippa griggs
 wrote:
> Hello,
>
>
> I'm using Solr 5.2.1 and Zookeeper 3.4.6.
>
>
> I'm implementing two collections - HotDocuments and ColdDocuments . New 
> documents will only be written to HotDocuments and every night I will migrate 
> a chunk of documents into ColdDocuments.
>
>
> In the test environment, I have the Collection API migrate statement working 
> fine. I know this won't handle duplicates ending up in the ColdDocuments 
> collection and I don't expect to have duplicate documents but I would like to 
> protect against it- just in case.
>
>
> We have a unique key and I've tried to implement de-duplication 
> (https://cwiki.apache.org/confluence/display/solr/De-Duplication) but I still 
> end up with duplicates in the ColdDocuments collection.
>
>
>
> Does anyone have any suggestions on how I can protect against duplicates with 
> the migrate statement?  Any ideas would be greatly appreciated.
>
>
> Many thanks
>
> Philippa



-- 
Regards,
Shalin Shekhar Mangar.


Re: Use multiple instances simultaneously

2015-12-03 Thread Shawn Heisey
On 12/3/2015 1:25 AM, Gian Maria Ricci - aka Alkampfer wrote:
> In such a scenario could it be feasible to simply configure 2 or 3
> identical instance of Solr and configure the application that transfer
> data to solr to all the instances simultaneously (the approach will be a
> DIH incremental for some core and an external application that push data
> continuously for other cores)? Which could be the drawback of using this
> approach?

When I first set up Solr, I used replication.  Then version 3.1.0 was
released, including a non-backward-compatible upgrade to javabin, and it
was not possible to replicate between 1.x and 3.x.

This incompatibility meant that it would not be possible to do a gradual
upgrade to 3.x, where the slaves are upgraded first and then the master.

To get around the problem, I basically did exactly what you've
described.  I turned off replication and configured a second copy of my
build program to update what used to be slave servers.

Later, when I moved to a SolrJ program for index maintenance, I made one
copy of the maintenance program capable of updating multiple copies of
the index in parallel.
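As a rough illustration of that pattern, a minimal SolrJ sketch (the URLs and
field names here are hypothetical; the real program also handles batching and
error recovery):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class MultiCopyIndexer {
    public static void main(String[] args) throws Exception {
        // one client per independent copy of the index
        SolrClient[] copies = {
            new HttpSolrClient("http://solr-a:8983/solr/mycore"),
            new HttpSolrClient("http://solr-b:8983/solr/mycore")
        };
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "example");
        // send the same update to every copy, so they stay in step
        for (SolrClient copy : copies) {
            copy.add(doc);
            copy.commit();
        }
        for (SolrClient copy : copies) {
            copy.close();
        }
    }
}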

I have stuck with this architecture through 4.x and moving into 5.x,
even though I could go back to replication or switch to SolrCloud.
Having completely independent indexes allows a great deal of flexibility
with upgrades and testing new configurations, flexibility that isn't
available with SolrCloud or master-slave replication.

Thanks,
Shawn



schema fields and fieldType in solr-5.3.1

2015-12-03 Thread kostali hassan
I started working with Solr 5.x by extracting Solr into D:\solr and running
the Solr server with:

D:\solr\solr-5.3.1\bin>solr start

Then I created a core in standalone mode:

D:\solr\solr-5.3.1\bin>solr create -c mycore

I need to index files from the file system (Word and PDF), and the schema API
doesn't have a "name" field for documents, so I added this field using curl:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
    "name":"name",
    "type":"text_general",
    "stored":true,
    "indexed":true
  }
}' http://localhost:8983/solr/mycore/schema



And I re-indexed all documents with the Windows SimplePostTool:

D:\solr\solr-5.3.1>java -classpath example\exampledocs\post.jar -Dauto=yes
-Dc=mycore -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool
D:\Lucene\document

But even though the field "name" was successfully added, it is empty; the
field "title" gets the name only for PDF documents, not for MS Word (.doc
and .docx).

Then I chose to index with the techproducts example, because it doesn't use
the managed schema API, so I can modify my schema:

D:\solr\solr-5.3.1>solr -e techproducts

techproducts returns the name of all the .xml files indexed.

Then I created a new core, called demo, based on the solr_home
example/techproducts/solr, using the schema.xml (which contains the field
"name") and the solrconfig.xml from techproducts.

When I indexed all the documents, the field "name" exists but is still empty
for every document indexed.

My question is: how can I get just the name of each document (MS Word and
PDF), not the path as in the field "id" or the field "resource_name"? Do I
have to create a new fieldType, or is there another way?

Sorry for my basic English.

Thank you.


Re: curl adapter in solarium 3x

2015-12-03 Thread kostali hassan
Thank you Gora. In fact, cURL is the default adapter for Solarium 3.x, and I
am not using the Zend framework.

2015-12-03 11:05 GMT+00:00 Gora Mohanty :

> On 3 December 2015 at 16:20, kostali hassan 
> wrote:
> > How can I force the connection to explicitly close when it has finished
> > processing, and not be pooled for reuse?
> > Is there a way to tell the server to send a keep-alive timeout (with a
> > default Apache install, it is 15 seconds or 100 requests, whichever comes
> > first)? But cURL will just open another connection when that happens.
>
> These questions seem no longer relevant to the Solr mailing list.
> Please ask on a Solarium mailing list.
>
> In response to your earlier message,  I had sent you a link to the
> Solarium ZendHttpAdapter which seems to allow keepalive, unlike the
> curl adapter. Here it is again:
> http://wiki.solarium-project.org/index.php/V1:Client_adapters . You
> might also find this useful:
> http://framework.zend.com/manual/1.12/en/zend.http.client.advanced.html
>
> Regards,
> Gora
>


Re: curl adapter in solarium 3x

2015-12-03 Thread Gora Mohanty
On 3 December 2015 at 16:20, kostali hassan  wrote:
> How can I force the connection to explicitly close when it has finished
> processing, and not be pooled for reuse?
> Is there a way to tell the server to send a keep-alive timeout (with a
> default Apache install, it is 15 seconds or 100 requests, whichever comes
> first)? But cURL will just open another connection when that happens.

These questions seem no longer relevant to the Solr mailing list.
Please ask on a Solarium mailing list.

In response to your earlier message,  I had sent you a link to the
Solarium ZendHttpAdapter which seems to allow keepalive, unlike the
curl adapter. Here it is again:
http://wiki.solarium-project.org/index.php/V1:Client_adapters . You
might also find this useful:
http://framework.zend.com/manual/1.12/en/zend.http.client.advanced.html

Regards,
Gora


(no subject)

2015-12-03 Thread sabrina_rodrigues
 

Hello! 

I would like to stop receiving emails from lucene apache.
Please remove my email from every list. 

Thanks! 

 

Highlighting tag problem

2015-12-03 Thread Zheng Lin Edwin Yeo
Hi,

I'm using Solr 5.3.0.
I would like to find out why, during a search, sometimes there is a match in
content, but it is not highlighted (the word is not in the stopword list).
Did I make any mistakes in my configuration?

This is my highlighting request handler from solrconfig.xml.



explicit
10
json
true
text
id, title, content_type, last_modified, url, score 

on
id, title, content, author, tag
   true
true
html
200

true
signature
true
100




This is my pipeline for the field.



Regards,
Edwin


Solr 4.8 Overseer/Queue Processing

2015-12-03 Thread Durham, Russell
Hello,

So I’m running SolrCloud 4.8 on a two node cluster and 3 zookeeper instances on 
a separate set of machines. The cluster has roughly 150 collections, each 
running as 1 shard with a replication factor of 2. There is also a management 
tool we have running that will periodically request cluster status from solr to 
update some other monitoring services we have in place. We’re also running the 
instances with Tomcat.

For the most part everything seems to be working great, however every now and 
then the cluster will start timing out when the management tool requests the 
cluster status. When I go to check the /overseer/queue node I can see that the 
work queue is starting to build up and the leader is no longer processing the 
queue. I’ve been bouncing the Tomcat service which fixes the issue but it will 
eventually come back and continuing to bounce the Tomcat service is not really 
an option. I tried searching and found some old threads with the same issue but 
it looked as if whatever it was got resolved in an earlier 4.X version.

We do have plans to upgrade to 5.X early next year at which point will also 
switch to using Jetty instead of Tomcat. There is still some time before that 
happens though so I’d like to figure out what is going on now if possible. I 
tried looking at the solr logs but I can’t seem to find anything that says why 
it’s not processing the queue.

Thanks in advance for any thoughts/suggestions.

Russell

Russell Durham | Senior Software Engineer | MedAssets
5543 Legacy Drive | Plano, TX, 75024 | Work: 972.202.5850 | Mobile: 361.564.7223
rdur...@medassets.com


curl adapter in solarium 3x

2015-12-03 Thread kostali hassan
How can I force the connection to explicitly close when it has finished
processing, and not be pooled for reuse?
Is there a way to tell the server to send a keep-alive timeout (with a
default Apache install, it is 15 seconds or 100 requests, whichever comes
first)? But cURL will just open another connection when that happens.

This is my CakePHP function to index rich data from the file system:


App::import('Vendor', 'autoload', array('file' => 'solarium/vendor/autoload.php'));

public function indexDocument() {
    $config = array(
        "endpoint" => array(
            "localhost" => array(
                "host" => "127.0.0.1",
                "port" => "8983",
                "path" => "/solr",
                "core" => "demo",
            )
        )
    );
    $start = microtime(true);

    if ($_POST) {
        // create a client instance
        $client = new Solarium\Client($config);
        $dossier = $this->request->data['User']['dossier'];
        $dir = new Folder($dossier);
        $files = $dir->find('.*\.*');

        $headers = array('Content-Type: multipart/form-data');

        foreach ($files as $file) {
            $file = new File($dir->pwd() . DS . $file);

            $query = $client->createExtract();
            $query->setFile($file->pwd());
            $query->setCommit(true);
            $query->setOmitHeader(false);

            $doc = $query->createDocument();
            $doc->id = $file->pwd();
            $doc->name = $file->name;
            $doc->title = $file->name();

            $query->setDocument($doc);

            $request = $client->createRequest($query);
            $request->addHeaders($headers);

            $result = $client->executeRequest($request);
        }
    }

    $this->set(compact('start'));
}


Re: Solr Auto-Complete

2015-12-03 Thread Alessandro Benedetti
"Sounds good but I heard "/suggest" component is the recommended way of
doing auto-complete"

This sounds fantastic :)
We "heard" that as well, we know what the suggest component does.
The point is that you would like to retrieve the suggestions + some
consistent payload in different fields.
Current suggest component offers some effort in providing a payload, but
almost all the suggester implementation are based on an FST approach which
aim to be as fast and memory efficient as possible.
Honestly you could experiment and even contribute a customisation if you
want to add a new feature to the suggest component able to return complex
payloads together with the suggestions.
Apart that, it strictly depends of how you want to provide the
autocompletion, there are plenty of different lookups implementation and
plenty of tokenizer/token filters to combine .
So I would confirm what we already said and that Andrea confirmed.

If anyone has played with the suggester suggestions payload, his feedback
is welcome!
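For anyone experimenting, a request against a suggester configured like the
one quoted below typically looks like this (handler and dictionary names
follow the cwiki sample):

curl "http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&suggest.q=elec"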

Cheers


On 3 December 2015 at 06:21, Andrea Gazzarini  wrote:

> Hi Salman,
> a few months ago I was involved in a project similar to map.geoadmin.ch
> and there I had the same need (I also sent an email to this list).
>
> From my side I can further confirm what Alan and Alessandro already
> explained; I followed that approach.
>
> IMHO, that is the "recommended way" if the component's features meet your
> needs (i.e. do not reinvent the wheel), but it seems you're out of those
> bounds.
>
> Best,
> Andrea
> On 2 Dec 2015 21:51, "Salman Ansari"  wrote:
>
> > Sounds good but I heard "/suggest" component is the recommended way of
> > doing auto-complete in the new versions of Solr. Something along the
> lines
> > of this article
> > https://cwiki.apache.org/confluence/display/solr/Suggester
> >
> > <searchComponent name="suggest" class="solr.SuggestComponent">
> >   <lst name="suggester">
> >     <str name="name">mySuggester</str>
> >     <str name="lookupImpl">FuzzyLookupFactory</str>
> >     <str name="dictionaryImpl">DocumentDictionaryFactory</str>
> >     <str name="field">cat</str>
> >     <str name="weightField">price</str>
> >     <str name="suggestAnalyzerFieldType">string</str>
> >     <str name="buildOnStartup">false</str>
> >   </lst>
> > </searchComponent>
> >
> > Can someone confirm this?
> >
> > Regards,
> > Salman
> >
> >
> > On Wed, Dec 2, 2015 at 1:14 PM, Alessandro Benedetti <
> > abenede...@apache.org>
> > wrote:
> >
> > > Hi Salman,
> > > I agree with Alan.
> > > Just configure your schema with the proper analysers .
> > > For the field you want to use for suggestions you are likely to need
> > simply
> > > this fieldType :
> > >
> > >  > > positionIncrementGap="100">
> > > 
> > > 
> > > 
> > >  > > maxGramSize="20"/>
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > >
> > > This is a very sample example, please adapt it to your use case.
> > >
> > > Cheers
> > >
> > > On 2 December 2015 at 09:41, Alan Woodward  wrote:
> > >
> > > > Hi Salman,
> > > >
> > > > It sounds as though you want to do a normal search against a special
> > > > 'suggest' field, that's been indexed with edge ngrams.
> > > >
> > > > Alan Woodward
> > > > www.flax.co.uk
> > > >
> > > >
> > > > On 2 Dec 2015, at 09:31, Salman Ansari wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am looking for auto-complete in Solr but on top of just auto
> > > complete I
> > > > > want as well to return the data completely (not just suggestions),
> > so I
> > > > > want to get back the ids, and other fields in the whole document. I
> > > tried
> > > > > the following 2 approaches but each had issues
> > > > >
> > > > > 1) Used the /suggest component but that returns a very specific
> > format
> > > > > which looks like I cannot customize. I want to return the whole
> > > document
> > > > > that has a matching field and not only the suggestion list. So for
> > > > example,
> > > > > if I write "hard" it returns the results in a specific format as
> > > follows
> > > > >
> > > > >   hard drive
> > > > > hard disk
> > > > >
> > > > > Is there a way to get back additional fields with suggestions?
> > > > >
> > > > > 2) Tried the normal /select component but that does not do
> > > auto-complete
> > > > on
> > > > > portion of the word. So, for example, if I write the query as
> "bara"
> > it
> > > > > DOES NOT return "barack obama". Any suggestions how to solve this?
> > > > >
> > > > >
> > > > > Regards,
> > > > > Salman
> > > >
> > > >
> > >
> > >
> > > --
> > > --
> > >
> > > Benedetti Alessandro
> > > Visiting card : http://about.me/alessandro_benedetti
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Grouping by simhash signature

2015-12-03 Thread Nikola Smolenski
On Wed, Dec 2, 2015 at 9:00 PM, Nickolay41189  wrote:
> I try to implement NearDup detection by  SimHash
>    algorithm in Solr.
> Let's say:
> 1) each document has a field /simhash_signature/ that stores a sequence of
> bits.
> 2) that in order to be considered NearDup, documents must have, at most, 2
> bits that differ in /simhash_signature/
>
>
> *My question:*
> How can I get groups of nearDup by /simhash_signature/?
>
> *Examples:*
>   Input:
> Doc A = 0001000
> Doc B = 100
> Doc C = 111
> Doc D = 0101000
>   Output:
> A -> {B, D}
> B -> {A}
> C -> {}
> D -> {A}

I'm not sure if this is the best solution (or, indeed, if it is at all
possible), but maybe you could store the bit fields as strings, then
use the strdist function to find the Levenshtein distance between the
strings and group by that.
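Something along these lines, for finding the near-duplicates of one given
signature (a sketch; the URL is left unencoded for readability, and strdist
returns a similarity between 0 and 1, so higher means closer):

http://localhost:8983/solr/collection/select?q=*:*
  &fl=id,sim:strdist("0001000",simhash_signature,edit)
  &sort=strdist("0001000",simhash_signature,edit) desc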

-- 
Nikola Smolenski

University of Belgrade
University library ''Svetozar Markovic''


Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

2015-12-03 Thread Shawn Heisey
On 12/3/2015 8:09 AM, Kelly, Frank wrote:
> Just wondering if folks have any suggestions on using Schema.xml vs. Managed 
> Schema going forward.
> 
> Our deployment will be
>> 3 Zk, 3 Shards, 3 replicas
>> Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
>> Planning at least 1 Billion objects indexed (currently < 100 million)
> 
> I'm sure our schema.xml will have changes and fixes and just wondering which 
> approach (schema.xml vs. managed)
> will be easier to deploy / maintain?

In production, you probably want a schema that cannot change.  The
managed schema that you find in the data-driven configuration will
automatically add new fields to the schema if unknown fields are
encountered in your data ... which means that if somehow a typo makes it
through your indexing process, you may not know about the problem until
later.

With a static schema, an indexing request that has an error in a field
name will be rejected and you will receive an error, which is how I
would want Solr to behave.

The data-driven schema is good for prototyping, but because the field
definitions that get added are just a guess by Solr, I would manually
edit the schema before going into production.  Once in production I
would want to be in complete manual control of the schema.

Thanks,
Shawn



Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

2015-12-03 Thread Erick Erickson
Shawn:

Managed schema is _used_ by "schemaless", but they are not the same thing at
all. For "schemaless" (i.e. "data driven"), you need to include the
update processor chains that do the guessing for you and make use of
the managed features to add fields to your schema.

You can also use a managed schema _without_ the processor chains that
enable the "schemaless" update behavior. In this case you do have a static
schema, with the caveat that "static" means that anyone who can post
directly to Solr can change your schema; but if you allow that, someone
issuing managed schema API calls is the least of your worries ;).
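(For reference, the chain in question in the 5.x data_driven configset is
named "add-unknown-fields-to-the-schema"; abridged, it ends in the processor
that does the actual schema mutation -- a sketch, with the type-guessing
processors elided:)

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <!-- parse-boolean/long/double/date "guessing" processors elided -->
  <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
    <str name="defaultFieldType">strings</str>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Boolean</str>
      <str name="fieldType">booleans</str>
    </lst>
    <!-- further type mappings elided -->
  </processor>
</updateRequestProcessorChain>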

That said, I certainly understand wanting to lock down my schema, but
then I'm a control freak.

Best,
Erick



On Thu, Dec 3, 2015 at 7:25 PM, Shawn Heisey  wrote:
> On 12/3/2015 8:09 AM, Kelly, Frank wrote:
>> Just wondering if folks have any suggestions on using Schema.xml vs. Managed 
>> Schema going forward.
>>
>> Our deployment will be
>>> 3 Zk, 3 Shards, 3 replicas
>>> Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
>>> Planning at least 1 Billion objects indexed (currently < 100 million)
>>
>> I'm sure our schema.xml will have changes and fixes and just wondering which 
>> approach (schema.xml vs. managed)
>> will be easier to deploy / maintain?
>
> In production, you probably want a schema that cannot change.  The
> managed schema that you find in the data-driven configuration will
> automatically add new fields to the schema if unknown fields are
> encountered in your data ... which means that if somehow a typo makes it
> through your indexing process, you may not know about the problem until
> later.
>
> With a static schema, an indexing request that has an error in a field
> name will be rejected and you will receive an error, which is how I
> would want Solr to behave.
>
> The data-driven schema is good for prototyping, but because the field
> definitions that get added are just a guess by Solr, I would manually
> edit the schema before going into production.  Once in production I
> would want to be in complete manual control of the schema.
>
> Thanks,
> Shawn
>


Re: Nested Docs issue

2015-12-03 Thread Bogdan Marinescu

Hi Mikhail,

I would expect the same behaviour as for a database: if I have a
field declared as a uniqueKey, then there should only be one document
with that key, regardless of whether it has a child or not.

If you add the childless document first and afterwards the child, then
Solr should append the child to the already existing document (or
rather delete the existing one, as the new one has newer data).

It's weird, because when I query Solr for that ID, I get two documents
when I am only expecting one.
I could do a sort of workaround where I would always pick the
one with children, but I think it would be better to fix this in Solr.

Regarding your issue SOLR-5211
(https://issues.apache.org/jira/browse/SOLR-5211): if a document has
children and you update the parent as childless, or in fact delete the
parent altogether, then the child documents should also be deleted.

I've faced this problem where I was just deleting the parent by id and I
had lots of orphan documents just laying around.

Regards,
Bogdan Marinescu


Regards,
Bogdan Marinescu

On 12/03/2015 06:26 PM, Mikhail Khludnev wrote:

Hello Bogdan,
You described how it works now. That's how it was implemented. And I can
explain why it was done so.

Could you please describe the expected behavior for you?

Notice, I want to enforce nested (block) behavior always in scope of
https://issues.apache.org/jira/browse/SOLR-5211. So, the fields assigned to
parent with child and to a childless single doc will be the same. So far it's
not clear how to amend the semantics.

On Thu, Dec 3, 2015 at 6:35 PM, Bogdan Marinescu <
bogdan.marine...@awinta.com> wrote:


Hi,

I have a problem with nested docs. If I create a document with id: 1 and
fieldA:sometext and then add it to Solr, I get one doc in Solr.

Afterwards, if I add a child/nested doc to this document, I additionally get
a _root_:1 on the document, but the problem is I now have two documents with
the same ID (id: 1) in Solr: one with _root_ and the child/nested doc, and
one without it.

Any ideas why this happens?
Any ideas how to avoid this?

Thanks,








Re: Stop adding content in Solr through /update URL

2015-12-03 Thread Alexandre Rafalovitch
You could add 'enable' flag in the solrconfig.xml and then
enable/disable it differently on different servers:
https://wiki.apache.org/solr/SolrConfigXml#Enable.2Fdisable_components
Example: 
https://github.com/apache/lucene-solr/blob/lucene_solr_5_3_0/solr/server/solr/configsets/sample_techproducts_configs/conf/solrconfig.xml#L1354
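A sketch of what that looks like on an update handler (the property name is
made up; pick your own):

<requestHandler name="/update" class="solr.UpdateRequestHandler"
                enable="${solr.updates.enabled:true}" />

Start the slaves with -Dsolr.updates.enabled=false and the handler is
disabled on those nodes.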

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 3 December 2015 at 08:46, pradeepandey24  wrote:
> We have a master/slave architecture of Solr and we are updating the index on
> the slave server through the ReplicationHandler.
> We want to ensure that nobody can directly update data on the slave server
> using /update via the URL.
> Can we do it? If yes, please tell us how.
>
> Thanks in advance
> Pradeep
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Stop-adding-content-in-Solr-through-update-URL-tp4243365.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Failed to create collection in Solrcloud

2015-12-03 Thread Mugeesh Husain
Thank you Zheng,

I have found the issue: it was the server IP. When I checked, one of my
live collections was pointed to localhost.

So I set the hostname in solr.xml.
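i.e. something like this in the <solrcloud> section of solr.xml (the
hostname is illustrative):

<solrcloud>
  <str name="host">solr1.example.com</str>
  <int name="hostPort">${jetty.port:8983}</int>
  <str name="hostContext">${hostContext:solr}</str>
  <!-- other solrcloud settings unchanged -->
</solrcloud>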




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Failed-to-create-collection-in-Solrcloud-tp4243232p4243360.html
Sent from the Solr - User mailing list archive at Nabble.com.


Stop adding content in Solr through /update URL

2015-12-03 Thread pradeepandey24
We have a master/slave architecture of Solr and we are updating the index on
the slave server through the ReplicationHandler.
We want to ensure that nobody can directly update data on the slave server
using /update via the URL.
Can we do it? If yes, please tell us how.

Thanks in advance
Pradeep



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Stop-adding-content-in-Solr-through-update-URL-tp4243365.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: (no subject)

2015-12-03 Thread Ahmet Arslan
Hi Sabrina,

Please send a blank e-mail to solr-user-unsubscr...@lucene.apache.org if you 
haven't already.
If you still have problems, see : 
https://wiki.apache.org/solr/Unsubscribing%20from%20mailing%20lists

Ahmet


On Thursday, December 3, 2015 11:37 AM, "sabrina_rodrig...@iol.pt" 
 wrote:


Hello! 

I would like to stop receiving emails from lucene apache.
Please remove my email from every list. 

Thanks! 


Re: Highlighting tag problem

2015-12-03 Thread Scott Stults
Edwin,

Is this related to what's described in SOLR-8334?


k/r,
Scott

On Thu, Dec 3, 2015 at 5:07 AM, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I'm using Solr 5.3.0.
> I would like to find out why, during a search, sometimes there is a match in
> content, but it is not highlighted (the word is not in the stopword list).
> Did I make any mistakes in my configuration?
>
> This is my highlighting request handler from solrconfig.xml.
>
> 
> 
> explicit
> 10
> json
> true
> text
> id, title, content_type, last_modified, url, score 
>
> on
> id, title, content, author, tag
>true
> true
> html
> 200
>
> true
> signature
> true
> 100
> 
> 
>
>
> This is my pipeline for the field.
>
>   positionIncrementGap="100">
>
>
>
> segMode="SEARCH"/>
>
>
>
>
>
> words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>
> words="stopwords.txt" />
>
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>
> synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
>
>
>
> maxGramSize="15"/>
>
>
>
>
>
> segMode="SEARCH"/>
>
>
>
>
>
> words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>
> words="stopwords.txt" />
>
> generateWordParts="0" generateNumberParts="0" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>
> synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
>
>
>
> 
>
>  
>
>
> Regards,
> Edwin
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


How to list all collections in solr-4.7.2

2015-12-03 Thread rashi gandhi
Hi all,

I have set up two solr-4.7.2 server instances on two different machines, with
3 zookeeper servers, in solrcloud mode.

Now, I want to retrieve a list of all the collections that I have created in
solrcloud mode.

I tried the LIST command of the collections API, but it's not working with
solr-4.7.2.
Error: unknown command LIST

Please suggest a command that I can use.

Thanks.


Re: AW: Is it possible to sort on a BooleanField?

2015-12-03 Thread Chris Hostetter

: Guess then I must set indexed="true" ;) Is it true the BooleanField may not 
have docValues?

Yeah ... there is an open jira to add this still: SOLR-7264

FWIW: you could also use an EnumField (which does support docvalues) with 
2 values ("true" and "false") ... that should be just as efficient as 
BooleanField even once we add docValues support.
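A sketch of that EnumField variant (the type, field and enum names are made
up):

In schema.xml:

<fieldType name="boolEnum" class="solr.EnumField" docValues="true"
           enumsConfig="enumsConfig.xml" enumName="bool"/>
<field name="my_flag" type="boolEnum" indexed="true" stored="true"/>

And in an enumsConfig.xml next to the schema:

<enumsConfig>
  <enum name="bool">
    <value>false</value>
    <value>true</value>
  </enum>
</enumsConfig>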

: 
: -Ursprüngliche Nachricht-
: Von: Muhammad Zahid Iqbal [mailto:zahid.iq...@northbaysolutions.net] 
: Gesendet: Donnerstag, 3. Dezember 2015 08:01
: An: solr-user
: Betreff: Re: Is it possible to sort on a BooleanField?
: 
: Please share your schema.
: 
: On Thu, Dec 3, 2015 at 11:28 AM, Clemens Wyss DEV 
: wrote:
: 
: > Looks like not. I get to see
: > 'can not sort on a field which is neither indexed nor has doc values:
: > '
: >
: > - Clemens
: >
: 

-Hoss
http://www.lucidworks.com/

Spellcheck error

2015-12-03 Thread Matt Pearce

Hi,

We're using Solr 5.3.1, and we're getting a 
StringIndexOutOfBoundsException from the SpellCheckCollator. I've done 
some investigation, and it looks like the problem is that the corrected 
string is shorter than the original query.


For example, the search term is "theatre", the suggested correction is 
"there". The error is being thrown when replacing the original query 
with the shorter replacement.


This is the stack trace:
java.lang.StringIndexOutOfBoundsException: String index out of range: -2
at 
java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:824)

at java.lang.StringBuilder.replace(StringBuilder.java:262)
at 
org.apache.solr.spelling.SpellCheckCollator.getCollation(SpellCheckCollator.java:235)
at 
org.apache.solr.spelling.SpellCheckCollator.collate(SpellCheckCollator.java:92)
at 
org.apache.solr.handler.component.SpellCheckComponent.addCollationsToResponse(SpellCheckComponent.java:237)
at 
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:202)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:277)


The error looks very similar to those described in 
https://issues.apache.org/jira/browse/SOLR-4489, 
https://issues.apache.org/jira/browse/SOLR-3608 and 
https://issues.apache.org/jira/browse/SOLR-2509, most of which are closed.


Any suggestions would be appreciated, or should I open a JIRA ticket?

Thanks,

Matt

--
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk



Re: Nested Docs issue

2015-12-03 Thread Mikhail Khludnev
Hello Bogdan,
You described how it works now. That's how it was implemented. And I can
explain why it was done so.

Could you please describe the expected behavior for you?

Notice, I want to enforce nested (block) behavior always in scope of
https://issues.apache.org/jira/browse/SOLR-5211. So, the fields assigned to
parent with child and to a childless single doc will be the same. So far it's
not clear how to amend the semantics.

On Thu, Dec 3, 2015 at 6:35 PM, Bogdan Marinescu <
bogdan.marine...@awinta.com> wrote:

> Hi,
>
> I have a problem with nested docs. If I create a document with id: 1 and
> fieldA:sometext and then add it to Solr, I get one doc in Solr.
>
> Afterwards, if I add a child/nested doc to this document, I additionally get
> a _root_:1 on the document, but the problem is I now have two documents with
> the same ID (id: 1) in Solr: one with _root_ and the child/nested doc, and
> one without it.
>
> Any ideas why this happens?
> Any ideas how to avoid this?
>
> Thanks,
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Solr 5: Schema.xml vs. Managed Schema - which is advisable?

2015-12-03 Thread Kelly, Frank
Just wondering if folks have any suggestions on using Schema.xml vs. Managed 
Schema going forward.

Our deployment will be
> 3 Zk, 3 Shards, 3 replicas
> Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
> Planning at least 1 Billion objects indexed (currently < 100 million)

I'm sure our schema.xml will have changes and fixes and just wondering which 
approach (schema.xml vs. managed)
will be easier to deploy / maintain?

Cheers!

-Frank


Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)










Nested Docs issue

2015-12-03 Thread Bogdan Marinescu

Hi,

I have a problem with nested docs. If I create a document with id: 1 and
fieldA:sometext and then add it to Solr, I get one doc in Solr.

Afterwards, if I add a child/nested doc to this document, I additionally
get a _root_:1 on the document, but the problem is I now have two
documents with the same ID (id: 1) in Solr: one with _root_ and the
child/nested doc, and one without it.


Any ideas why this happens?
Any ideas how to avoid this?

Thanks,


Re: How to list all collections in solr-4.7.2

2015-12-03 Thread Pushkar Raste
Will 'wget http://host:port/solr/admin/collections?action=LIST' help?

On 3 December 2015 at 12:12, rashi gandhi  wrote:

> Hi all,
>
> I have setup two solr-4.7.2 server instances on two diff machines with 3
> zookeeper severs in solrcloud mode.
>
> Now, I want to retrieve list of all the collections that I have created in
> solrcloud mode.
>
> I tried LIST command of collections api, but its not working with
> solr-4.7.2.
> Error: unknown command LIST
>
> Please suggest me the command, that I can use.
>
> Thanks.
>


Using properties placeholder ${someProperty} for xml node attribute in solrconfig

2015-12-03 Thread Pushkar Raste
Hi,
I want to make turning the filter cache on/off configurable (I really have a
use case for turning off the filter cache). Can I use property placeholders
like ${someProperty} in the filter cache config, i.e.

<filterCache size="${solr.filterCacheSize:4096}"
             initialSize="${solr.filterCacheInitialSize:2048}"
             autowarmCount="0"/>

In short, can I use property placeholders in the attributes of an xml node in
solrconfig? A follow-up question: provided I can do that, to turn off the
filterCache can I simply set 'solr.filterCacheSize' and
'solr.filterCacheInitialSize' to 0 (zero)?


Can someone put up a guide to integrate uima with solr

2015-12-03 Thread vaibhavlella




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-someone-put-up-a-guide-to-integrate-uima-with-solr-tp4243464.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Spellcheck error

2015-12-03 Thread Dyer, James
Matt,

Can you give some information about how your spellcheck field is analyzed, and
also whether you're using a custom query converter. Also, try placing the bare
terms you want checked in spellcheck.q (e.g., if your query is q=+movie
+theatre, then spellcheck.q=movie theatre). Does it work in this case? Also,
could you give the exact query you're using?

This is the very same bug as in the 3 tickets you mention. We clearly haven't
solved all of the possible ways this bug can be triggered, but we cannot fix
it unless we can come up with a unit test that reliably reproduces it. At the
very least, we should handle these problems better than throwing a SIOOB like
this.

Long term, there is probably a better design we could come up with for how 
terms are identified within queries and how collations are generated.

James Dyer
Ingram Content Group


-Original Message-
From: Matt Pearce [mailto:m...@flax.co.uk] 
Sent: Thursday, December 03, 2015 10:40 AM
To: solr-user
Subject: Spellcheck error

Hi,

We're using Solr 5.3.1, and we're getting a 
StringIndexOutOfBoundsException from the SpellCheckCollator. I've done 
some investigation, and it looks like the problem is that the corrected 
string is shorter than the original query.

For example, the search term is "theatre", the suggested correction is 
"there". The error is being thrown when replacing the original query 
with the shorter replacement.

This is the stack trace:
java.lang.StringIndexOutOfBoundsException: String index out of range: -2
 at 
java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:824)
 at java.lang.StringBuilder.replace(StringBuilder.java:262)
 at 
org.apache.solr.spelling.SpellCheckCollator.getCollation(SpellCheckCollator.java:235)
 at 
org.apache.solr.spelling.SpellCheckCollator.collate(SpellCheckCollator.java:92)
 at 
org.apache.solr.handler.component.SpellCheckComponent.addCollationsToResponse(SpellCheckComponent.java:237)
 at 
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:202)
 at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:277)

The error looks very similar to those described in 
https://issues.apache.org/jira/browse/SOLR-4489, 
https://issues.apache.org/jira/browse/SOLR-3608 and 
https://issues.apache.org/jira/browse/SOLR-2509, most of which are closed.

Any suggestions would be appreciated, or should I open a JIRA ticket?

Thanks,

Matt

-- 
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk



Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

2015-12-03 Thread Jeff Wartes
I’ve never used the managed schema, so I’m probably biased, but I’ve never
seen much of a point to the Schema API.

I need to make changes sometimes to solrconfig.xml, in addition to
schema.xml and other config files, and there’s no API for those, so my
process has been like:

1. Put the entire config directory used by a collection in source control
somewhere. solrconfig.xml, schema.xml, synonyms.txt, everything.
2. Make changes, test, commit
3. “Release” by uploading the whole config dir at a specific commit to ZK
(overwriting any existing files) and issuing a collections API “reload”.


This has the downside that I can upload a broken config and take down my
collection, but with the whole config dir in source control,
I can also easily roll back to any point by uploading an old commit.
You still have to be aware of how the changes you’re making will affect
your current index, but that’s unavoidable.
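Concretely, step 3 looks roughly like this (placeholder names; zkcli.sh ships
under server/scripts/cloud-scripts):

./zkcli.sh -zkhost zk1:2181 -cmd upconfig -confname myconf -confdir ./conf
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"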


On 12/3/15, 7:09 AM, "Kelly, Frank"  wrote:

>Just wondering if folks have any suggestions on using Schema.xml vs.
>Managed Schema going forward.
>
>Our deployment will be
>> 3 Zk, 3 Shards, 3 replicas
>> Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
>> Planning at least 1 Billion objects indexed (currently < 100 million)
>
>I'm sure our schema.xml will have changes and fixes and just wondering
>which approach (schema.xml vs. managed)
>will be easier to deploy / maintain?
>
>Cheers!
>
>-Frank
>
>
>Frank Kelly
>Principal Software Engineer
>Predictive Analytics Team (SCBE/HAC/CDA)
>
>
>
>
>
>
>
>



solr with Isilon HDFS

2015-12-03 Thread Gaurav Patel
Hi

We are facing the below challenge:

Product use case: analytics

Hardware:
3 physical machines with 60 CPU cores and 512 GB RAM each.
EMC Isilon appliance with PB storage. It can be accessed via HDFS or NFS.

Questions:
Can we use SolrCloud for this setup?
How many instances of Solr are recommended per physical machine, and how
much RAM should be allocated to each?
Should ZooKeeper be installed along with Solr on each box, or should it be
installed in 2 separate virtual machines by itself?
Can we run Kafka and Cassandra along with Solr on each physical machine?
Is anybody running Solr with HDFS in production?

Thanks
Gaurav


Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

2015-12-03 Thread Erick Erickson
It Depends (tm).

Managed Schema is way cool if you have a front end that lets you
manipulate the schema via a browser or other program. There's really
no other way to deal with changing the schema from a browser without
allowing uploading xml files, which is a security problem. Trust me on
this one ;).

For people who know the ins and outs of schema.xml, it's often easier
just to edit the raw file and upload it to ZK (or use it locally). And
much faster for mass edits.

So really they're different beasts. The end result is functionally the
same, there's a schema that's read by Solr and used. The managed
schema makes it harder to have typos sneak in and prevent collections
from loading at the expense of fast mass editing.

And there is some ability to change the solrconfig.xml file, see:
https://cwiki.apache.org/confluence/display/solr/Config+API. But again
whether you "should" use that or just manually edit solrconfig.xml is
largely a matter of the tools available and personal taste.
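For example, a common solrconfig property change via that API (a sketch):

curl http://localhost:8983/solr/mycollection/config -H 'Content-type:application/json' -d '{
  "set-property": {"updateHandler.autoCommit.maxTime": 15000}
}'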


bq: will be easier to deploy / maintain


Not a lot of difference here. At the end of the day, you have to
1> have the configs stored somewhere safely in version control (or at
least I think you must)
2> change the files in the config set on Zookeeper
3> reload the collection.

So with manually editing the process to change something you'd
1> get the files from VCS
2> edit them
3> push them to ZK
4> reload the collection (collections API) and verify it was correct
5> save the configs back to VCS.

With managed schema you'd
1> use the managed schema API to make changes
2> reload the collection and verify
3> pull the changes from Zookeeper
4> put them in VCS


Best,
Erick



On Thu, Dec 3, 2015 at 12:09 PM, Don Bosco Durai  wrote:
> My experience is, once managed-schema is created, schema.xml is ignored even
> if present. When both are present, you will get a warning in the Solr log.
>
> I have stopped using schema.xml. Actually, I use it once: I start Solr, and
> after it generates managed-schema, I export that and pretty much just update
> it going forward.
>
> I think the recommended way to manage fields is using API calls, but that
> might not always be possible, e.g. if you have to save the config in a
> source control system. If you are doing that, make sure to update it
> regularly, because if Solr finds a new field name, it will auto-create it
> in the managed-schema and your saved copy will be out of date.
>
> Bosco
>
>
>
>
> On 12/3/15, 11:47 AM, "Jeff Wartes"  wrote:
>
>>I’ve never used the managed schema, so I’m probably biased, but I’ve never
>>seen much of a point to the Schema API.
>>
>>I need to make changes sometimes to solrconfig.xml, in addition to
>>schema.xml and other config files, and there’s no API for those, so my
>>process has been like:
>>
>>1. Put the entire config directory used by a collection in source control
>>somewhere. solrconfig.xml, schema.xml, synonyms.txt, everything.
>>2. Make changes, test, commit
>>3. “Release” by uploading the whole config dir at a specific commit to ZK
>>(overwriting any existing files) and issuing a collections API “reload”.
>>
>>
>>This has the downside that I can upload a broken config and take down my
>>collection, but with the whole config dir in source control,
>>I can also easily roll back to any point by uploading an old commit.
>>You still have to be aware of how the changes you’re making will affect
>>your current index, but that’s unavoidable.
>>
>>
>>On 12/3/15, 7:09 AM, "Kelly, Frank"  wrote:
>>
>>>Just wondering if folks have any suggestions on using Schema.xml vs.
>>>Managed Schema going forward.
>>>
>>>Our deployment will be
 3 Zk, 3 Shards, 3 replicas
 Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
 Planning at least 1 Billion objects indexed (currently < 100 million)
>>>
>>>I'm sure our schema.xml will have changes and fixes and just wondering
>>>which approach (schema.xml vs. managed)
>>>will be easier to deploy / maintain?
>>>
>>>Cheers!
>>>
>>>-Frank
>>>
>>>
>>>Frank Kelly
>>>Principal Software Engineer
>>>Predictive Analytics Team (SCBE/HAC/CDA)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Collection Name is different than what i specify in API

2015-12-03 Thread Chris Hostetter

: I am using solr 4.10 in cloud mode. I am able to create collection using 
: 
: ./zkcli.sh -cmd upconfig -confdir $CONF_DIR -confname techproducts
: -collection techproducts -z $ZOOKEEPER_URL && curl
: 
"http://$SOLR_URL/solr/admin/collections?action=CREATE=techproducts=compositeId=1=1;
: 
: Instead of getting collection as "techproducts" i get collection name as
: "techproducts_shard1_replica1"
: 
: How do i correct this to be "techproducts" ?

the "collection" name you should get is definitely "techproducts" ... the 
"core" name that you get implementing that collection on disk will be 
"techproducts_shard1_replica1" ... if you specified that you wanted 
multiple shards or multiple replicas of shards, then you would get 
multiple Solr cores with names like "techproducts_shard2_replica1", 
"techproducts_shard1_replica2", etc...

Once you create a collection you can send requests it with the appropriate 
URLs...

http://$SOLR_URL/solr/techproducts/select
http://$SOLR_URL/solr/techproducts/update
etc...

...and solr will route requests under the covers to the appropriate 
core(s)

The Admin UI (especially in 4.10) is very "core" centric so that you can 
see the details of every replica, but if you look at the "Cloud" screen in 
the UI it will in fact show you the collections and what cores make up 
that collection...

https://cwiki.apache.org/confluence/display/solr/Cloud+Screens



-Hoss
http://www.lucidworks.com/


Re: Spellcheck error

2015-12-03 Thread Matt Pearce

Hi James,

Thanks for responding.

The query we were testing looks like this:
http://localhost:8983/solr/testdata/select?q=theatre&spellcheck.q=theatre

I did some further investigation, after discovering that omitting the 
spellcheck.q parameter stops the error appearing, and it looks like 
synonym expansion is playing a part in the problem. The spellcheck field 
is essentially the same as text_general in the example schema, with the 
substitution of HTMLStripCharFilterFactory instead of the 
StandardTokenizerFactory at index time:


<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With synonyms enabled, spellcheck.q=theatre is being expanded to seven 
tokens - theatre (3 times), theater, playhouse, studio and workshop. If 
I disable synonyms in the query analyser, "theatre" is used on its own, 
and the error doesn't happen (this is the same behaviour as when I omit 
spellcheck.q).


So, it looks like the quick solution is to disable synonyms in the query 
analyser for that field. I'll do some further investigation tomorrow to 
see if I can figure out why the synonym expansion triggers the problem 
while neither "theatre" nor "theater" on their own do (I can't imagine 
the other three variants are going to make "there" appear as a spelling 
correction).


Cheers,

Matt

On 03/12/15 18:53, Dyer, James wrote:

Matt,

Can you give some information about how your spellcheck field is analyzed and 
also if you're using a custom query converter.  Also, try and place the bare 
terms you want checked in spellcheck.q (ex, if your query is q=+movie +theatre, 
then spellcheck.q=movie theatre).  Does it work in this case?  Also, could you 
give the exact query you're using?

This is the very same bug as in the 3 tickets you mention.  We clearly haven't 
solved all of the possible ways this bug can be triggered.  But we cannot fix 
this unless we can come up with a unit test that reliably reproduces it.  At 
the very least, we should handle these problems better than throwing SIOOB like 
this.

Long term, there is probably a better design we could come up with for how 
terms are identified within queries and how collations are generated.

James Dyer
Ingram Content Group


-Original Message-
From: Matt Pearce [mailto:m...@flax.co.uk]
Sent: Thursday, December 03, 2015 10:40 AM
To: solr-user
Subject: Spellcheck error

Hi,

We're using Solr 5.3.1, and we're getting a
StringIndexOutOfBoundsException from the SpellCheckCollator. I've done
some investigation, and it looks like the problem is that the corrected
string is shorter than the original query.

For example, the search term is "theatre", the suggested correction is
"there". The error is being thrown when replacing the original query
with the shorter replacement.

This is the stack trace:
java.lang.StringIndexOutOfBoundsException: String index out of range: -2
  at
java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:824)
  at java.lang.StringBuilder.replace(StringBuilder.java:262)
  at
org.apache.solr.spelling.SpellCheckCollator.getCollation(SpellCheckCollator.java:235)
  at
org.apache.solr.spelling.SpellCheckCollator.collate(SpellCheckCollator.java:92)
  at
org.apache.solr.handler.component.SpellCheckComponent.addCollationsToResponse(SpellCheckComponent.java:237)
  at
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:202)
  at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:277)

The error looks very similar to those described in
https://issues.apache.org/jira/browse/SOLR-4489,
https://issues.apache.org/jira/browse/SOLR-3608 and
https://issues.apache.org/jira/browse/SOLR-2509, most of which are closed.

Any suggestions would be appreciated, or should I open a JIRA ticket?

Thanks,

Matt



--
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk



Wildcard searches - field:aaaa* works but field:a*a does not

2015-12-03 Thread Kelly, Frank
Hello Lucene folks,

  Newbie here. I've found how Solr does wildcard searches of the form
field:aaaa* using the EdgeNGramFilterFactory:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory

but I can't seem to dig up how to support wildcards in the middle, e.g.
field:a*a.

I guess I am missing a tokenizer / filter somewhere, but I'm not sure where.

Here is my text configuration:
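(The XML did not survive the archiver, so the following is a representative
edge-ngram field type of the kind described, not the original config:)

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- indexes only leading grams, so a*a-style infix wildcards won't match -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>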



-Frank












Collection Name is different than what i specify in API

2015-12-03 Thread abhayd
hi 
I am using solr 4.10 in cloud mode. I am able to create collection using 

./zkcli.sh -cmd upconfig -confdir $CONF_DIR -confname techproducts
-collection techproducts -z $ZOOKEEPER_URL && curl
"http://$SOLR_URL/solr/admin/collections?action=CREATE=techproducts=compositeId=1=1;

Instead of getting collection as "techproducts" i get collection name as
"techproducts_shard1_replica1"

How do i correct this to be "techproducts" ?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Collection-Name-is-different-than-what-i-specify-in-API-tp4243502.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Grouping by simhash signature

2015-12-03 Thread Chris Hostetter

: I try to implement NearDup detection by  SimHash

I'm not really familiar with simhash, but based on your description of it, 
i'm not sure that any of Solr's deduplication, grouping, or collapsing 
features will really help you here...

: 1) each document has a field /simhash_signature/ that stores a sequence of
: bits.
: 2) that in order to be considered NearDup, documents must have, at most, 2
: bits that differ in /simhash_signature/
: 
: *My question:*
: How can I get groups of nearDup by /simhash_signature/?

the problem here is that there is no transitive property in your
definition of a "NearDup" -- as you point out in your example, B & D are
both "NearDups" of A, but B & D are not NearDups of each other.

Some sort of transitive relationship (either in terms of an identical
field value, or a function that can produce identical results for all
documents in a group) is necessary to use Solr's de-duplication,
collapsing, or grouping functionality.

Assuming you wanted results like those below, and you had some existing
"query + sort" that would identify the "main" document result set (the "Doc
A", "Doc B", "Doc C", "Doc D" list in that order) you could -- in theory
-- write a custom DocTransformer that could annotate those documents with
a list of doc IDs that had "NearDup" values for some field (possibly doing
strdist, or some other more efficient binary bit-set diff as a
ValueSource).

If you wanted to pursue implementing a DocTransformer like this as a
plugin, the existing ChildDocTransformerFactory might be a good starting
point for some code to study.

: *Examples:*
:   Input:
: Doc A = 0001000
: Doc B = 100
: Doc C = 111
: Doc D = 0101000
:   Output:
: A -> {B, D}
: B -> {A}
: C -> {}
: D -> {A}


-Hoss
http://www.lucidworks.com/


Re: Using properties placeholder ${someProperty} for xml node attribute in solrconfig

2015-12-03 Thread Erick Erickson
Hmmm, never tried it. You can check by looking at the admin
UI >> plugins/stats >> caches >> filterCache with a property defined like
you want.

And assuming that works, yes: the filterCache is turned off if its size is
zero.

Another option might be to add {!cache=false} to your fq clauses on
the client in this case, if that is possible/convenient.
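i.e., in local-param syntax (a sketch):

q=*:*&fq={!cache=false}category:books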

Best,
Erick

On Thu, Dec 3, 2015 at 11:19 AM, Pushkar Raste  wrote:
> Hi,
> I want to make turning filter cache on/off configurable (I really have a
> use case to turn off filter cache), can I use properties placeholders like
> ${someProperty} in the filter cache config. i.e.
>
>   <filterCache size="${solr.filterCacheSize:4096}"
>                initialSize="${solr.filterCacheInitialSize:2048}"
>                autowarmCount="0"/>
>
> In short, can I use properties placeholders for attributes for xml node in
> solrconfig. Follow up question is, provided I can do that, to turn off
> filterCache can I simply set values 0 (zero) for 'solr.filterCacheSize' and
> 'solr.filterCacheInitialSize'


Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

2015-12-03 Thread Upayavira
They are different beasts, but I bet on the managed schema winning in
the long run.

With the bulk API, you can post a heap of fields/etc in one go, so
basically, rather than pushing the schema to Zookeeper, you push it to
Solr. 
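
For example, a single bulk request (the collection and field names here
are made up) can add several fields at once:

    curl -X POST -H 'Content-type:application/json' --data-binary '{
      "add-field":{"name":"title","type":"text_general","stored":true},
      "add-field":{"name":"author","type":"string","stored":true}
    }' http://localhost:8983/solr/mycollection/schema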

Look at Solr 5.4 when it comes out shortly. It'll change the way you
think about the schema. The managed schema has been there for ages, but
now the UI has support for it in the schema tab. Being able to create
and remove fields that easily really does change how you approach the
schema.

Upayavira

On Thu, Dec 3, 2015, at 08:35 PM, Erick Erickson wrote:
> It Depends (tm).
> 
> Managed Schema is way cool if you have a front end that lets you
> manipulate the schema via a browser or other program. There's really
> no other way to deal with changing the schema from a browser without
> allowing uploading xml files, which is a security problem. Trust me on
> this one ;).
> 
> For people who know the ins and outs of schema.xml, it's often easier
> just to edit the raw file and upload it to ZK (or use it locally). And
> much faster for mass edits.
> 
> So really they're different beasts. The end result is functionally the
> same, there's a schema that's read by Solr and used. The managed
> schema makes it harder to have typos sneak in and prevent collections
> from loading at the expense of fast mass editing.
> 
> And there is some ability to change the solrconfig.xml file, see:
> https://cwiki.apache.org/confluence/display/solr/Config+API. But again
> whether you "should" use that or just manually edit solrconfig.xml is
> largely a matter of the tools available and personal taste.
> 
> 
> bq: will be easier to deploy / maintain
> 
> 
> Not a lot of difference here. At the end of the day, you have to
> 1> have the configs stored somewhere safely in version control (or at
> least I think you must)
> 2> change the files in the config set on Zookeeper
> 3> reload the collection.
> 
> So with manually editing the process to change something you'd
> 1> get the files from VCS
> 2> edit them
> 3> push them to ZK
> 4> reload the collection (collections API) and verify it was correct
> 5> save the configs back to VCS.
> 
> With managed schema you'd
> 1> use the managed schema API to make changes
> 2> reload the collection and verify
> 3> pull the changes from Zookeeper
> 4> put them in VCS
> 
> 
> Best,
> Erick
> 
> 
> 
> On Thu, Dec 3, 2015 at 12:09 PM, Don Bosco Durai 
> wrote:
> > My experience is, once managed-schema is created, then schema.xml even if 
> > present is ignored. When both are present, you will get a warning in the 
> > Solr log.
> >
> > I have stopped using schema.xml. Actually, I use it once, start Solr and 
> > after it generates managed-schema, I export it and pretty much just update 
> > it going forward.
> >
> > I think, the recommended way to manage fields is using API calls, but it 
> > might not always be possible. E.g. you may have to save the config in a 
> > source control system. If you are doing that, make sure to update it 
> > regularly, because if Solr finds a new field name, it will auto-create it 
> > in the managed-schema and your saved copy will be out of date.
> >
> > Bosco
> >
> >
> >
> >
> > On 12/3/15, 11:47 AM, "Jeff Wartes"  wrote:
> >
> >>I’ve never used the managed schema, so I’m probably biased, but I’ve never
> >>seen much of a point to the Schema API.
> >>
> >>I need to make changes sometimes to solrconfig.xml, in addition to
> >>schema.xml and other config files, and there’s no API for those, so my
> >>process has been like:
> >>
> >>1. Put the entire config directory used by a collection in source control
> >>somewhere. solrconfig.xml, schema.xml, synonyms.txt, everything.
> >>2. Make changes, test, commit
> >>3. “Release” by uploading the whole config dir at a specific commit to ZK
> >>(overwriting any existing files) and issuing a collections API “reload”.
> >>
> >>
> >>This has the downside that I can upload a broken config and take down my
> >>collection, but with the whole config dir in source control,
> >>I can also easily roll back to any point by uploading an old commit.
> >>You still have to be aware of how the changes you’re making will affect
> >>your current index, but that’s unavoidable.
> >>
> >>
> >>On 12/3/15, 7:09 AM, "Kelly, Frank"  wrote:
> >>
> >>>Just wondering if folks have any suggestions on using Schema.xml vs.
> >>>Managed Schema going forward.
> >>>
> >>>Our deployment will be
>  3 Zk, 3 Shards, 3 replicas
>  Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
>  Planning at least 1 Billion objects indexed (currently < 100 million)
> >>>
> >>>I'm sure our schema.xml will have changes and fixes and just wondering
> >>>which approach (schema.xml vs. managed)
> >>>will be easier to deploy / maintain?
> >>>
> >>>Cheers!
> >>>
> >>>-Frank
> >>>
> >>>
> >>>Frank Kelly
> >>>Principal Software Engineer
> 

Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

2015-12-03 Thread Don Bosco Durai
My experience is, once managed-schema is created, then schema.xml even if 
present is ignored. When both are present, you will get a warning in the Solr 
log.

I have stopped using schema.xml. Actually, I use it once, start Solr and after 
it generates managed-schema, I export it and pretty much just update it going 
forward. 

I think, the recommended way to manage fields is using API calls, but it might 
not always be possible. E.g. you may have to save the config in a source 
control system. If you are doing that, make sure to update it regularly, 
because if Solr finds a new field name, it will auto-create it in the 
managed-schema and your saved copy will be out of date.

Bosco




On 12/3/15, 11:47 AM, "Jeff Wartes"  wrote:

>I’ve never used the managed schema, so I’m probably biased, but I’ve never
>seen much of a point to the Schema API.
>
>I need to make changes sometimes to solrconfig.xml, in addition to
>schema.xml and other config files, and there’s no API for those, so my
>process has been like:
>
>1. Put the entire config directory used by a collection in source control
>somewhere. solrconfig.xml, schema.xml, synonyms.txt, everything.
>2. Make changes, test, commit
>3. “Release” by uploading the whole config dir at a specific commit to ZK
>(overwriting any existing files) and issuing a collections API “reload”.
>
>
>This has the downside that I can upload a broken config and take down my
>collection, but with the whole config dir in source control,
>I can also easily roll back to any point by uploading an old commit.
>You still have to be aware of how the changes you’re making will affect
>your current index, but that’s unavoidable.
>
>
>On 12/3/15, 7:09 AM, "Kelly, Frank"  wrote:
>
>>Just wondering if folks have any suggestions on using Schema.xml vs.
>>Managed Schema going forward.
>>
>>Our deployment will be
>>> 3 Zk, 3 Shards, 3 replicas
>>> Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
>>> Planning at least 1 Billion objects indexed (currently < 100 million)
>>
>>I'm sure our schema.xml will have changes and fixes and just wondering
>>which approach (schema.xml vs. managed)
>>will be easier to deploy / maintain?
>>
>>Cheers!
>>
>>-Frank
>>
>>
>>Frank Kelly
>>Principal Software Engineer
>>Predictive Analytics Team (SCBE/HAC/CDA)
>>
>>
>>
>>
>>
>>
>>
>>
>



Re: How to list all collections in solr-4.7.2

2015-12-03 Thread Jeff Wartes
Looks like LIST was added in 4.8, so I guess you’re stuck looking at ZK,
or finding some tool that looks in ZK for you.

The zkCli.sh that ships with zookeeper would probably suffice for a
one-off manual inspection:
https://zookeeper.apache.org/doc/trunk/zookeeperStarted.html#sc_ConnectingToZooKeeper
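
For example (assuming ZooKeeper on localhost:2181 and no chroot), the
collection names live under the /collections znode:

    ./zkCli.sh -server localhost:2181
    [zk: localhost:2181(CONNECTED) 0] ls /collections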



On 12/3/15, 12:05 PM, "Pushkar Raste"  wrote:

>Will 'wget http://host:port/solr/admin/collections?action=LIST' help?
>
>On 3 December 2015 at 12:12, rashi gandhi  wrote:
>
>> Hi all,
>>
>> I have set up two solr-4.7.2 server instances on two different machines with
>> 3 zookeeper servers in solrcloud mode.
>>
>> Now, I want to retrieve list of all the collections that I have created
>>in
>> solrcloud mode.
>>
>> I tried the LIST command of the collections API, but it's not working with
>> solr-4.7.2.
>> Error: unknown command LIST
>>
>> Please suggest a command that I can use.
>>
>> Thanks.
>>



Re: schema fileds and Typefield in solr-5.3.1

2015-12-03 Thread Erick Erickson
Have you looked at Solr Cell? See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

When working with things like MS word, there are a couple of things to
be aware of:
1> there has to be a mapping between the meta-data (last_edited,
author, whatever) and the field in Solr you want that meta-data to go
to.
2> each type of document may have different meta-data meaning the same thing.
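
For point 1, the fmap.* and literal.* parameters on the extract handler do
that mapping; a sketch (the field and file names are hypothetical):

    curl 'http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&fmap.author=author_s&commit=true' \
      -F 'myfile=@example.docx'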

The other alternative is to use Tika directly in a Java program and
take full control of what goes where, here's an example (you can
remove the database stuff easily):
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
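
A bare-bones version of that approach (a sketch only; it assumes Tika and
SolrJ 5.x on the classpath and a core named "mycore") could look like:

    import java.io.InputStream;
    import java.nio.file.*;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaIndexer {
      public static void main(String[] args) throws Exception {
        Path path = Paths.get(args[0]);
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(path)) {
          parser.parse(in, handler, metadata);
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", path.toString());
        // take the name from the file system, so it is set for every file type
        doc.addField("name", path.getFileName().toString());
        doc.addField("content", handler.toString());
        try (HttpSolrClient client =
                 new HttpSolrClient("http://localhost:8983/solr/mycore")) {
          client.add(doc);
          client.commit();
        }
      }
    }

Taking the "name" field from the path itself sidesteps the problem below,
where the title metadata is filled for PDFs but not for .doc/.docx files.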

Best,
Erick

On Thu, Dec 3, 2015 at 4:00 AM, kostali hassan
 wrote:
> I started working with Solr 5.x by extracting Solr into D:\solr and running
> the Solr server with:
>
> D:\solr\solr-5.3.1\bin>solr start
>
> Then I created a core in standalone mode:
>
> D:\solr\solr-5.3.1\bin>solr create -c mycore
>
> I need to index documents from the file system (Word and PDF), and the schema
> doesn't have a "name" field for the documents, so I added this field using curl:
>
> curl -X POST -H 'Content-type:application/json' --data-binary '{
>
>   "add-field":{
>
>  "name":"name",
>
>  "type":"text_general",
>
>  "stored":true,
>
>  "indexed":true }
>
> }' http://localhost:8983/solr/mycore/schema
>
>
>
> And re-indexed all documents with the Windows SimplePostTool:
>
> D:\solr\solr-5.3.1>java -classpath example\exampledocs\post.jar -Dauto=yes
> -Dc=mycore -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool
> D:\Lucene\document
>
>
>
> But even though the field "name" was successfully added, it is empty; the
> field "title" gets the name only for PDF documents, not for MS Word (.doc and
> .docx).
>
>
>
> Then I chose to index with the techproducts example, because it uses a
> classic schema.xml rather than the managed schema API, so I can modify my
> schema:
>
>
>
> D:\solr\solr-5.3.1>solr -e techproducts
>
>
>
> Techproducts returns the names of all the .xml files indexed.
>
>
>
> Then I created a new core based on the solr_home example/techproducts/solr,
> and I used the schema.xml (which contains the field "name") and
> solrconfig.xml from techproducts in this new core, called demo.
>
> When I indexed all the documents, the field "name" existed but was still
> empty for every document indexed.
>
>
>
> My question is: how can I get just the name of each document (MS Word and
> PDF), not the path as in the field "id" or the field "resource_name"? Do I
> have to create a new field type, or is there another way?
>
>
>
> Sorry for my basic English.
>
> Thank you.


Re: Wildcard searches - field:aaaa* works but field:a*a does not

2015-12-03 Thread Erik Hatcher
You don't need to ngram at all if your queries themselves are going to be 
wildcarded. 
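
A plain analyzed field is enough (a sketch; the type name is made up):

    <fieldType name="text_plain" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Wildcard queries like field:a*a are multi-term queries matched against the
indexed terms directly, so no ngramming is needed at index time.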

   Erik

> On Dec 3, 2015, at 17:21, Kelly, Frank  wrote:
> 
> Hello Lucene Folks,
> 
>  Newbie here -  I've found how Solr does wildcard searches of the form   
> field:aaaa*   using the EdgeNGramFilterFactory
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
> 
> but I can't seem to dig up how to support wildcards in the middle e.g. 
> field:a*a.
> 
> I guess I am missing a Tokenizer / Filter somewhere but not sure where,
> 
> Here is my text configuration
> 
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="64"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> -Frank
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 


Re: Collection Name is different than what i specify in API

2015-12-03 Thread Upayavira


On Thu, Dec 3, 2015, at 10:49 PM, Chris Hostetter wrote:
> 
> : I am using solr 4.10 in cloud mode. I am able to create collection
> using 
> : 
> : ./zkcli.sh -cmd upconfig -confdir $CONF_DIR -confname techproducts
> : -collection techproducts -z $ZOOKEEPER_URL && curl
> :
> "http://$SOLR_URL/solr/admin/collections?action=CREATE&name=techproducts&router.name=compositeId&numShards=1&replicationFactor=1"
> : 
> : Instead of getting collection as "techproducts" i get collection name
> as
> : "techproducts_shard1_replica1"
> : 
> : How do i correct this to be "techproducts" ?
> 
> the "collection" name you should get is definitely "techproducts" ... the 
> "core" name that you get implementing that collection on disk will be 
> "techproducts_shard1_replica1" ... if you specified that you wanted 
> multiple shards or multiple replicas of shards, then you would get 
> multiple Solr cores with names like "techproducts_shard2_replica1", 
> "techproducts_shard1_replica2", etc...
> 
> Once you create a collection you can send requests to it with the
> appropriate 
> URLs...
> 
>   http://$SOLR_URL/solr/techproducts/select
>   http://$SOLR_URL/solr/techproducts/update
>   etc...
> 
> ...and solr will route requests under the covers to the appropriate 
> core(s)
> 
> The Admin UI (especially in 4.10) is very "core" centric so that you can 
> see the details of every replica, but if you look at the "Cloud" screen
> in 
> the UI it will in fact show you the collections and what cores make up 
> that collection...
> 
> https://cwiki.apache.org/confluence/display/solr/Cloud+Screens

The UI from 5.4 will fix this - it will show separate drop downs for
collections and cores, which I hope will make this much clearer.

Upayavira


RE: Help With Phrase Highlighting

2015-12-03 Thread Teague James
Thanks everyone who replied! The FastVectorHighlighter did the trick. Here
is how I configured it:

In solrconfig.xml:
In the requestHandler I added:
<str name="hl">on</str>
<str name="hl.fl">text</str>
<str name="hl.useFastVectorHighlighter">true</str>
<str name="hl.fragsize">100</str>

In schema.xml:
I modified the text field:
<field name="text" type="text_general" indexed="true" stored="true"
   termVectors="true" termPositions="true" termOffsets="true"/>
I restarted Solr, re-indexed the documents and tested. All phrases are
correctly highlighted as phrases! Thanks everyone!
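
For reference, a request against this setup looks something like (the
collection name is hypothetical):

    http://localhost:8983/solr/collection1/select?q=text:%22some+phrase%22&hl=true

with each matched phrase wrapped as a unit in <em> tags in the highlighting
section of the response.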

-Teague