Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-02 Thread Kevin Lee
I’ve found that completely exiting Chrome or Firefox and opening it back up 
re-prompts for credentials when they are required.  It was re-prompting with 
the /browse path (where authentication was working) each time I completely 
exited and started the browser again. However, it won’t re-prompt unless you 
exit completely and close all running instances, so I closed all instances each 
time to test.

However, to make sure, I ran it via the command line with curl as suggested, and 
it still does not give any authentication error.  I get a success response from 
all the Solr instances saying that the reload was successful.

Not sure why the pre-canned permissions aren’t working, but the one to the 
request handler at the /browse path is.
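
(For reference, the command-line check described above would look roughly like the following; host, port, 
and credentials are placeholders, and "inventory" is the collection named in the security.json quoted below:

  curl -u admin:password "http://localhost:8983/solr/admin/collections?action=RELOAD&name=inventory"

With collection-admin-edit enforced, the same call without -u should be rejected with a 401/403 rather 
than returning the success responses described above.)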


> On Sep 1, 2015, at 11:03 PM, Noble Paul  wrote:
> 
> " However, after uploading the new security.json and restarting the
> web browser,"
> 
> The browser remembers your login, so it is unlikely to prompt for the
> credentials again.
> 
> Why don't you try the RELOAD operation using the command line (curl)?
> 
> On Tue, Sep 1, 2015 at 10:31 PM, Kevin Lee  wrote:
>> The restart issues aside, I’m trying to lock down usage of the Collections 
>> API, but that does not seem to be working either.
>> 
>> Here is my security.json.  I’m using the “collection-admin-edit” permission 
>> and assigning it to the “adminRole”.  However, after uploading the new 
>> security.json and restarting the web browser, it doesn’t seem to be 
>> requiring credentials when calling the RELOAD action on the Collections API. 
>>  The only thing that seems to work is the custom permission “browse” which 
>> is requiring authentication before allowing me to pull up the page.  Am I 
>> using the permissions correctly for the RuleBasedAuthorizationPlugin?
>> 
>> {
>>"authentication":{
>>   "class":"solr.BasicAuthPlugin",
>>   "credentials": {
>>"admin”:” ",
>>"user": ” "
>>}
>>},
>>"authorization":{
>>   "class":"solr.RuleBasedAuthorizationPlugin",
>>   "permissions": [
>>{
>>"name":"security-edit",
>>"role":"adminRole"
>>},
>>{
>>"name":"collection-admin-edit”,
>>"role":"adminRole"
>>},
>>{
>>"name":"browse",
>>"collection": "inventory",
>>"path": "/browse",
>>"role":"browseRole"
>>}
>>],
>>   "user-role": {
>>"admin": [
>>"adminRole",
>>"browseRole"
>>],
>>"user": [
>>"browseRole"
>>]
>>}
>>}
>> }
>> 
>> Also tried adding the permission using the Authorization API, but it had no effect; 
>> it still isn’t protecting the Collections API from being invoked without a 
>> username and password.  I do see in the Solr logs that it sees the updates 
>> because it outputs the messages “Updating /security.json …”, “Security node 
>> changed”, “Initializing authorization plugin: 
>> solr.RuleBasedAuthorizationPlugin” and “Authentication plugin class obtained 
>> from ZK: solr.BasicAuthPlugin”.
>> 
>> Thanks,
>> Kevin
>> 
>>> On Sep 1, 2015, at 12:31 AM, Noble Paul  wrote:
>>> 
>>> I'm investigating why restarts or first time start does not read the
>>> security.json
>>> 
>>> On Tue, Sep 1, 2015 at 1:00 PM, Noble Paul  wrote:
 I removed that statement
 
 "If activating the authorization plugin doesn't protect the admin ui,
 how does one protect access to it?"
 
 One does not need to protect the admin UI. You only need to protect
 the relevant API calls. I mean it's OK to not protect the CSS and
 HTML stuff.  But if you perform an action to create a core or do a
 query through the admin UI, it will automatically prompt you for
 credentials (if those APIs are protected).
 
 On Tue, Sep 1, 2015 at 12:41 PM, Kevin Lee  
 wrote:
> Thanks for the clarification!
> 
> So is the wiki page incorrect at
> https://cwiki.apache.org/confluence/display/solr/Basic+Authentication+Plugin
>  which says that the admin ui will require authentication once the 
> authorization plugin is activated?
> 
> "An authorization plugin is also available to configure Solr with 
> permissions to perform various activities in the system. Once activated, 
> access to the Solr 

Re: 'missing content stream' issuing expungeDeletes=true

2015-09-02 Thread Derek Poh

There are around 6+ million documents in the collection.

Each document (or product record) is unique in the collection.
When we found out the document has a docFreq of 2, we did a query on the 
document's product id and indeed 2 documents were returned.
We suspect 1 of them is deleted but not removed from the index. We tried 
optimizing. Only 1 document is returned when we query again, and the 
document's docFreq is 1.


We checked the source data and the document is not duplicated.
It could be the way we index (full index every time) that results in this 
scenario of having 2 copies of the same document in the index.


On 9/2/2015 12:11 PM, Erick Erickson wrote:

How many documents total are in your corpus? And how many do you
intend to have?

My point is that if you are testing this with a small corpus, the results
are very likely different than when you test on a reasonable corpus.
So if you expect your "real" index will contain many more docs than
what you're testing, this is likely a red herring.

But something isn't making a lot of sense here. You say you've traced it
to having a docfreq of 2 that changes to 1. But that means that the
value is unique in your entire corpus, which kind of indicates you're
trying to boost on unique values which is unusual.

If you're confident in your model though, the only way to guarantee
what you want is to optimize/expungeDeletes.

Best,
Erick

On Tue, Sep 1, 2015 at 7:51 PM, Derek Poh  wrote:

Erick

Yes, we see documents changing their position in the list due to having
deleted docs.
In our search result, we apply a higher boost (bq) to a group of matched
documents to have them display at the top tier of the result.
At times 1 or 2 of these documents are not returned in the top tier; they are
relegated down to the lower tier of the result. We discovered that these
documents have a lower score due to docFreq=2.
After we do an optimize, these 1-2 documents are back in the top tier result
order and their docFreq is 1.



On 9/1/2015 11:40 PM, Erick Erickson wrote:

Derek:

Why do you care? What evidence do you have that this matters
_practically_?

If you've looked at scoring with a small number of documents, you'll see
significant
differences due to deleted documents. In most cases, as you get a larger
number
of documents the ranking of documents in an index with no deletions .vs.
indexes
that have deletions is usually not noticeable.

I'm suggesting that this is a red herring. Your specific situation may
be different
of course, but since scoring is really only about ranking docs
relative to each other,
unless the relative positions change enough to be noticeable it's not a
problem.

Note that I'm saying "relative rankings", NOT "absolute score". Document
scores
have no meaning outside comparisons to other docs _in the same query_. So
unless you see documents changing their position in the list due to
having deleted
docs, it's not worth spending time on IMO.

Best,
Erick

On Tue, Sep 1, 2015 at 12:45 AM, Upayavira  wrote:

I wonder if this resolves it [1]. It has been applied to trunk, but not
to the 5.x release branch.

If you needed it in 5.x, I wonder if there's a way that particular
choice could be made configurable.

Upayavira

[1] https://issues.apache.org/jira/browse/LUCENE-6711
On Tue, Sep 1, 2015, at 02:43 AM, Derek Poh wrote:

Hi Upayavira

In fact we are using optimize currently but were advised to use expungeDeletes
as it is less resource-intensive.
So expungeDeletes will only remove deleted documents; it will not merge
all index segments into one?

If we don't use optimize, the deleted documents in the index will affect
the scores (with docFreq=2) of the matched documents which will affect
the relevancy of the search result.

Derek

On 9/1/2015 12:05 AM, Upayavira wrote:

If you really must expunge deletes, use optimize. That will merge all
index segments into one, and in the process will remove any deleted
documents.
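
(A minimal sketch of such an optimize call against the update handler; the host is a placeholder and 
"supplier" is the collection from the curl further down:

  curl 'http://localhost:8983/solr/supplier/update?optimize=true'

Optimize merges all segments into one, so it also drops deleted documents, at the cost of rewriting 
the whole index.)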

Why do you need to expunge deleted documents anyway? It is generally
done in the background for you, so you shouldn't need to worry about
it.

Upayavira

On Mon, Aug 31, 2015, at 06:46 AM, davidphilip cherian wrote:

Hi,

The below curl command worked without error; you can try it:

curl http://localhost:8983/solr/techproducts/update?commit=true -H
"Content-Type: text/xml" --data-binary '<commit expungeDeletes="true"/>'

However, after executing this, I could still see the same deleted counts
on the dashboard: Deleted Docs: 6.
I am not sure whether that means the command did not take effect, or it
took effect but did not reflect in the dashboard view.





On Mon, Aug 31, 2015 at 8:51 AM, Derek Poh 
wrote:


Hi

I tried doing an expungeDeletes=true with the following but get the message
'missing content stream'. What am I missing? Do I need to provide additional
parameters?

curl 'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true'

Thanks,
Derek


Using bq param for negative boost

2015-09-02 Thread Kevin Lee
Hi,

I’m trying to boost all results using the bq param with edismax where termA and 
termB do not appear in the field, but if phraseC appears it doesn’t matter if 
termA and termB appear.

The following works and boosts everything that doesn’t have termA and termB in 
myField so the effect is that all documents with termA and termB are pushed to 
the bottom of the result list.

myField:(*:* -termA -termB)^1
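
(For context, a sketch of how that clause is passed as an additive bq parameter on an edismax request; 
the collection, handler, and query are placeholders, and the caret is URL-encoded as %5E:

  curl 'http://localhost:8983/solr/mycollection/select?defType=edismax&q=some+query&bq=myField:(*:*+-termA+-termB)%5E1'

The clause itself is the same one shown above; nothing else about the query changes.)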

How would you add the second part where if phraseC is present, then termA and 
termB can be present?

Tried doing something like the following, but it is not working.

myField:(*:* ((-termA -termB) OR +"phraseC"))^1

Thanks!

Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-02 Thread Noble Paul
" However, after uploading the new security.json and restarting the
web browser,"

The browser remembers your login, so it is unlikely to prompt for the
credentials again.

Why don't you try the RELOAD operation using the command line (curl)?

On Tue, Sep 1, 2015 at 10:31 PM, Kevin Lee  wrote:
> The restart issues aside, I’m trying to lock down usage of the Collections 
> API, but that does not seem to be working either.
>
> Here is my security.json.  I’m using the “collection-admin-edit” permission 
> and assigning it to the “adminRole”.  However, after uploading the new 
> security.json and restarting the web browser, it doesn’t seem to be requiring 
> credentials when calling the RELOAD action on the Collections API.  The only 
> thing that seems to work is the custom permission “browse” which is requiring 
> authentication before allowing me to pull up the page.  Am I using the 
> permissions correctly for the RuleBasedAuthorizationPlugin?
>
> {
> "authentication":{
>"class":"solr.BasicAuthPlugin",
>"credentials": {
> "admin”:” ",
> "user": ” "
> }
> },
> "authorization":{
>"class":"solr.RuleBasedAuthorizationPlugin",
>"permissions": [
> {
> "name":"security-edit",
> "role":"adminRole"
> },
> {
> "name":"collection-admin-edit”,
> "role":"adminRole"
> },
> {
> "name":"browse",
> "collection": "inventory",
> "path": "/browse",
> "role":"browseRole"
> }
> ],
>"user-role": {
> "admin": [
> "adminRole",
> "browseRole"
> ],
> "user": [
> "browseRole"
> ]
> }
> }
> }
>
> Also tried adding the permission using the Authorization API, but it had no effect; 
> it still isn’t protecting the Collections API from being invoked without a 
> username and password.  I do see in the Solr logs that it sees the updates 
> because it outputs the messages “Updating /security.json …”, “Security node 
> changed”, “Initializing authorization plugin: 
> solr.RuleBasedAuthorizationPlugin” and “Authentication plugin class obtained 
> from ZK: solr.BasicAuthPlugin”.
>
> Thanks,
> Kevin
>
>> On Sep 1, 2015, at 12:31 AM, Noble Paul  wrote:
>>
>> I'm investigating why restarts or first time start does not read the
>> security.json
>>
>> On Tue, Sep 1, 2015 at 1:00 PM, Noble Paul  wrote:
>>> I removed that statement
>>>
>>> "If activating the authorization plugin doesn't protect the admin ui,
>>> how does one protect access to it?"
>>>
>>> One does not need to protect the admin UI. You only need to protect
>>> the relevant API calls. I mean it's OK to not protect the CSS and
>>> HTML stuff.  But if you perform an action to create a core or do a
>>> query through the admin UI, it will automatically prompt you for
>>> credentials (if those APIs are protected).
>>>
>>> On Tue, Sep 1, 2015 at 12:41 PM, Kevin Lee  
>>> wrote:
 Thanks for the clarification!

 So is the wiki page incorrect at
 https://cwiki.apache.org/confluence/display/solr/Basic+Authentication+Plugin
  which says that the admin ui will require authentication once the 
 authorization plugin is activated?

 "An authorization plugin is also available to configure Solr with 
 permissions to perform various activities in the system. Once activated, 
 access to the Solr Admin UI and all requests will need to be authenticated 
 and users will be required to have the proper authorization for all 
 requests, including using the Admin UI and making any API calls."

 If activating the authorization plugin doesn't protect the admin ui, how 
 does one protect access to it?

 Also, the issue I'm having is not just at restart.  According to the docs 
 security.json should be uploaded to Zookeeper before starting any of the 
 Solr instances.  However, I tried to upload security.json before starting 
 any of the Solr instances, but it would not pick up the security config 
 until the Solr instances were already running and I uploaded the 
 security.json again.  I can see in the logs at startup that the Solr 
 instances don't see any plugin enabled even though security.json is 
 already in zookeeper and then after they are started and the 

Highlighting snippets truncated when matching large number of indexed documents

2015-09-02 Thread hsharma mailinglists
Hi there,

I'm observing that the snippets being returned in the highlighting
section of the response are getting truncated. However, this behavior
is being seen only when the query matches a large number of documents
and the results requested are near the end of the Solr-returned
overall results list.

I'm using Solr 5.2.1 (Java 1.8.0_51) and my document is defined in
terms of the following two fields, as specified in the schema file:

  

  
  

  
  

  
  
  


  
  

  

  
  
  

Hence, the fields of interest are called "name" and "name_edgengram".

I search for the word 'data' and Solr indicates that there are 565
results. I retrieve 10 results at a time, and the highlighting works
fine till I make a request to Solr for getting 10 results starting at
number 490. The http request made is >>

http://localhost:8983/solr/mycore/select?q=name%3A%22data%22+OR+name_edgengram%3A%22data%22&start=490&fl=id%2Cname&wt=json&indent=true&hl=true&hl.fl=name%2Cname_edgengram&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E&hl.highlightMultiTerm=true&hl.fragSize=0

My highlighting parameters are specified at query-time. I get the
following json response from Solr >>


{
  "responseHeader": {
"status": 0,
"QTime": 76,
"params": {
  "q": "name:\"data\" OR name_edgengram:\"data\"",
  "hl": "true",
  "hl.simple.post": "",
  "indent": "true",
  "fl": "id,name",
  "start": "490",
  "hl.fragSize": "0",
  "hl.fl": "name,name_edgengram",
  "wt": "json",
  "hl.simple.pre": "",
  "hl.highlightMultiTerm": "true"
}
  },
  "response": {
"numFound": 565,
"start": 490,
"docs": [
  {
"name":
"software/information-management/cq-image-jsp-/content/sascom/en_us/software/data-management/jcr:content/par/tabctrl_d036/tab-2-tabImage",
"id": "p-798-pn9058800-uu303582258"
  },
  {
"name":
"en_us/whitepapers/how-to-advance-your-data-mining-predictive-analytics-with-modern-techniques-106219.html",
"id": "p-798-pn9677905-uu304125128"
  },
  {
"name":
"en_us/insights/cq-image-jsp-/content/sascom/en_us/insights/data-management/jcr:content/par/tabctrl_4a63/tab-0/styledcontainer_231d/par/styledcontainer_3919/par/image_a747",
"id": "p-798-pn9058609-uu303582055"
  },
  {
"name":
"software/smb/cq-textimage-jsp-/content/sascom/en_us/software/small-midsize-business/desktop-data-mining/jcr:content/par/styledcontainer_6b5c/par/contentcarousel_ea6/cntntcarousel/textimage_e28",
"id": "p-798-pn9058629-uu303582076"
  },
  {
"name":
"en_us/whitepapers/harvard-business-review-the-evolution-of-decision-making-how-leading-organizations-are-adopting-a-data-driven-culture105998.html",
"id": "p-798-pn9677481-uu297657017"
  },
  {
"id": "kw-798-3075204",
"name": "mpp database"
  },
  {
"id": "kw-798-951983",
"name": "In-Database Analytics"
  },
  {
"id": "kw-798-3075206",
"name": "in-memory database"
  },
  {
"name": "software/data_mining/",
"id": "p-798-pn30459505-uu376483712"
  },
  {
"name": "rnd/datavisualization/",
"id": "p-798-pn68559-uu524630"
  }
]
  },
  "highlighting": {
"p-798-pn9058800-uu303582258": {
  "name": [

"software/information-management/cq-image-jsp-/content/sascom/en_us/software/data"
  ],
  "name_edgengram": [

"software/information-management/cq-image-jsp-/content/sascom/en_us/software/data"
  ]
},
"p-798-pn9677905-uu304125128": {
  "name": [

"en_us/whitepapers/how-to-advance-your-data-mining-predictive-analytics-with"
  ],
  "name_edgengram": [

"en_us/whitepapers/how-to-advance-your-data-mining-predictive-analytics-with"
  ]
},
"p-798-pn9058609-uu303582055": {
  "name": [

"en_us/insights/cq-image-jsp-/content/sascom/en_us/insights/data-management"
  ],
  "name_edgengram": [

"en_us/insights/cq-image-jsp-/content/sascom/en_us/insights/data-management"
  ]
},
"p-798-pn9058629-uu303582076": {
  "name": [

"-business/desktop-data-mining/jcr:content/par/styledcontainer_6b5c/par/contentcarousel_ea6/cntntcarousel"
  ],
  "name_edgengram": [

"-business/desktop-data-mining/jcr:content/par/styledcontainer_6b5c/par/contentcarousel_ea6/cntntcarousel"
  ]
},
"p-798-pn9677481-uu297657017": {
  "name": [

"-leading-organizations-are-adopting-a-data-driven-culture105998.html"
  ],
  "name_edgengram": [

"-leading-organizations-are-adopting-a-data-driven-culture105998.html"
  ]
},
"kw-798-3075204": {
  "name_edgengram": [
"mpp database"
  ]
},
"kw-798-951983": {
  "name_edgengram": [
"In-Database Analytics"
  ]
},
"kw-798-3075206": {
  "name_edgengram": [
"in-memory database"

Re: Solr cloud hangs, log4j contention issue observed

2015-09-02 Thread Arnon Yogev
Thank you Shawn,

We are indeed using Tomcat, maxThreads was set to 2000 (Normally seen <600 
active threads under load).

I attached the complete stack trace of http-bio-8443-exec-37460 below. 
The thread is marked as "Waiting on Condition", and does not mention any 
lock it's waiting for.
Looking at the code of Category.callAppenders, the thread hangs in line 
204.

198 public
199 void callAppenders(LoggingEvent event) {
200   int writes = 0;
201
202   for(Category c = this; c != null; c=c.parent) {
203     // Protected against simultaneous call to addAppender, removeAppender,...
204     synchronized(c) {
205       if(c.aai != null) {
206         writes += c.aai.appendLoopOnAppenders(event);
207       }
208       if(!c.additive) {
209         break;
210       }
211     }
212   }
213
214   if(writes == 0) {
215     repository.emitNoAppenderWarning(this);
216   }
217 }


Of course, I understand this is Solr's mailing list and not log4j's. So I 
wanted to know whether this problem happens to be familiar in Solr, and whether 
a fix or some workaround exists.

Thanks,
Arnon


3XMTHREADINFO      "http-bio-8443-exec-37460" J9VMThread:0x7FED88044600, j9thread_t:0x7FE73E4D04A0, java/lang/Thread:0x7FF267995468, state:CW, prio=5
3XMJAVALTHREAD        (java/lang/Thread getId:0xA1AC9, isDaemon:true)
3XMTHREADINFO1        (native thread ID:0x17F8, native priority:0x5, native policy:UNKNOWN)
3XMTHREADINFO2        (native stack address range from:0x7FEA9487B000, to:0x7FEA948BC000, size:0x41000)
3XMCPUTIME            CPU usage total: 55.216798962 secs
3XMHEAPALLOC          Heap bytes allocated since last GC cycle=3176200 (0x307708)
3XMTHREADINFO3        Java callstack:
4XESTACKTRACE            at org/apache/log4j/Category.callAppenders(Category.java:204)
4XESTACKTRACE            at org/apache/log4j/Category.forcedLog(Category.java:391(Compiled Code))
4XESTACKTRACE            at org/apache/log4j/Category.log(Category.java:856(Compiled Code))
4XESTACKTRACE            at org/slf4j/impl/Log4jLoggerAdapter.error(Log4jLoggerAdapter.java:498)
4XESTACKTRACE            at org/apache/solr/common/SolrException.log(SolrException.java:109)
4XESTACKTRACE            at org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:153(Compiled Code))
4XESTACKTRACE            at org/apache/solr/core/SolrCore.execute(SolrCore.java:1916(Compiled Code))
4XESTACKTRACE            at org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:780(Compiled Code))
4XESTACKTRACE            at org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427(Compiled Code))
4XESTACKTRACE            at org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217(Compiled Code))
4XESTACKTRACE            at org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241(Compiled Code))
4XESTACKTRACE            at org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208(Compiled Code))
4XESTACKTRACE            at org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:220(Compiled Code))
4XESTACKTRACE            at org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:122(Compiled Code))
4XESTACKTRACE            at org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171(Compiled Code))
4XESTACKTRACE            at org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:102(Compiled Code))
4XESTACKTRACE            at org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:950(Compiled Code))
4XESTACKTRACE            at org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:116(Compiled Code))
4XESTACKTRACE            at org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408(Compiled Code))
4XESTACKTRACE            at org/apache/coyote/http11/AbstractHttp11Processor.process(AbstractHttp11Processor.java:1040(Compiled Code))
4XESTACKTRACE            at org/apache/coyote/AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607(Compiled Code))
4XESTACKTRACE            at org/apache/tomcat/util/net/JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:314(Compiled Code))
5XESTACKTRACE               (entered lock: org/apache/tomcat/util/net/SocketWrapper@0x7FF253C18FA8, entry count: 1)
4XESTACKTRACE            at java/util/concurrent/ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156(Compiled Code))
4XESTACKTRACE            at java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626(Compiled Code))
4XESTACKTRACE            at 

Please add me to SolrWiki contributors

2015-09-02 Thread Gaurav Kumar
Hi
I am working on writing an open source tool for the Solr Camel component; it 
would be great if you could add me to the list of contributors.
Also, I realized that you guys have upgraded the wiki to Solr 5.3, but we are 
using Solr 4, and suddenly there is no information available for the older 
version.
Is there a way you guys can keep information about previous versions as well?
My username is "GauravKumar".
Thanks
Gaurav Kumar

Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread scott chu
 
Hi solr-user,

Sorry, wrong again. Auto sharding is not the implicit router.
- Original Message - 
From: scott chu 
To: solr-user 
Date: 2015-09-02, 23:50:20
Subject: Re: Re: Re: concept and choice: custom sharding or auto sharding?


 
Hi solr-user,

Thanks! I'll go back to check my old environment and that article is really 
helpful.

BTW, I think I got it wrong about compositeID. In the reference guide, it says 
compositeID needs numShards. That means what I describe in question 5 seems 
wrong, because I intend to plan one shard per whole year of news articles, and I 
thought SolrCloud would create a new shard for me by itself when I add a new year's 
articles. But since compositeID needs numShards to be specified first, there's no 
way I can know in advance how many years I will put in SolrCloud. It looks 
like if I want to use SolrCloud after all, I may have to use auto sharding (i.e. 
the implicit router).
- Original Message - 
From: Erick Erickson 
To: solr-user 
Date: 2015-09-02, 23:30:53
Subject: Re: Re: concept and choice: custom sharding or auto sharding?


bq: Why do you say: "at 10M documents there's rarely a need to shard at all?"

Because I routinely see 50M docs on a single node and I've seen over 300M docs
on a single node with sub-second responses. So if you're saying that
you see poor
performance at 1M docs then I suspect there's something radically
wrong with your
setup. Too little memory, very bad query patterns, whatever. If my
suspicion is true,
then sharding will just mask the underlying problem.

You need to quantify your performance concerns. It's one thing to say
"my node satisfies 50 queries-per-second with 500ms response time" and
another to say "My queries take 5,000 ms".

In the first case, you do indeed need to add more servers to increase QPS if
you need 500 QPS. And adding more slaves is the best way to do that.
In the second, you need to understand the slowdown because sharding
will be a band-aid.

This might help:
https://wiki.apache.org/solr/SolrPerformanceProblems

Best,
Erick



On Wed, Sep 2, 2015 at 8:19 AM, scott chu  wrote:
>
> Hi solr-user,
>
> Do you mean I only have to put 10M documents in one index and copy it to

> many slaves in a classic Solr master-slave architecture to provide querying
> service on the internet, and it won't have an obvious downgrade of query
> performance? But I did add 1M documents into one index on the master and

> provide 2 slaves to serve querying service on internet, the query
> performance is kinda sad. Why do you say: "at 10M documents there's rarely a
> need to shard at all?" Do I provide too few slaves? What amount of documents
> is suitable for a need for shard in SolrCloud?
>
> - Original Message -
>
> From: Erick Erickson
> To: solr-user
> Date: 2015-09-02, 23:00:29
> Subject: Re: concept and choice: custom sharding or auto sharding?
>
> Frankly, at 10M documents there's rarely a need to shard at all.
> Why do you think you need to? This seems like adding
> complexity for no good reason. Sharding should only really
> be used when you have too many documents to fit on a single
> shard as it adds some overhead, restricts some
> possibilities (cross-core join for instance, a couple of
> grouping options don't work in distributed mode etc.).
>
> You can still run SolrCloud and have it manage multiple
> _replicas_ of a single shard for HA/DR.
>
> So this seems like an XY problem, you're asking specific
> questions about shard routing because you think it'll
> solve some problem without telling us what the problem
> is.
>
> Best,
> Erick
>
> On Wed, Sep 2, 2015 at 7:47 AM, scott chu  wrote:
>> I post a question on Stackoverflow
>> http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud:
>> However, since this is a mail-list, I repost the question below to request
>> for suggestion and more subtle concept of SolrCloud's behavior on document
>> routing.
>> I want to establish a SolrCloud cluster for over 10 million news
>> articles. After reading this article in the Apache Solr Reference Guide: Shards
>> and Indexing Data in SolrCloud, I have a plan as follows:
>> Add prefix ED2001! to document ID where ED means some newspaper source and
>> 2001 is the year part in published date of news article, i.e. I want to put
>> all news articles of a specific newspaper source published in a specific year
>> to a shard.
>> Create collection with router.name set to compositeID.
>> Add documents?
>> Query Collection?
>> Practically, I got some questions:
>> How to add documents based on this plan? Do I have to specify special
>> parameters when updating the collection/core?
>> Is this called "custom sharding"? If not, what is "custom sharding"?
>> Is auto sharding a better choice for my case since there's a
>> shard-splitting feature for auto sharding when the shard is too big?
>> Can I query without _router_ parameter?
>> EDIT @ 2015/9/2:
>> This is how I think SolrCloud will behave: 

Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Erick Erickson
Yes, that is an intentional limit for the size of a single token,
which strings are.

Why not use deduplication? See:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

You don't have to replace the existing documents, and Solr will
compute a hash that can be used to identify identical documents
and you can use _that_.
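
(One hedged sketch of using that hash at query time, assuming the signature field from the 
De-Duplication page is indexed; collection and query are placeholders:

  curl 'http://localhost:8983/solr/mycollection/select?q=*:*&group=true&group.field=signature&group.limit=1'

Grouping on the signature field collapses documents that hashed to identical content, so only one of 
each duplicate set comes back.)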

Best
Erick

On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
 wrote:
> Hi,
>
> I would like to check: is it true that string bytes must be at most 32766 characters
> in length?
>
> I'm trying to do a copyField of my rich-text documents content to a field
> with fieldType=string to try out my getting distinct result for content, as
> there are several documents with the exact same content, and we only want
> to list one of them during searching.
>
> However, I get the following errors in some of the documents when I tried
> to index them with the copyField. Some of my documents are quite large in
> size, and there is a possibility that it exceed 32766 characters. Is there
> any other ways to overcome this problem?
>
>
> org.apache.solr.common.SolrException: Exception writing document id
> collection1_polymer100 to the index; possible analysis error.
> at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
> at
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> at
> org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.eclipse.jetty.server.Server.handle(Server.java:497)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: Document contains at least
> one immense term in field="signature" (whose UTF8 encoding is longer than
> the max length 32766), all of which were skipped.  Please correct the
> 

Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread scott chu
 
Hi solr-user,

Do you mean I only have to put 10M documents in one index and copy it to many 
slaves in a classic Solr master-slave architecture to provide querying service 
on the internet, and it won't have an obvious downgrade of query performance? But I 
did add 1M documents into one index on the master and provide 2 slaves to serve 
querying service on the internet, and the query performance is kinda sad. Why do you 
say: "at 10M documents there's rarely a need to shard at all?" Do I provide too 
few slaves? What amount of documents is enough to need sharding in 
SolrCloud?

- Original Message - 
From: Erick Erickson 
To: solr-user 
Date: 2015-09-02, 23:00:29
Subject: Re: concept and choice: custom sharding or auto sharding?


Frankly, at 10M documents there's rarely a need to shard at all.
Why do you think you need to? This seems like adding
complexity for no good reason. Sharding should only really
be used when you have too many documents to fit on a single
shard as it adds some overhead, restricts some
possibilities (cross-core join for instance, a couple of
grouping options don't work in distributed mode etc.).

You can still run SolrCloud and have it manage multiple
_replicas_ of a single shard for HA/DR.

So this seems like an XY problem, you're asking specific
questions about shard routing because you think it'll
solve some problem without telling us what the problem
is.

Best,
Erick

On Wed, Sep 2, 2015 at 7:47 AM, scott chu  wrote:
> I post a question on Stackoverflow 
> http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud:
> However, since this is a mail-list, I repost the question below to request 
> for suggestion and more subtle concept of SolrCloud's behavior on document 
> routing.
> I want to establish a SolrCloud cluster for over 10 million news 
> articles. After reading this article in the Apache Solr Reference Guide: Shards 
> and Indexing Data in SolrCloud, I have a plan as follows:
> Add prefix ED2001! to document ID where ED means some newspaper source and 
> 2001 is the year part in published date of news article, i.e. I want to put 
> all news articles of a specific newspaper source published in a specific year to 
> a shard.
> Create collection with router.name set to compositeID.
> Add documents?
> Query Collection?
> Practically, I got some questions:
> How to add documents based on this plan? Do I have to specify special 
> parameters when updating the collection/core?
> Is this called "custom sharding"? If not, what is "custom sharding"?
> Is auto sharding a better choice for my case since there's a shard-splitting 
> feature for auto sharding when the shard is too big?
> Can I query without _router_ parameter?
> EDIT @ 2015/9/2:
> This is how I think SolrCloud will behave: "The amount of news articles of 
> a specific newspaper source in a specific year tends to be around a fixed number, 
> e.g. Every year ED has around 80,000 articles, so each shard's size won't 
> increase dramatically. For the next year's news articles of ED, I only have 
> to add prefix 'ED2016!' to document ID, SolrCloud will create a new shard for 
> me (which contains all ED2016 articles), and later the Leader will spread the 
> replica of this new shard to other nodes (per replica per node other than 
> leader?)". Am I right? If yes, it seems no need for shard-splitting.




Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Zheng Lin Edwin Yeo
Hi Erick,

Yes, I'm trying out the De-Duplication too. But I'm facing a problem with
that: the indexing stops working once I put in the following
De-Duplication config in solrconfig.xml. The problem seems to be with the
<str name="update.chain">dedupe</str> line.

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">dedupe</str>
    </lst>
  </requestHandler>


  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">false</bool>
      <str name="fields">content</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
  </updateRequestProcessorChain>

Regards,
Edwin

On 2 September 2015 at 23:10, Erick Erickson 
wrote:

> Yes, that is an intentional limit for the size of a single token,
> which strings are.
>
> Why not use deduplication? See:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> You don't have to replace the existing documents, and Solr will
> compute a hash that can be used to identify identical documents
> and you can use _that_.
>
> Best
> Erick
>
> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
>  wrote:
> > Hi,
> >
> > I would like to check, is the string bytes must be at most 32766
> characters
> > in length?
> >
> > I'm trying to do a copyField of my rich-text documents content to a field
> > with fieldType=string to try out my getting distinct result for content,
> as
> > there are several documents with the exact same content, and we only want
> > to list one of them during searching.
> >
> > However, I get the following errors in some of the documents when I tried
> > to index them with the copyField. Some of my documents are quite large in
> > size, and there is a possibility that it exceed 32766 characters. Is
> there
> > any other ways to overcome this problem?
> >
> >
> > org.apache.solr.common.SolrException: Exception writing document id
> > collection1_polymer100 to the index; possible analysis error.
> > at
> >
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
> > at
> >
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> > at
> >
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> > at
> >
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
> > at
> >
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
> > at
> >
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
> > at
> >
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
> > at
> >
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> > at
> >
> org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
> > at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
> > at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
> > at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
> > at
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
> > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
> > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> > at
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> > at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> > at org.eclipse.jetty.server.Server.handle(Server.java:497)

Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread Erick Erickson
bq: Why do you say: "at 10M documents there's rarely a need to shard at all?"

Because I routinely see 50M docs on a single node and I've seen over 300M docs
on a single node with sub-second responses. So if you're saying that
you see poor
performance at 1M docs then I suspect there's something radically
wrong with your
setup. Too little memory, very bad query patterns, whatever. If my
suspicion is true,
then sharding will just mask the underlying problem.

You need to quantify your performance concerns. It's one thing to say
"my node satisfies 50 queries-per-second with 500ms response time" and
another to say "My queries take 5,000 ms".

In the first case, you do indeed need to add more servers to increase QPS if
you need 500 QPS. And adding more slaves is the best way to do that.
In the second, you need to understand the slowdown because sharding
will be a band-aid.

This might help:
https://wiki.apache.org/solr/SolrPerformanceProblems

Best,
Erick



On Wed, Sep 2, 2015 at 8:19 AM, scott chu  wrote:
>
> Hi solr-user,
>
> Do you mean I only have to put 10M documents in one index and copy it to
> many slaves in a classic Solr master-slave architecture to provide querying
> service on the internet, and it won't have an obvious downgrade of query
> performance? But I did add 1M documents into one index on the master and
> provide 2 slaves to serve querying service on internet, the query
> performance is kinda sad. Why do you say: "at 10M documents there's rarely a
> need to shard at all?" Do I provide too few slaves? What amount of documents
> is suitable for a need for shard in SolrCloud?
>
> - Original Message -
>
> From: Erick Erickson
> To: solr-user
> Date: 2015-09-02, 23:00:29
> Subject: Re: concept and choice: custom sharding or auto sharding?
>
> Frankly, at 10M documents there's rarely a need to shard at all.
> Why do you think you need to? This seems like adding
> complexity for no good reason. Sharding should only really
> be used when you have too many documents to fit on a single
> shard as it adds some overhead, restricts some
> possibilities (cross-core join for instance, a couple of
> grouping options don't work in distributed mode etc.).
>
> You can still run SolrCloud and have it manage multiple
> _replicas_ of a single shard for HA/DR.
>
> So this seems like an XY problem, you're asking specific
> questions about shard routing because you think it'll
> solve some problem without telling us what the problem
> is.
>
> Best,
> Erick
>
> On Wed, Sep 2, 2015 at 7:47 AM, scott chu  wrote:
>> I post a question on Stackoverflow
>> http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud:
>> However, since this is a mail-list, I repost the question below to request
>> for suggestion and more subtle concept of SolrCloud's behavior on document
>> routing.
>> I want to establish a SolrCloud cluster for over 10 million news
>> articles. After reading this article in the Apache Solr Reference Guide: Shards
>> and Indexing Data in SolrCloud, I have a plan as follows:
>> Add prefix ED2001! to document ID where ED means some newspaper source and
>> 2001 is the year part in published date of news article, i.e. I want to put
>> all news articles of a specific newspaper source published in a specific year
>> to a shard.
>> Create collection with router.name set to compositeID.
>> Add documents?
>> Query Collection?
>> Practically, I got some questions:
>> How to add documents based on this plan? Do I have to specify special
>> parameters when updating the collection/core?
>> Is this called "custom sharding"? If not, what is "custom sharding"?
>> Is auto sharding a better choice for my case since there's a
>> shard-splitting feature for auto sharding when the shard is too big?
>> Can I query without _router_ parameter?
>> EDIT @ 2015/9/2:
>> This is how I think SolrCloud will behave: "The amount of news articles of
>> a specific newspaper source in a specific year tends to be around a fixed number,
>> e.g. Every year ED has around 80,000 articles, so each shard's size won't
>> increase dramatically. For the next year's news articles of ED, I only have
>> to add prefix 'ED2016!' to document ID, SolrCloud will create a new shard
>> for me (which contains all ED2016 articles), and later the Leader will
>> spread the replica of this new shard to other nodes (per replica per node
>> other than leader?)". Am I right? If yes, it seems no need for
>> shard-splitting.
>
>
>
>
>
>
>


Solr Join support in Multiple Shard

2015-09-02 Thread Maulin Rathod
As per this link (http://wiki.apache.org/solr/Join), Solr Join is supported
only for cores in a single shard. Is there any plan to support join across
cores in multiple shards?
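
(For context, the single-shard join being referred to is the {!join} query parser; a sketch, with core 
and field names as placeholders:

  curl 'http://localhost:8983/solr/products/select?q={!join+from=manu_id+to=id+fromIndex=manufacturers}name:widget'

The fromIndex core has to live on the same node as the core being queried, which is why it does not 
work across shards.)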


Re: 'missing content stream' issuing expungeDeletes=true

2015-09-02 Thread Erick Erickson
bq: When we found out the document has a docfreq
of 2, we did a query on the document's product id and
indeed 2 documents were returned.
We suspect 1 of them is deleted but not removed from the index.

This is totally inconsistent with how Solr works _if_ these
documents had the same value for whatever field is defined
in your schema.xml as the <uniqueKey>, usually "id".
So how did you do your query? Through Solr or looking
at things at a low level with Lucene? It should not matter
whether you re-index from scratch or not. So I suspect there's
something else going on here.

bq: Each document (or product record) is unique in the collection.

then boosting on the unique value is probably not what you
really want to do. You have to already _know_ the values here
and you just want them at the top. And they're unique per doc.
Could you use the QueryElevationComponent? The original
intent of that component was a statically defined list, but you can
also provide a set of IDs as HTTP parameters; see
https://cwiki.apache.org/confluence/display/solr/The+Query+Elevation+Component
down near the bottom of that page.
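
(A sketch of the request-time form being described; the handler name depends on where the
QueryElevationComponent is configured, and the collection, query, and ids are placeholders:

  curl 'http://localhost:8983/solr/mycollection/elevate?q=some+query&enableElevation=true&elevateIds=doc1,doc2'

elevateIds/excludeIds take uniqueKey values per request instead of relying only on the static
elevate.xml entries.)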

Now, all that said if you're indexing only occasionally (say
once a day), then optimizing is not a bad way to go. And having
only 6M docs means it won't take all that long.

Best,
Erick

On Wed, Sep 2, 2015 at 12:59 AM, Derek Poh  wrote:
> There are around 6+ million documents in the collection.
>
> Each document (or product record) is unique in the collection.
> When we found out the document has a docFreq of 2, we did a query on the
> document's product id and indeed 2 documents were returned.
> We suspect 1 of them is deleted but not removed from the index. We tried
> optimizing. Only 1 document is returned when we query again, and the document's
> docFreq is 1.
>
> We checked the source data and the document is not duplicated.
> It could be the way we index (full index every time) that results in this
> scenario of having 2 copies of the same document in the index.
>
> On 9/2/2015 12:11 PM, Erick Erickson wrote:
>>
>> How many documents total are in your corpus? And how many do you
>> intend to have?
>>
>> My point is that if you are testing this with a small corpus, the results
>> are very likely different than when you test on a reasonable corpus.
>> So if you expect your "real" index will contain many more docs than
>> what you're testing, this is likely a red herring.
>>
>> But something isn't making a lot of sense here. You say you've traced it
>> to having a docfreq of 2 that changes to 1. But that means that the
>> value is unique in your entire corpus, which kind of indicates you're
>> trying to boost on unique values which is unusual.
>>
>> If you're confident in your model though, the only way to guarantee
>> what you want is to optimize/expungeDeletes.
>>
>> Best,
>> Erick
>>
>> On Tue, Sep 1, 2015 at 7:51 PM, Derek Poh  wrote:
>>>
>>> Erick
>>>
>>> Yes, we see documents changing their position in the list due to having
>>> deleted docs.
>>> In our search result, we apply a higher boost (bq) to a group of matched
>>> documents to have them display at the top tier of the result.
>>> At times 1 or 2 of these documents are not returned in the top tier; they
>>> are
>>> relegated down to the lower tier of the result. We discovered that these
>>> documents have a lower score due to docFreq=2.
>>> After we do an optimize, these 1-2 documents are back in the top tier
>>> result
>>> order and their docFreq is 1.
>>>
>>>
>>>
>>> On 9/1/2015 11:40 PM, Erick Erickson wrote:

 Derek:

 Why do you care? What evidence do you have that this matters
 _practically_?

 If you've looked at scoring with a small number of documents, you'll see
 significant
 differences due to deleted documents. In most cases, as you get a larger
 number
 of documents the ranking of documents in an index with no deletions .vs.
 indexes
 that have deletions is usually not noticeable.

 I'm suggesting that this is a red herring. Your specific situation may
 be different
 of course, but since scoring is really only about ranking docs
 relative to each other,
 unless the relative positions change enough to be noticeable it's not a
 problem.

 Note that I'm saying "relative rankings", NOT "absolute score". Document
 scores
 have no meaning outside comparisons to other docs _in the same query_.
 So
 unless you see documents changing their position in the list due to
 having deleted
 docs, it's not worth spending time on IMO.

 Best,
 Erick

 On Tue, Sep 1, 2015 at 12:45 AM, Upayavira  wrote:
>
> I wonder if this resolves it [1]. It has been applied to trunk, but not
> to the 5.x release branch.
>
> If you needed it in 5.x, I wonder if there's a way that particular
> choice could be made configurable.
>
> Upayavira
>
> [1] 

Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread scott chu
 
Hi solr-user,

Thanks! I'll go back to check my old environment and that article is really 
helpful.

BTW, I think I got it wrong about compositeID. In the reference guide, it says 
compositeID needs numShards. That means what I describe in question 5 seems 
wrong, because I intend to plan one shard per whole year of news articles, and I 
thought SolrCloud would create a new shard for me by itself when I add a new year's 
articles. But since compositeID needs numShards to be specified first, there's no 
way I can know in advance how many years I will put in SolrCloud. It looks 
like if I want to use SolrCloud after all, I may have to use auto sharding (i.e. 
the implicit router).
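
(For reference, with the compositeId router the routing prefix is simply part of the document id, so 
indexing a 2001 ED article would look roughly like this; the collection and fields are placeholders:

  curl 'http://localhost:8983/solr/news/update?commit=true' -H 'Content-Type: application/json' --data-binary '[{"id":"ED2001!12345","title":"some article"}]'

The hash of the "ED2001" prefix picks one of the shards created up front; compositeId never creates 
new shards on its own, which is the numShards limitation mentioned above.)
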
- Original Message - 
From: Erick Erickson 
To: solr-user 
Date: 2015-09-02, 23:30:53
Subject: Re: Re: concept and choice: custom sharding or auto sharding?


bq: Why do you say: "at 10M documents there's rarely a need to shard at all?"

Because I routinely see 50M docs on a single node and I've seen over 300M docs
on a single node with sub-second responses. So if you're saying that
you see poor
performance at 1M docs then I suspect there's something radically
wrong with your
setup. Too little memory, very bad query patterns, whatever. If my
suspicion is true,
then sharding will just mask the underlying problem.

You need to quantify your performance concerns. It's one thing to say
"my node satisfies 50 queries-per-second with 500ms response time" and
another to say "My queries take 5,000 ms".

In the first case, you do indeed need to add more servers to increase QPS if
you need 500 QPS. And adding more slaves is the best way to do that.
In the second, you need to understand the slowdown because sharding
will be a band-aid.

This might help:
https://wiki.apache.org/solr/SolrPerformanceProblems

Best,
Erick



On Wed, Sep 2, 2015 at 8:19 AM, scott chu  wrote:
>
> Hi solr-user,
>
> Do you mean I only have to put 10M documents in one index and copy it to

> many slaves in a classic Solr master-slave architecture to provide querying
> service on the internet, and it won't have an obvious downgrade of query
> performance? But I did add 1M documents into one index on the master and

> provide 2 slaves to serve querying service on internet, the query
> performance is kinda sad. Why do you say: "at 10M documents there's rarely a
> need to shard at all?" Do I provide too few slaves? What amount of documents
> is suitable for a need for shard in SolrCloud?
>
> - Original Message -
>
> From: Erick Erickson
> To: solr-user
> Date: 2015-09-02, 23:00:29
> Subject: Re: concept and choice: custom sharding or auto sharding?
>
> Frankly, at 10M documents there's rarely a need to shard at all.
> Why do you think you need to? This seems like adding
> complexity for no good reason. Sharding should only really
> be used when you have too many documents to fit on a single
> shard as it adds some overhead, restricts some
> possibilities (cross-core join for instance, a couple of
> grouping options don't work in distributed mode etc.).
>
> You can still run SolrCloud and have it manage multiple
> _replicas_ of a single shard for HA/DR.
>
> So this seems like an XY problem, you're asking specific
> questions about shard routing because you think it'll
> solve some problem without telling us what the problem
> is.
>
> Best,
> Erick
>
> On Wed, Sep 2, 2015 at 7:47 AM, scott chu  wrote:
>> I post a question on Stackoverflow
>> http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud:
>> However, since this is a mail-list, I repost the question below to request
>> for suggestion and more subtle concept of SolrCloud's behavior on document
>> routing.
>> I want to establish a SolrCloud cluster for over 10 million news
>> articles. After reading this article in the Apache Solr Reference Guide: Shards
>> and Indexing Data in SolrCloud, I have a plan as follows:
>> Add prefix ED2001! to document ID where ED means some newspaper source and
>> 2001 is the year part in published date of news article, i.e. I want to put
>> all news articles of a specific newspaper source published in a specific year
>> to a shard.
>> Create collection with router.name set to compositeID.
>> Add documents?
>> Query Collection?
>> Practically, I got some questions:
>> How to add documents based on this plan? Do I have to specify special
>> parameters when updating the collection/core?
>> Is this called "custom sharding"? If not, what is "custom sharding"?
>> Is auto sharding a better choice for my case since there's a
>> shard-splitting feature for auto sharding when the shard is too big?
>> Can I query without _router_ parameter?
>> EDIT @ 2015/9/2:
>> This is how I think SolrCloud will behave: "The amount of news articles of
>> a specific newspaper source in a specific year tends to be around a fixed number,
>> e.g. Every year ED has around 80,000 articles, so each shard's size won't
>> increase dramatically. For the next year's news 

Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Erick Erickson
_How_ does it fail? You must be seeing something in the logs



On Wed, Sep 2, 2015 at 8:29 AM, Zheng Lin Edwin Yeo
 wrote:
> Hi Erick,
>
> Yes, I'm trying out the De-Duplication too. But I'm facing a problem with
> that, which is that the indexing stops working once I put in the following
> De-Duplication code in solrconfig.xml. The problem seems to be with this
> <str name="update.chain">dedupe</str> line.
>
>   <requestHandler name="/update" class="solr.UpdateRequestHandler">
>     <lst name="defaults">
>       <str name="update.chain">dedupe</str>
>     </lst>
>   </requestHandler>
>
>
>   <updateRequestProcessorChain name="dedupe">
>     <processor class="solr.processor.SignatureUpdateProcessorFactory">
>       <bool name="enabled">true</bool>
>       <str name="signatureField">signature</str>
>       <bool name="overwriteDupes">false</bool>
>       <str name="fields">content</str>
>       <str name="signatureClass">solr.processor.Lookup3Signature</str>
>     </processor>
>   </updateRequestProcessorChain>
>
>
> Regards,
> Edwin
>
> On 2 September 2015 at 23:10, Erick Erickson 
> wrote:
>
>> Yes, that is an intentional limit for the size of a single token,
>> which strings are.
>>
>> Why not use deduplication? See:
>> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>>
>> You don't have to replace the existing documents, and Solr will
>> compute a hash that can be used to identify identical documents
>> and you can use _that_.
>>
>> Best
>> Erick
>>
>> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
>>  wrote:
>> > Hi,
>> >
>> > I would like to check, is the string bytes must be at most 32766
>> characters
>> > in length?
>> >
>> > I'm trying to do a copyField of my rich-text documents content to a field
>> > with fieldType=string to try out my getting distinct result for content,
>> as
>> > there are several documents with the exact same content, and we only want
>> > to list one of them during searching.
>> >
>> > However, I get the following errors in some of the documents when I tried
>> > to index them with the copyField. Some of my documents are quite large in
>> > size, and there is a possibility that it exceed 32766 characters. Is
>> there
>> > any other ways to overcome this problem?
>> >
>> >
>> > org.apache.solr.common.SolrException: Exception writing document id
>> > collection1_polymer100 to the index; possible analysis error.
>> > at
>> >
>> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
>> > at
>> >
>> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>> > at
>> >
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> > at
>> >
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
>> > at
>> >
>> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
>> > at
>> >
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
>> > at
>> >
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
>> > at
>> >
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> > at
>> >
>> org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
>> > at
>> >
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
>> > at
>> >
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
>> > at
>> >
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
>> > at
>> >
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> > at
>> >
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
>> > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>> > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
>> > at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
>> > at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>> > at
>> >
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>> > at
>> >
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>> > at
>> >
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>> > at
>> >
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>> > at
>> >
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>> > at
>> >
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>> > at
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>> > at
>> >
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>> > at
>> >
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>> > at
>> >
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>> > at
>> >
>> 

Re: Solr Join support in Multiple Shard

2015-09-02 Thread Erick Erickson
It's been discussed, but it's likely to have performance
problems I'd guess.

That said, there's some very interesting stuff being
done with Streaming Aggregation and, built on top
of that Parallel SQL. But how that applies to your
use-case I don't know.

Best,
Erick

On Wed, Sep 2, 2015 at 9:05 AM, Maulin Rathod  wrote:
> As per this link (http://wiki.apache.org/solr/Join) Solr Join is supported
> only for cores in single shard. Is there any plan to support Join Across
> cores in Multiple Shard?


Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread Shawn Heisey
On 9/2/2015 9:19 AM, scott chu wrote:
> Do you mean I only have to put 10M documents in one index and copy
> it to many slaves in a classic Solr master-slave architecture to
> provide a query service on the internet, and it won't have an obvious
> degradation of query performance? But I did add 1M documents into
> one index on the master and provide 2 slaves to serve queries on the
> internet, and the query performance is kinda sad. Why do you say: "at 10M
> documents there's rarely a need to shard at all?" Do I provide too few
> slaves? What amount of documents justifies a need to shard in
> SolrCloud?

Lucene has exactly one hard and unbreakable limit, and it is the number
of documents you can have in a single index (core/shard for Solr).  That
limit is just over 2.1 billion documents.  The actual limiting factor is
the maximum value of an integer in Java.  Because deleted documents are
counted when this limit is considered, you shouldn't go over 1 billion
active documents per shard, but the *practical* recommendation for shard
size is much lower than that.

For various reasons, some of which are very technical and boring, the
general advice is to not exceed about 100 million documents per shard. 
Some setups can handle more docs per shard, some require a lot less. 
There are no quick answers or hard rules.  You may have been given this
URL before:

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

There are sometimes reasons to shard a very small index.  This is the
correct path when the index is not very busy and you want to take
advantage of a surplus of CPU power to make each query faster.  In this
situation, you will probably end up with multiple shards per server, so
when a query comes in, multiple CPUs on the machine can handle the shard
queries in parallel.  If the index is handling a lot of requests per
second, then you want that CPU power used for handling the load, not
speeding up a single query.  For high-load situations, one shard per
physical server is desirable.

Thanks,
Shawn



Rules for pre-processing queries

2015-09-02 Thread Siamak Rowshan
Hi all, I need to refine my search results by adding parameters to search query 
parameters. For example, if user enters "ipad", I want to add a filter query 
such as ("category=tablets") to refine the search results. I thought a more 
general solution would be to define rules, that examine the query parameter 
values, and can alter or add to the query parameters. Short of writing custom 
code, are there any features within Solr or add-on tools that can do something 
like this?

Regards,
Mak


Siamak Rowshan | Software Engineer
Softmart | 450 Acorn Lane Downingtown, PA 19335
P   | 888-763-8627
siamak.rows...@softmart.com








Cannot search on special characters such as $ or &

2015-09-02 Thread Steven White
Hi Everyone,

I have the following in my schema:

  

  
  
  
  
  
  
  
  
  

  

In the text file "wdfftypes.txt", I have this:

  & => DIGIT
  $ => DIGIT

I also tried:

  & => ALPHA
  $ => ALPHA

I then index data that contains the string: "~ ! @ # $ % ^ & * ( ) _ + - =
[ { ] } \ | ; : ' " , < . > / ?"

But yet when I search on $ or &, I don't get any hit.  Any idea what I'm
doing wrong?

Thanks in advance.

Steve


Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Alexandre Rafalovitch
And that's because you have an incomplete chain. If you look at the
full example in solrconfig.xml, it shows:
 <updateRequestProcessorChain name="dedupe">
   <processor class="solr.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">id</str>
     <bool name="overwriteDupes">false</bool>
     <str name="fields">name,features,cat</str>
     <str name="signatureClass">solr.processor.Lookup3Signature</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>


Notice the last two processors. If you don't have those, nothing gets
indexed. Your chain is missing them, for whatever reason. Try adding
them back in, reloading the core and reindexing.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 2 September 2015 at 11:29, Zheng Lin Edwin Yeo  wrote:
> Hi Erick,
>
> Yes, i'm trying out the De-Duplication too. But I'm facing a problem with
> that, which is the indexing stops working once I put in the following
> De-Duplication code in solrconfig.xml. The problem seems to be with this  name="update.chain">dedupe line.
>
>   
>   
> dedupe
>   
>   
>
>
> 
>   
> true
> signature
> false
> content
> solr.processor.Lookup3Signature
>   
> 
>
>
> Regards,
> Edwin
>
> On 2 September 2015 at 23:10, Erick Erickson 
> wrote:
>
>> Yes, that is an intentional limit for the size of a single token,
>> which strings are.
>>
>> Why not use deduplication? See:
>> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>>
>> You don't have to replace the existing documents, and Solr will
>> compute a hash that can be used to identify identical documents
>> and you can use_that_.
>>
>> Best
>> Erick
>>
>> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
>>  wrote:
>> > Hi,
>> >
>> > I would like to check, is the string bytes must be at most 32766
>> characters
>> > in length?
>> >
>> > I'm trying to do a copyField of my rich-text documents content to a field
>> > with fieldType=string to try out my getting distinct result for content,
>> as
>> > there are several documents with the exact same content, and we only want
>> > to list one of them during searching.
>> >
>> > However, I get the following errors in some of the documents when I tried
>> > to index them with the copyField. Some of my documents are quite large in
>> > size, and there is a possibility that it exceed 32766 characters. Is
>> there
>> > any other ways to overcome this problem?
>> >
>> >
>> > org.apache.solr.common.SolrException: Exception writing document id
>> > collection1_polymer100 to the index; possible analysis error.
>> > at
>> >
>> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
>> > at
>> >
>> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>> > at
>> >
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> > at
>> >
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
>> > at
>> >
>> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
>> > at
>> >
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
>> > at
>> >
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
>> > at
>> >
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> > at
>> >
>> org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
>> > at
>> >
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
>> > at
>> >
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
>> > at
>> >
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
>> > at
>> >
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> > at
>> >
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
>> > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>> > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
>> > at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
>> > at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>> > at
>> >
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>> > at
>> >
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>> > at
>> >
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>> > at
>> >
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>> > at
>> >
>> 

Re: which solrconfig.xml

2015-09-02 Thread Chris Hostetter
: various $HOME/solr-5.3.0 subdirectories.  The documents/tutorials say to edit
: the solrconfig.xml file for various configuration details, but they never say
: which one of these dozen to edit.  Moreover, I cannot determine which version

can you please give us specific examples (ie: urls, page numbers & 
version of the ref guide, etc...) of documentation that tells you to edit 
the solrconfig.xml w/o being explicit about where to find it, so that we 
can fix the docs?

FWIW: The official "Quick Start" tutorial does not mention editing 
solrconfig.xml at all...

http://lucene.apache.org/solr/quickstart.html



-Hoss
http://www.lucidworks.com/


Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Zheng Lin Edwin Yeo
Hi Erick,

I couldn't really find anything special in the logs. The indexing process
just went on normally, but after that when I check the index, there is
nothing indexed.

This is what I see from the logs. Looks the same as when the indexing works
fine.

INFO  - 2015-09-03 01:24:35.316; [collection1 shard1 core_node2
collection1] org.apache.solr.handler.extraction.SolrContentHandler; Content
1 = content
INFO  - 2015-09-03 01:24:35.319; [collection1 shard1 core_node2
collection1] org.apache.solr.handler.extraction.SolrContentHandler; Content
2 = content
INFO  - 2015-09-03 01:24:35.482; [collection1 shard1 core_node1
collection1_shard1_replica2] org.apache.solr.core.SolrCore;
[collection1_shard1_replica2] webapp=/solr path=/update
params={update.distrib=FROMLEADER=
http://localhost:8983/solr/collection1/=javabin=2
} status=0
QTime=4
INFO  - 2015-09-03 01:24:35.483; [collection1 shard1 core_node2
collection1] org.apache.solr.update.processor.LogUpdateProcessor;
[collection1] webapp=/solr path=/update/extract
params={literal.geolocation=1,103=0=cat=Growhill&
resource.name
=C:\Users\edwin_000\Desktop\edwin\solr-5.2.1\IndexingDocuments\collection1\cat.pdf&
literal.id=collection1_cat=Singapore=VIP_cat=test3=science=8=5=edwin=science=C:\Users\edwin_000\Desktop\edwin\solr-5.2.1\IndexingDocuments\collection1\cat.pdf_subcat=test3=Public}
{add=[collection1_cat (1511253318382387200)]} 0 437
INFO  - 2015-09-03 01:24:36.218; [collection1 shard1 core_node2
collection1] org.apache.solr.handler.extraction.SolrContentHandler; Content
1 = content
INFO  - 2015-09-03 01:24:36.225; [collection1 shard1 core_node2
collection1] org.apache.solr.handler.extraction.SolrContentHandler; Content
2 = content
INFO  - 2015-09-03 01:24:36.487; [collection1 shard1 core_node1
collection1_shard1_replica2] org.apache.solr.core.SolrCore;
[collection1_shard1_replica2] webapp=/solr path=/update
params={update.distrib=FROMLEADER=
http://localhost:8983/solr/collection1/=javabin=2
} status=0
QTime=6


Regards,
Edwin


On 2 September 2015 at 23:34, Erick Erickson 
wrote:

> _How_ does it fail? You must be seeing something in the logs
>
>
>
> On Wed, Sep 2, 2015 at 8:29 AM, Zheng Lin Edwin Yeo
>  wrote:
> > Hi Erick,
> >
> > Yes, i'm trying out the De-Duplication too. But I'm facing a problem with
> > that, which is the indexing stops working once I put in the following
> > De-Duplication code in solrconfig.xml. The problem seems to be with this
>  > name="update.chain">dedupe line.
> >
> >   
> >   
> > dedupe
> >   
> >   
> >
> >
> > 
> >   
> > true
> > signature
> > false
> > content
> > solr.processor.Lookup3Signature
> >   
> > 
> >
> >
> > Regards,
> > Edwin
> >
> > On 2 September 2015 at 23:10, Erick Erickson 
> > wrote:
> >
> >> Yes, that is an intentional limit for the size of a single token,
> >> which strings are.
> >>
> >> Why not use deduplication? See:
> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication
> >>
> >> You don't have to replace the existing documents, and Solr will
> >> compute a hash that can be used to identify identical documents
> >> and you can use_that_.
> >>
> >> Best
> >> Erick
> >>
> >> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
> >>  wrote:
> >> > Hi,
> >> >
> >> > I would like to check, is the string bytes must be at most 32766
> >> characters
> >> > in length?
> >> >
> >> > I'm trying to do a copyField of my rich-text documents content to a
> field
> >> > with fieldType=string to try out my getting distinct result for
> content,
> >> as
> >> > there are several documents with the exact same content, and we only
> want
> >> > to list one of them during searching.
> >> >
> >> > However, I get the following errors in some of the documents when I
> tried
> >> > to index them with the copyField. Some of my documents are quite
> large in
> >> > size, and there is a possibility that it exceed 32766 characters. Is
> >> there
> >> > any other ways to overcome this problem?
> >> >
> >> >
> >> > org.apache.solr.common.SolrException: Exception writing document id
> >> > collection1_polymer100 to the index; possible analysis error.
> >> > at
> >> >
> >>
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
> >> > at
> >> >
> >>
> 

Re: Cannot search on special characters such as $ or

2015-09-02 Thread Erick Erickson
The Admin/Analysis page is your friend.
On a quick test $ and & never make it past
StandardTokenizerFactory

Best,
Erick

On Wed, Sep 2, 2015 at 5:17 PM, Steven White  wrote:
> Hi Everyone,
>
> I have the following in my schema:
>
>positionIncrementGap="100" autoGeneratePhraseQueries="true">
> 
>   
>synonyms="synonyms.txt" ignoreCase="true"/>
>ignoreCase="true"/>
>generateNumberParts="1" splitOnCaseChange="0" catenateWords="1"
> splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
> catenateAll="1" catenateNumbers="1" types="wdfftypes.txt"/>
>   
>   
>protected="protwords.txt"/>
>   
>   
> 
>   
>
> In the text file "wdfftypes.txt", I have this:
>
>   & => DIGIT
>   $ => DIGIT
>
> I also tried:
>
>   & => ALPHA
>   $ => ALPHA
>
> I then index data that contains the string: "~ ! @ # $ % ^ & * ( ) _ + - =
> [ { ] } \ | ; : ' " , < . > / ?"
>
> But yet when I search on $ or &, I don't get any hit.  Any idea what I'm
> doing wrong?
>
> Thanks in advance.
>
> Steve


Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Zheng Lin Edwin Yeo
Hi Alexandre,

Thanks for pointing out the error. I'm able to get the documents to be
indexed after adding in the two processors.

However, I'm still seeing all the similar documents being search in the
content without being de-duplicated. My content is currently indexed as
fieldType=text_general.

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">content</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


Regards,
Edwin


On 3 September 2015 at 09:46, Alexandre Rafalovitch 
wrote:

> And that's because you have an incomplete chain. If you look at the
> full example in solrconfig.xml, it shows:
>  
>
>  true
>  id
>  false
>  name,features,cat
>  solr.processor.Lookup3Signature
>
>
>
>  
>
>
> Notice, the last two processors. If you don't have those, nothing gets
> indexed. You chain is missing them, for whatever reasons. Try adding
> them back in, reloading the core and reindexing.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 2 September 2015 at 11:29, Zheng Lin Edwin Yeo 
> wrote:
> > Hi Erick,
> >
> > Yes, i'm trying out the De-Duplication too. But I'm facing a problem with
> > that, which is the indexing stops working once I put in the following
> > De-Duplication code in solrconfig.xml. The problem seems to be with this
>  > name="update.chain">dedupe line.
> >
> >   
> >   
> > dedupe
> >   
> >   
> >
> >
> > 
> >   
> > true
> > signature
> > false
> > content
> > solr.processor.Lookup3Signature
> >   
> > 
> >
> >
> > Regards,
> > Edwin
> >
> > On 2 September 2015 at 23:10, Erick Erickson 
> > wrote:
> >
> >> Yes, that is an intentional limit for the size of a single token,
> >> which strings are.
> >>
> >> Why not use deduplication? See:
> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication
> >>
> >> You don't have to replace the existing documents, and Solr will
> >> compute a hash that can be used to identify identical documents
> >> and you can use_that_.
> >>
> >> Best
> >> Erick
> >>
> >> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
> >>  wrote:
> >> > Hi,
> >> >
> >> > I would like to check, is the string bytes must be at most 32766
> >> characters
> >> > in length?
> >> >
> >> > I'm trying to do a copyField of my rich-text documents content to a
> field
> >> > with fieldType=string to try out my getting distinct result for
> content,
> >> as
> >> > there are several documents with the exact same content, and we only
> want
> >> > to list one of them during searching.
> >> >
> >> > However, I get the following errors in some of the documents when I
> tried
> >> > to index them with the copyField. Some of my documents are quite
> large in
> >> > size, and there is a possibility that it exceed 32766 characters. Is
> >> there
> >> > any other ways to overcome this problem?
> >> >
> >> >
> >> > org.apache.solr.common.SolrException: Exception writing document id
> >> > collection1_polymer100 to the index; possible analysis error.
> >> > at
> >> >
> >>
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >> > at
> >> >
> >>
> org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
> >> > at
> >> >
> >>
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
> >> > at
> >> >
> >>
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
> >> > at
> >> >
> >>
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
> >> > at
> >> >
> >>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >> > at
> >> >
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
> >> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
> >> > at 

Re: Position of Document in Listing (Search Result)

2015-09-02 Thread Erick Erickson
Well, you have access to the start parameter, isn't it just
start+(ordinal position in the page)?

Best,
Erick

On Wed, Sep 2, 2015 at 7:01 PM, Shayan Haque  wrote:
> Hi,
>
> I need to get a document position within a search result for a specific
> member, to show them where there result lie for a particular set of
> filters... I tried using a Solr-Ranking plugin but its outdated, version
> 3.5 compatible. Is there some other way?
> Ordinal ranking or any other thing.. the version I am using is solr 4.7.
> The last resort would be counting in the app.. but the issue is that would
> be extensive... it would mean running a Solr request for each listed item
> to get its position by looping through the results till X page or X limit
> ... can't traverse all data.
>
> Help plz.
>
>
> -Shayan
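As a rough sketch of the start-plus-ordinal idea above, the following SolrJ
snippet pages through a result set and reports the rank of the first document
belonging to a given member. The core URL, query, field names, member id and
the 1000-document cap are hypothetical; SolrJ 5.x is assumed (in 4.x the client
class would be HttpSolrServer):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class PositionFinder {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URL, query and field names.
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/cars");
            int rows = 100;
            long rank = -1;

            outer:
            for (int start = 0; start < 1000; start += rows) {       // cap the scan at 1000 docs
                SolrQuery q = new SolrQuery("make:Toyota AND model:Corolla")
                        .setStart(start)
                        .setRows(rows)
                        .setFields("id", "member_id");
                QueryResponse rsp = client.query(q);
                int i = 0;
                for (SolrDocument doc : rsp.getResults()) {
                    if ("member42".equals(doc.getFieldValue("member_id"))) {
                        rank = start + i + 1;                         // rank = start + ordinal in page
                        break outer;
                    }
                    i++;
                }
                if (start + rows >= rsp.getResults().getNumFound()) {
                    break;                                            // no more pages
                }
            }
            System.out.println(rank < 0 ? "Not in the first 1000 results" : "Rank: " + rank);
            client.close();
        }
    }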


Re: Difference between Legacy Facets and JSON Facets

2015-09-02 Thread Zheng Lin Edwin Yeo
> As far as I can see, JSON Facets does not have this delayed mapping
mechanism: Every increment requires a call to the segment->global-ordinal
map. With a large field this map cannot be in the fast caches. Combine this
with a gazillion references and it makes sense that JSON Facets is slower
in this scenario. A factor 20 sounds like way too much though. I would have
expected maybe 2.


I'm not sure if it is the really large content that causes this.
I have found that for some other fields, if I index them as String and the
value is more than 5 different words long, the JSON facet is slightly slower
than the legacy facet, but that is within your expected factor of 2 (Legacy
Facet QTime: 10, JSON Facet QTime: 25).

The content field is the only one with a factor of more than 20, as some of the
documents indexed are more than 200 pages long.

So should I say that in this case of faceting on a large content field,
using legacy facets is better than using the newer JSON facets, but for other,
shorter fields, JSON facets would be better?


Regards,
Edwin


On 3 September 2015 at 02:44, Toke Eskildsen  wrote:

> Yonik Seeley  wrote:
> > Hmmm, well something is really wrong for this orders of magnitude
> > difference.  I've never seen anything like that and we should
> > definitely try to get to the bottom of it.
>
> This might be a wild goose chase, but...
>
> Zheng states it is a text field with the content of fairly large
> documents. This means a high amount of unique values and a gazillion
> references from documents to those values.
>
> When incrementing counters for String faceting, segment ordinal -> index
> ordinal mapping takes place. Legacy facets has a mechanism where temporary
> segment-specific counters are used. These are updated directly with the
> segment ordinals and the mapping to global ordinals is performed after the
> counting.
>
> As far as I can see, JSON Facets does not have this delayed mapping
> mechanism: Every increment requires a call to the segment->global-ordinal
> map. With a large field this map cannot be in the fast caches. Combine this
> with a gazillion references and it makes sense that JSON Facets is slower
> in this scenario. A factor 20 sounds like way too much though. I would have
> expected maybe 2.
>
> - Toke Eskildsen
>


Re: Position of Document in Listing (Search Result)

2015-09-02 Thread Erick Erickson
It's entirely unclear what you mean by "position".

bq:  where for "make and model" his
first result comes

Comes in what? The search result list? Some
a-priori ordering of all the cars that has
nothing to do with this search? The results
list of everyone's cars that have the same
make and model? If this latter, what good
is this doing the user? On what criteria
are the make/model combination ordered so
that the position of the user's car has
meaning?

 Some concrete examples would help I think.

Best,
Erick


On Wed, Sep 2, 2015 at 9:42 PM, Shayan Haque  wrote:
> Thanks for the reply Erick.
>
> How do I get the position? I am searching on e.g. car model and make, and I
> want to show on which position the members's first car falls for that
> specific car model and make. So I tell solr, get listing for the cars with
> the model and make. I want from that result, if the member's first result
> is 208th document, could be 2008th.
>
> Well the web interface that I need to implement is different though. In
> actual the member will see all his cars with where for "make and model" his
> first result comes. If I get the above thing to work, I'd basically search
> solr for each of his car's make and model. It will be 12 or 24 cars per
> page, so it's not the best solution but still better if solr can give the
> position. Alternate would be to I have to loop all the results pr solr
> search, that would be even worse, 12 * 1+ results. I may even have
> limit the search to 1000 positions per make and model or some other cap and
> show user not the exact position if not found in the 1000 result.
>
>
> Maybe I am not thinking in the right direction using Solr for this. Can't
> think how I'd get the position any other way as the site's public pages use
> solr for listing.
>
>
>
> -
> Regards,
> Shayan
>
>
>
>
>
>
>
>
>
>
> On Thu, Sep 3, 2015 at 12:08 PM Erick Erickson 
> wrote:
>
>> Well, you have access to the start parameter, isn't it just
>> start+(ordinal position in the page)?
>>
>> Best,
>> Erick
>>
>> On Wed, Sep 2, 2015 at 7:01 PM, Shayan Haque  wrote:
>> > Hi,
>> >
>> > I need to get a document position within a search result for a specific
>> > member, to show them where there result lie for a particular set of
>> > filters... I tried using a Solr-Ranking plugin but its outdated, version
>> > 3.5 compatible. Is there some other way?
>> > Ordinal ranking or any other thing.. the version I am using is solr 4.7.
>> > The last resort would be counting in the app.. but the issue is that
>> would
>> > be extensive... it would mean running a Solr request for each listed item
>> > to get its position by looping through the results till X page or X limit
>> > ... can't traverse all data.
>> >
>> > Help plz.
>> >
>> >
>> > -Shayan
>>


Position of Document in Listing (Search Result)

2015-09-02 Thread Shayan Haque
Hi,

I need to get a document's position within a search result for a specific
member, to show them where their result lies for a particular set of
filters... I tried using a Solr-Ranking plugin but it's outdated, only
compatible with version 3.5. Is there some other way?
Ordinal ranking or anything else.. the version I am using is Solr 4.7.
The last resort would be counting in the app.. but the issue is that it would
be extensive... it would mean running a Solr request for each listed item
to get its position by looping through the results till page X or limit X
... can't traverse all data.

Help plz.


-Shayan


Re: Position of Document in Listing (Search Result)

2015-09-02 Thread Shayan Haque
Thanks for the reply Erick.

How do I get the position? I am searching on e.g. car model and make, and I
want to show at which position the member's first car falls for that
specific car model and make. So I tell Solr: get the listing for the cars with
that model and make. From that result, I want the position of the member's
first result; it might be the 208th document, or the 2008th.

The web interface that I need to implement is a bit different though. In
reality the member will see all of his cars, along with where his first result
comes for each "make and model". If I get the above to work, I'd basically
search Solr for each of his cars' make and model. It will be 12 or 24 cars per
page, so it's not the best solution, but still better if Solr can give the
position. The alternative would be to loop over all the results per Solr
search, which would be even worse, 12 * 1+ results. I may even have to
limit the search to 1000 positions per make and model, or some other cap, and
show the user an approximate position if it is not found in the first 1000
results.


Maybe I am not thinking in the right direction using Solr for this. I can't
think how I'd get the position any other way, as the site's public pages use
Solr for listing.



-
Regards,
Shayan










On Thu, Sep 3, 2015 at 12:08 PM Erick Erickson 
wrote:

> Well, you have access to the start parameter, isn't it just
> start+(ordinal position in the page)?
>
> Best,
> Erick
>
> On Wed, Sep 2, 2015 at 7:01 PM, Shayan Haque  wrote:
> > Hi,
> >
> > I need to get a document position within a search result for a specific
> > member, to show them where there result lie for a particular set of
> > filters... I tried using a Solr-Ranking plugin but its outdated, version
> > 3.5 compatible. Is there some other way?
> > Ordinal ranking or any other thing.. the version I am using is solr 4.7.
> > The last resort would be counting in the app.. but the issue is that
> would
> > be extensive... it would mean running a Solr request for each listed item
> > to get its position by looping through the results till X page or X limit
> > ... can't traverse all data.
> >
> > Help plz.
> >
> >
> > -Shayan
>


Re: Difference between Legacy Facets and JSON Facets

2015-09-02 Thread Toke Eskildsen
Yonik Seeley  wrote:
> Hmmm, well something is really wrong for this orders of magnitude
> difference.  I've never seen anything like that and we should
> definitely try to get to the bottom of it.

This might be a wild goose chase, but...

Zheng states it is a text field with the content of fairly large documents. 
This means a high amount of unique values and a gazillion references from 
documents to those values.

When incrementing counters for String faceting, segment ordinal -> index 
ordinal mapping takes place. Legacy facets has a mechanism where temporary 
segment-specific counters are used. These are updated directly with the segment 
ordinals and the mapping to global ordinals is performed after the counting.

As far as I can see, JSON Facets does not have this delayed mapping mechanism: 
Every increment requires a call to the segment->global-ordinal map. With a 
large field this map cannot be in the fast caches. Combine this with a 
gazillion references and it makes sense that JSON Facets is slower in this 
scenario. A factor 20 sounds like way too much though. I would have expected 
maybe 2.

- Toke Eskildsen


Re: is there any way to tell delete by query actually deleted anything?

2015-09-02 Thread Renee Sun
Shawn,
thanks for the reply.

I have a sharded index. When I re-index a document (as opposed to indexing a
new one, which is a different process), I need to delete the old copy first to
avoid dups. We all know that if there is only one core, the newly added
document will replace the old one, but with multiple core indexes we have to
issue the delete command first to ALL shards, since we do NOT know/remember
which core the old document was indexed to ...

I also wanted to know if there is a better way of handling this efficiently.

Anyway, we are sending the delete to all of this customer's cores; one of them
hits, the others do not.

But consequently, when I need to decide about the commit, I do NOT want to
blindly commit to all cores; I want to know which one actually had the old doc
so that I only send the commit to that core.

I could alternatively query first and skip the delete if there is no hit, but I
can't short-circuit after the first hit since we have dups :-( for
historical reasons.

Any suggestion on how to make this more efficient?
 
thanks!






--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-there-any-way-to-tell-delete-by-query-actually-deleted-anything-tp4226776p4226788.html
Sent from the Solr - User mailing list archive at Nabble.com.


which solrconfig.xml

2015-09-02 Thread Mark Fenbers
Hi,  I've been fiddling with Solr for two whole days since 
downloading/unzipping it.  I've learned a lot by reading 4 documents and 
the web site.  However, there are a dozen or so instances of 
solrconfig.xml in various $HOME/solr-5.3.0 subdirectories.  The 
documents/tutorials say to edit the solrconfig.xml file for various 
configuration details, but they never say which one of these dozen to 
edit.  Moreover, I cannot determine which version is being used once I 
start solr, so that I would know which instance of this file to 
edit/customize.


Can you help??

Thanks!
Mark


Re: is there any way to tell delete by query actually deleted anything?

2015-09-02 Thread Shawn Heisey
On 9/2/2015 1:30 PM, Renee Sun wrote:
> Is there an easy way for me to get the actually deleted document number? I
> mean if the query did not hit any documents, I want to know that nothing got
> deleted. But if it did hit documents, i would like to know how many were
> delete...

I do this by issuing the same query that I plan to use for the delete,
before doing the delete.  If numFound is zero, I don't do the delete. 
Either way I know how many docs are getting deleted.  Since the program
that does this is the only thing updating the index, I know that the
info is completely accurate.

Thanks,
Shawn
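
A minimal SolrJ sketch of this check-then-delete pattern, assuming SolrJ 5.x and
a hypothetical core URL and query:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class CheckedDelete {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URL and delete query.
            HttpSolrClient client = new HttpSolrClient("http://localhost:8080/solr/mycore");
            String q = "myfield:mycriteria";

            // Count the matches first; rows=0 avoids fetching any documents.
            long hits = client.query(new SolrQuery(q).setRows(0)).getResults().getNumFound();
            System.out.println(hits + " document(s) match and would be deleted");

            if (hits > 0) {
                client.deleteByQuery(q);   // delete (and commit) only when something matched
                client.commit();
            }
            client.close();
        }
    }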



is there any way to tell delete by query actually deleted anything?

2015-09-02 Thread Renee Sun
I run this curl trying to delete some messages :

curl
'http://localhost:8080/solr/mycore/update?commit=true=abacd'
| xmllint --format -

or

curl
'http://localhost:8080/solr/mycore/update?commit=true=myfield:mycriteria'
| xmllint --format -

the results I got is like:

  % Total% Received % Xferd  Average Speed   TimeTime Time 
Current
 Dload  Upload   Total   SpentLeft 
Speed
148   1480   1480 0  11402  0 --:--:-- --:--:-- --:--:--
14800


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">10</int>
  </lst>
</response>

Is there an easy way for me to get the actually deleted document number? I
mean if the query did not hit any documents, I want to know that nothing got
deleted. But if it did hit documents, i would like to know how many were
delete...

thanks
Renee



--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-there-any-way-to-tell-delete-by-query-actually-deleted-anything-tp4226776.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is there any way to tell delete by query actually deleted anything?

2015-09-02 Thread Mark Ehle
Do a search with the same criteria before and after?

On Wed, Sep 2, 2015 at 3:30 PM, Renee Sun  wrote:

> I run this curl trying to delete some messages :
>
> curl
> 'http://localhost:8080/solr/mycore/update?commit=true=
> abacd'
> | xmllint --format -
>
> or
>
> curl
> 'http://localhost:8080/solr/mycore/update?commit=true=
> myfield:mycriteria'
> | xmllint --format -
>
> the results I got is like:
>
>   % Total% Received % Xferd  Average Speed   TimeTime Time
> Current
>  Dload  Upload   Total   SpentLeft
> Speed
> 148   1480   1480 0  11402  0 --:--:-- --:--:-- --:--:--
> 14800
> 
> 
>   
> 0
> 10
>   
> 
>
> Is there an easy way for me to get the actually deleted document number? I
> mean if the query did not hit any documents, I want to know that nothing
> got
> deleted. But if it did hit documents, i would like to know how many were
> delete...
>
> thanks
> Renee
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/is-there-any-way-to-tell-delete-by-query-actually-deleted-anything-tp4226776.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Rules for pre-processing queries

2015-09-02 Thread Arcadius Ahouansou
Hello Siamak.

You may also want to have a look at 3 related articles, the 3rd part being:

http://lucidworks.com/blog/query-autofiltering-extended-language-logic-search/

I would start from the 1st part.

Hope this helps a bit.

Arcadius.

On 2 September 2015 at 21:09, Upayavira  wrote:

> Do you have a predefined list of such filters?
>
> You can do fun things with synonyms: define an ipad->tablet synonym, and
> use it at query time. Filter out all non-synonym terms in your query
> time analysis chain, and then use that field as a filter.
>
> Upayavira
>
> On Wed, Sep 2, 2015, at 09:07 PM, Siamak Rowshan wrote:
> > Hi all, I need to refine my search results by adding parameters to search
> > query parameters. For example, if user enters "ipad", I want to add a
> > filter query such as ("category=tablets") to refine the search results. I
> > thought a more general solution would be to define rules, that examine
> > the query parameter values, and can alter or add to the query parameters.
> > Short of writing custom code, are there any features within Solr or
> > add-on tools that can do something like this?
> >
> > Regards,
> > Mak
> >
> >
> > Siamak Rowshan | Software Engineer
> > Softmart | 450 Acorn Lane Downingtown, PA 19335
> > P   | 888-763-8627
> > siamak.rows...@softmart.com
> >
> > 
> > 
> >
> >
> >
>



-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---


Re: Rules for pre-processing queries

2015-09-02 Thread Upayavira
Do you have a predefined list of such filters?

You can do fun things with synonyms: define an ipad->tablet synonym, and
use it at query time. Filter out all non-synonym terms in your query
time analysis chain, and then use that field as a filter.

Upayavira

On Wed, Sep 2, 2015, at 09:07 PM, Siamak Rowshan wrote:
> Hi all, I need to refine my search results by adding parameters to search
> query parameters. For example, if user enters "ipad", I want to add a
> filter query such as ("category=tablets") to refine the search results. I
> thought a more general solution would be to define rules, that examine
> the query parameter values, and can alter or add to the query parameters.
> Short of writing custom code, are there any features within Solr or
> add-on tools that can do something like this?
> 
> Regards,
> Mak
> 
> 
> Siamak Rowshan | Software Engineer
> Softmart | 450 Acorn Lane Downingtown, PA 19335
> P   | 888-763-8627
> siamak.rows...@softmart.com
> 
> 
> 
> 
> 
> 


Re: is there any way to tell delete by query actually deleted anything?

2015-09-02 Thread Erick Erickson
bq: I have a sharded index. When I re-index a document (vs new index, which is
different process), I need to delete the old one first to avoid dup

No, you do not need to issue the delete in a sharded collection
_assuming_ that the doc has the same <uniqueKey>. Why
do you think you do? If it's in some doc somewhere we need
to fix it.

Docs are routed by a hash on the <uniqueKey> in the default
case. So since it goes to the same shard, the fact that it's a
new version will be detected and it'll replace the old version.

Are you seeing anything different?

Best,
Erick

On Wed, Sep 2, 2015 at 1:24 PM, Renee Sun  wrote:
> Shawn,
> thanks for the reply.
>
> I have a sharded index. When I re-index a document (vs new index, which is
> different process), I need to delete the old one first to avoid dup. We all
> know that if there is only one core, the newly added document will replace
> the old one, but with multiple core indexes, we will have to issue delete
> command first to ALL shards since we do NOT know/remember which core the old
> document was indexed to ...
>
> I also wanted to know if there is a better way handling this efficiently.
>
> Anyways, we are sending delete to all cores of this customer, one of them
> hit , others did not.
>
> But consequently, when I need to decide about commit, I do NOT want blindly
> commit to all cores, I want to know which one actually had the old doc so I
> only send commit to that core.
>
> I could alternatively use query first and skip if it did not hit, but delete
> if it does, and I can't short circuit since we have dups :-( based on a
> historical reason.
>
> any suggestion how to make this more efficiently?
>
> thanks!
>
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/is-there-any-way-to-tell-delete-by-query-actually-deleted-anything-tp4226776p4226788.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: is there any way to tell delete by query actually deleted anything?

2015-09-02 Thread Renee Sun
thanks Shawn...

On the other side, I have just created a thin-layer webapp that I deploy with
Solr/Tomcat. This webapp provides a RESTful API that allows all kinds of clients
in our system to call in and request a commit on a certain core on that Solr
server.

I put it in with the idea of having a central/final place to control commits on
the cores in the local Solr server.

So far it works by reducing arbitrary requests; for example, I will not allow 2
commit requests from different clients on the same core to happen too close to
each other, and I will disregard the second request if the first was done less
than 5 minutes ago.

I am thinking of enhancing this webapp to check the physical index directory
timestamp, and drop the request if the core has not changed since the last
commit. This will prevent a client from blindly trying to commit on all cold
cores when only one of them was actually updated.

What I mean to ask is: is there any Solr admin metadata I can fetch through a
RESTful API to get data such as the index's last updated time, or something
like that?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-there-any-way-to-tell-delete-by-query-actually-deleted-anything-tp4226776p4226818.html
Sent from the Solr - User mailing list archive at Nabble.com.


Merging documents from a distributed search

2015-09-02 Thread tedsolr
I've read from http://heliosearch.org/solrs-mergestrategy/ that the AnalyticsQuery
component only works for a single instance of Solr. I'm planning to
"migrate" to the SolrCloud soon and I have a custom AnalyticsQuery module
that collapses what I consider to be duplicate documents, keeping stats like
a "count" of the dupes. For my purposes "dupes" are determined at run time
and vary by the search request. Once a collection has multiple shards I will
not be able to prevent "dupes" from appearing across those shards. A custom
merge strategy should allow me to merge my stats, but I don't see how I can
drop duplicate docs at that point.

If shard1 returns docs A & B and shard2 returns docs B & C (letters denoting
what I consider to be unique docs), can my implementation of a merge
strategy return only docs A, B, & C, rather than A, B, B, & C?

thanks! 
solr 5.2.1



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Merging-documents-from-a-distributed-search-tp4226802.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is there any way to tell delete by query actually deleted anything?

2015-09-02 Thread Renee Sun
Hi Erick... as Shawn pointed out... I am not using solrcloud, I am using a
more complicated sharding scheme, home grown... 

thanks for your response :-)
Renee



--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-there-any-way-to-tell-delete-by-query-actually-deleted-anything-tp4226776p4226806.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is there any way to tell delete by query actually deleted anything?

2015-09-02 Thread Renee Sun
Hi Shawn,
I think we have similar structure where we use frontier/back instead of
hot/cold :-)

so yes we will probably have to do the same.

Since we have large customers, some of whom may have terabytes of data and end
up with hundreds of cold cores, blindly broadcasting the delete to all
of them is a performance killer.

I am thinking of adding an in-memory inventory of coreID : docID so I can
efficiently identify which core a document is in... what do you think
about it?

thanks
Renee



--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-there-any-way-to-tell-delete-by-query-actually-deleted-anything-tp4226776p4226805.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is there any way to tell delete by query actually deleted anything?

2015-09-02 Thread Shawn Heisey
On 9/2/2015 3:32 PM, Renee Sun wrote:
> I think we have similar structure where we use frontier/back instead of
> hot/cold :-)
>
> so yes we will probably have to do the same.
>
> since we have large customers and some of them may have tera bytes data and
> end up with hundreds of cold cores the blind delete broadcasting to all
> of them is a performance kill.
>
> I am thinking of adding a in-memory inventory of coreID : docID  so I can
> identify which core the document is in efficiently... what do you think
> about it?

I could write code for the deleteByQuery method to figure out where to
send the requests.  Performance hasn't become a problem with the "send
to all shards" method.  If it does, then I know exactly what to do:

If the ID value that we use for sharding is larger than X, it goes to
the hot shard.  If not, then I would CRC32 hash the ID, mod the hash
value by the number of cold shards, and send it to the shard number (0
through 5 for our indexes) that comes out.

Our sharding ID field is actually not our uniqueKey field for Solr,
although it is the autoincrement primary key on the source MySQL
database.  Another way to think about this field is as the "delete id". 
Our Solr uniqueKey is a different field that has a unique-enforcing
index in MySQL.

If you want good performance with sharding operations, then you need a
sharding algorithm that is completely deterministic based on the key
value and the current shard layout.  If the shard layout changes then it
should not change frequently.  Our layout changes only once a day, at
which time the oldest documents are moved from the hot shard to the cold
shards.

Thanks,
Shawn
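
A small sketch of the deterministic routing described above; the cold-shard
count matches the six cold shards mentioned, while the hot-shard cutoff value is
a hypothetical placeholder:

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class ShardRouter {
        private static final int COLD_SHARDS = 6;            // cold shards numbered 0..5
        private static final long HOT_CUTOFF = 50_000_000L;  // hypothetical: ids above this are "hot"

        /** Returns -1 for the hot shard, otherwise the cold shard number (0-5). */
        static int shardFor(long deleteId) {
            if (deleteId > HOT_CUTOFF) {
                return -1;                                    // newest documents live in the hot shard
            }
            CRC32 crc = new CRC32();
            crc.update(Long.toString(deleteId).getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % COLD_SHARDS);      // deterministic cold-shard choice
        }

        public static void main(String[] args) {
            System.out.println("id 123456 -> shard " + shardFor(123456L));
        }
    }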



Re: which solrconfig.xml

2015-09-02 Thread Alexandre Rafalovitch
Have you looked at Admin Web UI in details yet? When you look at the
"Overview" page, on the right hand side, it lists a bunch of
directories. You want one that says "Instance". Then, your
solrconfig.xml is in "conf" directory under that.

Regards,
   Alex.
P.s. Welcome!


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 2 September 2015 at 17:03, Mark Fenbers  wrote:
> Hi,  I've been fiddling with Solr for two whole days since
> downloading/unzipping it.  I've learned a lot by reading 4 documents and the
> web site.  However, there are a dozen or so instances of solrconfig.xml in
> various $HOME/solr-5.3.0 subdirectories.  The documents/tutorials say to
> edit the solrconfig.xml file for various configuration details, but they
> never say which one of these dozen to edit.  Moreover, I cannot determine
> which version is being used once I start solr, so that I would know which
> instance of this file to edit/customize.
>
> Can you help??
>
> Thanks!
> Mark


RE: Rules for pre-processing queries

2015-09-02 Thread Siamak Rowshan
Upayavira, wow! Didn’t think it'd work that well, and would be so easy to do! I 
do have a predefined list, so synonyms work great! Thanks!



Siamak Rowshan | Software Engineer
Softmart | 450 Acorn Lane Downingtown, PA 19335
P   | 888-763-8627
siamak.rows...@softmart.com






-Original Message-
From: Upayavira [mailto:u...@odoko.co.uk]
Sent: Wednesday, September 02, 2015 4:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Rules for pre-processing queries

Do you have a predefined list of such filters?

You can do fun things with synonyms: define an ipad->tablet synonym, and use it 
at query time. Filter out all non-synonym terms in your query time analysis 
chain, and then use that field as a filter.

Upayavira

On Wed, Sep 2, 2015, at 09:07 PM, Siamak Rowshan wrote:
> Hi all, I need to refine my search results by adding parameters to
> search query parameters. For example, if user enters "ipad", I want to
> add a filter query such as ("category=tablets") to refine the search
> results. I thought a more general solution would be to define rules,
> that examine the query parameter values, and can alter or add to the query 
> parameters.
> Short of writing custom code, are there any features within Solr or
> add-on tools that can do something like this?
>
> Regards,
> Mak
>
>
> Siamak Rowshan | Software Engineer
> Softmart | 450 Acorn Lane Downingtown, PA 19335
> P   | 888-763-8627
> siamak.rows...@softmart.com
>
> 
> 
>
>
>


Re: is there any way to tell delete by query actually deleted anything?

2015-09-02 Thread Shawn Heisey
On 9/2/2015 2:24 PM, Renee Sun wrote:
> I have a sharded index. When I re-index a document (vs new index, which is
> different process), I need to delete the old one first to avoid dup. We all
> know that if there is only one core, the newly added document will replace
> the old one, but with multiple core indexes, we will have to issue delete
> command first to ALL shards since we do NOT know/remember which core the old
> document was indexed to ... 
>
> I also wanted to know if there is a better way handling this efficiently.
>
> Anyways, we are sending delete to all cores of this customer, one of them
> hit , others did not.
>
> But consequently, when I need to decide about commit, I do NOT want blindly
> commit to all cores, I want to know which one actually had the old doc so I
> only send commit to that core.
>
> I could alternatively use query first and skip if it did not hit, but delete
> if it does, and I can't short circuit since we have dups :-( based on a
> historical reason. 
>
> any suggestion how to make this more efficiently?

I have a sharded index too.  It is a more complicated sharding mechanism
than you would get in a default SolrCloud install (and my servers are
NOT running in cloud mode).  It's a hot/cold shard system, with one hot
shard and six cold shards.  Even though the shard that contains any
given document is *always* something that can be calculated according to
a configuration that changes at most once a day, I send all deletes to
every shard like you do.  Each batch of documents in the delete list
(currently set to a batch size of 500) is sent to each shard.

The deleteByQuery method on my Core class (this is a java program)
queries the Solr core to see if any documents are found.  If they are,
then the delete request is sent to Solr.  Any successful Solr update
operation (add, delete, etc) will set a "commit" flag in the class
instance, which is checked by the commit method.  When a commit is
requested on the Core class, if the flag is true, a commit is sent to
Solr.  If the commit succeeds, the flag is cleared.

Thanks,
Shawn



Re: Merging documents from a distributed search

2015-09-02 Thread Joel Bernstein
The merge strategy probably won't work for the type of distributed collapse
you're describing.

You may want to begin exploring the Streaming API, which supports real-time
map/reduce operations:

http://joelsolr.blogspot.com/2015/03/parallel-computing-with-solrcloud.html

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Sep 2, 2015 at 5:12 PM, tedsolr  wrote:

> I've read from http://heliosearch.org/solrs-mergestrategy/ that the
> AnalyticsQuery component only works for a single instance of Solr. I'm
> planning to
> "migrate" to the SolrCloud soon and I have a custom AnalyticsQuery module
> that collapses what I consider to be duplicate documents, keeping stats
> like
> a "count" of the dupes. For my purposes "dupes" are determined at run time
> and vary by the search request. Once a collection has multiple shards I
> will
> not be able to prevent "dupes" from appearing across those shards. A custom
> merge strategy should allow me to merge my stats, but I don't see how I can
> drop duplicate docs at that point.
>
> If shard1 returns docs A & B and shard2 returns docs B & C (letters
> denoting
> what I consider to be unique docs), can my implementation of a merge
> strategy return only docs A, B, & C, rather than A, B, B, & C?
>
> thanks!
> solr 5.2.1
>
>
>
>


Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-02 Thread Noble Paul
I opened a ticket for the same
 https://issues.apache.org/jira/browse/SOLR-8004
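
For reference, once collection-admin-edit is enforced, a reload like the one
below (placeholder password; collection name taken from the earlier
security.json) should be rejected without credentials and succeed with -u;
the fact that the first call currently succeeds anyway is what SOLR-8004
tracks:

curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=inventory&wt=json"

curl -u admin:yourpassword \
  "http://localhost:8983/solr/admin/collections?action=RELOAD&name=inventory&wt=json"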

On Wed, Sep 2, 2015 at 1:36 PM, Kevin Lee  wrote:
> I’ve found that completely exiting Chrome or Firefox and opening it back up 
> re-prompts for credentials when they are required.  It was re-prompting with 
> the /browse path where authentication was working each time I completely 
> exited and started the browser again, however it won’t re-prompt unless you 
> exit completely and close all running instances so I closed all instances 
> each time to test.
>
> However, to make sure I ran it via the command line via curl as suggested and 
> it still does not give any authentication error when trying to issue the 
> command via curl.  I get a success response from all the Solr instances that 
> the reload was successful.
>
> Not sure why the pre-canned permissions aren’t working, but the one to the 
> request handler at the /browse path is.
>
>
>> On Sep 1, 2015, at 11:03 PM, Noble Paul  wrote:
>>
>> " However, after uploading the new security.json and restarting the
>> web browser,"
>>
>> The browser remembers your login , So it is unlikely to prompt for the
>> credentials again.
>>
>> Why don't you try the RELOAD operation using command line (curl) ?
>>
>> On Tue, Sep 1, 2015 at 10:31 PM, Kevin Lee  wrote:
>>> The restart issues aside, I’m trying to lockdown usage of the Collections 
>>> API, but that also does not seem to be working either.
>>>
>>> Here is my security.json.  I’m using the “collection-admin-edit” permission 
>>> and assigning it to the “adminRole”.  However, after uploading the new 
>>> security.json and restarting the web browser, it doesn’t seem to be 
>>> requiring credentials when calling the RELOAD action on the Collections 
>>> API.  The only thing that seems to work is the custom permission “browse” 
>>> which is requiring authentication before allowing me to pull up the page.  
>>> Am I using the permissions correctly for the RuleBasedAuthorizationPlugin?
>>>
>>> {
>>>"authentication":{
>>>   "class":"solr.BasicAuthPlugin",
>>>   "credentials": {
>>>"admin”:” ",
>>>"user": ” "
>>>}
>>>},
>>>"authorization":{
>>>   "class":"solr.RuleBasedAuthorizationPlugin",
>>>   "permissions": [
>>>{
>>>"name":"security-edit",
>>>"role":"adminRole"
>>>},
>>>{
>>>"name":"collection-admin-edit”,
>>>"role":"adminRole"
>>>},
>>>{
>>>"name":"browse",
>>>"collection": "inventory",
>>>"path": "/browse",
>>>"role":"browseRole"
>>>}
>>>],
>>>   "user-role": {
>>>"admin": [
>>>"adminRole",
>>>"browseRole"
>>>],
>>>"user": [
>>>"browseRole"
>>>]
>>>}
>>>}
>>> }
>>>
>>> Also tried adding the permission using the Authorization API, but no 
>>> effect, still isn’t protecting the Collections API from being invoked 
>>> without a username password.  I do see in the Solr logs that it sees the 
>>> updates because it outputs the messages “Updating /security.json …”, 
>>> “Security node changed”, “Initializing authorization plugin: 
>>> solr.RuleBasedAuthorizationPlugin” and “Authentication plugin class 
>>> obtained from ZK: solr.BasicAuthPlugin”.
>>>
>>> Thanks,
>>> Kevin
>>>
 On Sep 1, 2015, at 12:31 AM, Noble Paul  wrote:

 I'm investigating why restarts or first time start does not read the
 security.json

 On Tue, Sep 1, 2015 at 1:00 PM, Noble Paul  wrote:
> I removed that statement
>
> "If activating the authorization plugin doesn't protect the admin ui,
> how does one protect access to it?"
>
> One does not need to protect the admin UI. You only need to protect
> the relevant API calls . I mean it's OK to not protect the CSS and
> HTML stuff.  But if you perform an action to create a core or do a
> query through admin UI , it automatically will prompt you for
> credentials (if those APIs are protected)
>
> On Tue, Sep 1, 2015 at 12:41 PM, Kevin Lee  
> wrote:
>> Thanks for the clarification!
>>
>> So is the wiki page incorrect at
>> https://cwiki.apache.org/confluence/display/solr/Basic+Authentication+Plugin
>>  which says that the admin ui 

Re: Difference between Legacy Facets and JSON Facets

2015-09-02 Thread Yonik Seeley
On Wed, Sep 2, 2015 at 1:19 AM, Zheng Lin Edwin Yeo
 wrote:
> The type of field is text_general.

What are some typical values for this "content" field (i.e. how many
different words does the content field contain for each document)?

-Yonik

> I found that the problem mainly happen in the content field of the
> collections with rich text document.
> It works fine for other files, and also collections indexed with CSV
> documents, even if the fieldType is text_general.
>
> Regards,
> Edwin
>
>
> On 2 September 2015 at 12:12, Yonik Seeley  wrote:
>
>> On Tue, Sep 1, 2015 at 11:51 PM, Zheng Lin Edwin Yeo
>>  wrote:
>> > No, I've tested it several times after committing it.
>>
>> Hmmm, well something is really wrong for this orders of magnitude
>> difference.  I've never seen anything like that and we should
>> definitely try to get to the bottom of it.
>> What is the type of the field?
>>
>> -Yonik
>>


Re: Strange behavior of solr

2015-09-02 Thread Zheng Lin Edwin Yeo
Is there any error message in the log when Solr stops indexing the file at
line 2046?

Regards,
Edwin

On 2 September 2015 at 17:17, Long Yan  wrote:

> Hey,
> I have created a core with
> bin\solr create -c mycore
>
> I want to index the csv sample files from solr-5.2.1
>
> If I index film.csv under solr-5.2.1\example\films\, solr can only index
> this file until the line
> "2046,Wong Kar-wai,Romance Film|Fantasy|Science
> Fiction|Drama,,/en/2046_2004,2004-05-20"
>
> But if I at first index books.csv under solr-5.2.1\example\exampledocs and
> then index film.csv, solr can index all lines in film.csv
>
> Why?
>
> Regards
> Long Yan
>
>
>


Re: Please add me to SolrWiki contributors

2015-09-02 Thread Shawn Heisey
On 9/1/2015 11:28 PM, Gaurav Kumar wrote:
> I am working on writing an open source tool for the Solr Camel component; it 
> would be great if you could add me to the list of contributors.
> Also, I realized that you guys have upgraded the wiki to Solr 5.3, but we are 
> using Solr 4, and suddenly there is no information available for the 
> older version.
> Is there a way you guys can keep information about previous versions as well?
> My username is "GauravKumar"

You are now added to the Contributors Group on the Solr community wiki,
so you have full edit rights to nearly all of the wiki.

https://wiki.apache.org/solr/ContributorsGroup

Your message hinted at wanting more information ... so here's a whole
lot more info than you probably wanted:

Note that the community wiki is *not* the same thing as the official
reference guide.

The reference guide is carefully updated to only refer to the latest
version.  Currently it targets the 5.4 release, which is not out yet:

https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

Because the reference guide is published and released as official PDF
documentation, only Apache Lucene/Solr committers have edit rights for
the wiki that contains the ref guide.  Anyone can comment on the pages.
 User input is appreciated and frequently incorporated.

The Reference Guide was originally created and maintained by LucidWorks.
 They generously donated the entire guide to Apache.  The first Apache
release of the guide covered Solr version 4.4.0.  There are historical
releases of the guide for each minor release since that version:

http://archive.apache.org/dist/lucene/solr/ref-guide/

The community wiki is definitely not abandoned, but now that we have the
official reference guide, its purpose has changed.  It now exists to
hold supplemental documentation and tips/tricks that are not appropriate
for the reference guide, but should still be available to the community.
 This means that large sections of the wiki need to be removed, because
they contain the same info as the guide, but it is frequently outdated.

The transition of the community wiki to its new purpose is VERY slow ...
no major overhaul has been planned.  Users like you are the best hope
for keeping that wiki useful and relevant.

Regarding Solr 4:  Nearly zero committer time is spent on the code for
4.x releases, and very little time is spent on documentation in the
community wiki for those versions.  In the Apache SVN server, branch_4x
was removed and branch_5x created on September 18, 2014.  If extremely
serious bugs are found in the older version, they will be fixed in the
4.10 branch.  Depending on the estimated user impact for any bugs found,
you MIGHT see a new 4.10.x release.  At this point, it would have to be
a *REALLY* bad bug, and the chances of a bug like that being found are low.

Thanks,
Shawn



Re: Strange behavior of solr

2015-09-02 Thread Erik Hatcher
See example/films/README.txt

The “name” field is guessed incorrectly as a numeric type (because the first 
film has name=“.45”), so indexing errors out once it hits a name value that is 
not numeric.  The README provides a command to define the name field *before* 
indexing.  If you’ve already indexed and had the name field guessed (and 
created) incorrectly, you’ll need to delete and recreate the collection, then 
define the name field, then reindex.
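
From memory the command is along these lines (check the README for the exact
version, and substitute your own core name); it has to be run before posting
film.csv:

curl http://localhost:8983/solr/mycore/schema -X POST \
  -H 'Content-type:application/json' -d '{
    "add-field": { "name":"name", "type":"text_general", "multiValued":false, "stored":true }
  }'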

We used to have a fake film at the top to allow field guessing to “work”, but I 
felt that was too fake and that the example should be true to what happens with 
real world data and the pitfalls of allowing field type guessing to guess 
incorrectly.

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com




> On Sep 2, 2015, at 5:17 AM, Long Yan  wrote:
> 
> Hey,
> I have created a core with
> bin\solr create -c mycore
> 
> I want to index the csv sample files from solr-5.2.1
> 
> If I index film.csv under solr-5.2.1\example\films\, solr can only index this 
> file until the line
> "2046,Wong Kar-wai,Romance Film|Fantasy|Science 
> Fiction|Drama,,/en/2046_2004,2004-05-20"
> 
> But if I at first index books.csv under solr-5.2.1\example\exampledocs and 
> then index film.csv, solr can index all lines in film.csv
> 
> Why?
> 
> Regards
> Long Yan
> 
> 



Question about a strange behavior

2015-09-02 Thread Long Yan
Good day,
I have created a core with the following command
bin\solr create -c mycore

When I index the file film.csv under solr-5.2.1\example\films\, Solr can only 
index up to the line
"2046,Wong Kar-wai,Romance Film|Fantasy|Science 
Fiction|Drama,,/en/2046_2004,2004-05-20".

But if I first index the file books.csv under solr-5.2.1\example\exampledocs 
and then index film.csv, Solr can index all lines in film.csv.

Can someone please give me a hint as to what the cause might be?

Regards
Long Yan


Strange behavior of solr

2015-09-02 Thread Long Yan
Hey,
I have created a core with
bin\solr create -c mycore

I want to index the csv sample files from solr-5.2.1

If I index film.csv under solr-5.2.1\example\films\, solr can only index this 
file until the line
"2046,Wong Kar-wai,Romance Film|Fantasy|Science 
Fiction|Drama,,/en/2046_2004,2004-05-20"

But if I at first index books.csv under solr-5.2.1\example\exampledocs and then 
index film.csv, solr can index all lines in film.csv

Why?

Regards
Long Yan




String bytes can be at most 32766 characters in length?

2015-09-02 Thread Zheng Lin Edwin Yeo
Hi,

I would like to check: must a value in a string field be at most 32766 bytes
in length?

I'm trying to copyField the content of my rich-text documents to a field with
fieldType=string, to get distinct results on content, as there are several
documents with exactly the same content and we only want to list one of them
during searching.

However, I get the following errors for some of the documents when I try to
index them with the copyField in place. Some of my documents are quite large,
and it is possible that they exceed 32766 bytes. Is there any other way to
overcome this problem?
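
For reference, the copyField setup I'm describing is roughly this (the URL is
a placeholder; the field names are as described above), and the stack trace
below is the resulting error:

curl http://localhost:8983/solr/collection1/schema -H 'Content-type:application/json' -d '{
  "add-field": { "name":"signature", "type":"string", "indexed":true, "stored":true },
  "add-copy-field": { "source":"content", "dest":"signature" }
}'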


org.apache.solr.common.SolrException: Exception writing document id
collection1_polymer100 to the index; possible analysis error.
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:497)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Document contains at least
one immense term in field="signature" (whose UTF8 encoding is longer than
the max length 32766), all of which were skipped.  Please correct the
analyzer to not produce such terms.  The prefix of the first immense term
is: '[32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60, 98, 114, 62, 56,
48, 56, 32, 72, 97, 110, 100, 98, 111, 111, 107, 32, 111, 102]...',
original message: bytes can be at most 32766 in length; got 49960
at
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:670)
at
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
at
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
at

Re: Frage zu einem komischen Verhalten

2015-09-02 Thread Hasan Diwan
You might get a better response in English...

Vielleicht haben Sie eine bessere Antwort bekommen in... (from Google
Translate, as my own German is non-existent) -- H

2015-09-02 2:05 GMT-07:00 Long Yan :

> Guten Tag,
> ich habe einen Core mit dem folgendem Befehl erstellt
> bin\solr create -c mycore
>
> Wenn ich die Datei film.csv unter solr-5.2.1\example\films\ indexiere,
> kann solr nur bis die Zeile
> "2046,Wong Kar-wai,Romance Film|Fantasy|Science
> Fiction|Drama,,/en/2046_2004,2004-05-20" indexieren.
>
> Aber wenn ich zuerst die Datei books.csv unter
> solr-5.2.1\example\exampledocs und danach film.csv indexiere,
> kann solr alle Zeilen in film.csv indexieren.
>
> Kann Jemand mir bitte Hinweis geben, woran es liegen könnte?
>
> Grüße
> Long Yan
>



-- 
OpenPGP: https://hasan.d8u.us/gpg.key
Sent from my mobile device
Envoyé de mon portable


Re: Difference between Legacy Facets and JSON Facets

2015-09-02 Thread Zheng Lin Edwin Yeo
Q) What are some typical values for this "content" field (i.e. how
many different words does the content field contain for each document)?

A) They are indexed from Word and PDF documents; the largest is 278 pages
long (about 372000 bytes when indexed into Solr). There are thousands of
different words in each document.
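
For reference, the two requests I'm comparing look roughly like this (the
collection name is a placeholder):

# legacy field faceting
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=content&facet.limit=10&wt=json"

# JSON Facet API
curl http://localhost:8983/solr/collection1/query -d '{
  "query": "*:*",
  "limit": 0,
  "facet": { "content_terms": { "type": "terms", "field": "content", "limit": 10 } }
}'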

Regards,
Edwin


On 2 September 2015 at 19:45, Yonik Seeley  wrote:

> On Wed, Sep 2, 2015 at 1:19 AM, Zheng Lin Edwin Yeo
>  wrote:
> > The type of field is text_general.
>
> What are some typical values for this "content" field (i.e. how many
> different words does the content field contain for each document)?
>
> -Yonik
>
> > I found that the problem mainly happen in the content field of the
> > collections with rich text document.
> > It works fine for other files, and also collections indexed with CSV
> > documents, even if the fieldType is text_general.
> >
> > Regards,
> > Edwin
> >
> >
> > On 2 September 2015 at 12:12, Yonik Seeley  wrote:
> >
> >> On Tue, Sep 1, 2015 at 11:51 PM, Zheng Lin Edwin Yeo
> >>  wrote:
> >> > No, I've tested it several times after committing it.
> >>
> >> Hmmm, well something is really wrong for this orders of magnitude
> >> difference.  I've never seen anything like that and we should
> >> definitely try to get to the bottom of it.
> >> What is the type of the field?
> >>
> >> -Yonik
> >>
>


concept and choice: custom sharding or auto sharding?

2015-09-02 Thread scott chu
I posted a question on Stackoverflow: 
http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud
However, since this is a mailing list, I repost the question below to ask for 
suggestions and a more subtle understanding of SolrCloud's behavior around 
document routing.
I want to establish a SolrCloud cluster for over 10 million news articles. 
After reading the article "Shards and Indexing Data in SolrCloud" in the Apache 
Solr Reference Guide, I have a plan as follows:
Add a prefix such as ED2001! to the document ID, where ED identifies the 
newspaper source and 2001 is the year of the article's publication date, i.e. I 
want to put all news articles from a specific newspaper source published in a 
specific year into one shard.
Create the collection with router.name set to compositeId.
Add documents?
Query the collection?
Practically, I have some questions (a rough sketch of my plan follows below):
How do I add documents based on this plan? Do I have to specify special 
parameters when updating the collection/core?
Is this called "custom sharding"? If not, what is "custom sharding"?
Is auto sharding a better choice for my case, since it has a shard-splitting 
feature for when a shard grows too big?
Can I query without the _route_ parameter?
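
For concreteness, here is roughly what I have in mind (collection, config and 
field names are made up):

# Create the collection with compositeId routing; numShards is fixed at creation.
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=news&numShards=4&replicationFactor=2&router.name=compositeId&collection.configName=news_conf"

# The route key is just a prefix on the document id; no extra update parameters.
curl 'http://localhost:8983/solr/news/update?commit=true' -H 'Content-Type: application/json' -d '[
  { "id": "ED2001!some-unique-article-id", "title": "some article" }
]'

# Optionally limit a query to the shard(s) holding one route key.
curl 'http://localhost:8983/solr/news/select?q=*:*&_route_=ED2001!&wt=json'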
EDIT @ 2015/9/2:
This is how I think SolrCloud will behave: "The number of news articles from a 
specific newspaper source in a specific year tends to be around a fixed number, 
e.g. every year ED has around 80,000 articles, so each shard's size won't 
increase dramatically. For next year's ED articles, I only have to add the 
prefix 'ED2016!' to the document ID; SolrCloud will create a new shard for me 
(which contains all ED2016 articles), and later the leader will spread replicas 
of this new shard to other nodes (one replica per node other than the 
leader?)". Am I right? If yes, it seems there is no need for shard-splitting.


Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread Erick Erickson
Frankly, at 10M documents there's rarely a need to shard at all.
Why do you think you need to? This seems like adding
complexity for no good reason. Sharding should only really
be used when you have too many documents to fit on a single
shard as it adds some overhead, restricts some
possibilities (cross-core join for instance, a couple of
grouping options don't work in distributed mode etc.).

You can still run SolrCloud and have it manage multiple
_replicas_ of a single shard for HA/DR.
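
For example (collection and config names are made up), a one-shard,
three-replica collection is just:

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=news&numShards=1&replicationFactor=3&collection.configName=news_conf"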

So this seems like an XY problem: you're asking specific
questions about shard routing because you think it'll
solve some problem, without telling us what that problem
is.

Best,
Erick

On Wed, Sep 2, 2015 at 7:47 AM, scott chu  wrote:
> I post a question on Stackoverflow 
> http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud:
> However, since this is a mail-list, I repost the question below to request 
> for suggestion and more subtle concept of SolrCloud's behavior on document 
> routing.
> I want to establish a SolrCloud clsuter for over 10 millions of news 
> articles. After reading this article in Apache Solr Refernce guide: Shards 
> and Indexing Data in SolrCloud, I have a plan as follows:
> Add prefix ED2001! to document ID where ED means some newspaper source and 
> 2001 is the year part in published date of news article, i.e. I want to put 
> all news articles of specific news paper source published in specific year to 
> a shard.
> Create collection with router.name set to compositeID.
> Add documents?
> Query Collection?
> Practically, I got some questions:
> How to add doucments based on this plan? Do I have to specify special 
> parameters when updating the collection/core?
> Is this called "custom sharding"? If not, what is "custom sharding"?
> Is auto sharding a better choice for my case since there's a shard-splitting 
> feature for auto sharding when the shard is too big?
> Can I query without _router_ parameter?
> EDIT @ 2015/9/2:
> This is how I think SolrCloud will do: "The amount of news articles of 
> specific newspaper source of specific year tends to be around a fix number, 
> e.g. Every year ED has around 80,000 articles, so each shard's size won't 
> increase dramatically. For the next year's news articles of ED, I only have 
> to add prefix 'ED2016!' to document ID, SolrCloud will create a new shard for 
> me (which contains all ED2016 articles), and later the Leader will spread the 
> replica of this new shard to other nodes (per replica per node other than 
> leader?)". Am I right? If yes, it seems no need for shard-splitting.