Query function error - can not use FieldCache on multivalued field

2020-09-14 Thread Shamik Bandopadhyay
Hi,

  I'm trying to use the Solr query() function as a boost for term matches in
the title field. Here's my boost function:

bf=if(exists(query({!v='title:Import data'})),10,0)

This throws the following error --> can not use FieldCache on multivalued
field: data

The function only seems to work for a single term. The title field is not
multivalued, but it is configured to analyze terms. Here's the field
definition.

[field definition stripped by the list archive]

I was under the impression that I would be able to use the query function
to evaluate a regular field query. Am I missing something? If there's a
constraint on this function, can this boost be done in a different way?
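
For reference, a workaround that is often suggested for this class of error,
assuming the cause here is that bf gets split on whitespace before the
function parser runs (so "data" ends up parsed as a bare field name), is to
move the subquery into a separate request parameter and dereference it. A
sketch, where qq is a hypothetical parameter name:

bf=if(exists(query($qq)),10,0)
qq=title:(Import data)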

Any pointers will be appreciated.

Thanks,
Shamik


Lemmatizer for Solr

2020-02-14 Thread Shamik Bandopadhyay
Hi,
  I'm trying to replace the Porter stemmer with an English lemmatizer in my
analysis chain. Just wondering what is the recommended way of achieving this.
I've come across a few different implementations, which are listed below:

Open NLP -->
https://lucene.apache.org/solr/guide/7_5/language-analysis.html#opennlp-
lemmatizer-filter

https://opennlp.apache.org/docs/1.8.0/manual/opennlp.html#tools.lemmatizer

KStem Filter -->
https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#kstem-filter

There are a couple of third-party libraries, but I'm not sure whether they
are still maintained or compatible with the Solr version I'm using (7.5).

https://github.com/nicholasding/solr-lemmatizer
https://github.com/bejean/solr-lemmatizer

Currently, I'm looking for English-only lemmatization. Also, I need the
ability to update the lemma dictionary to add custom terms specific to our
organization (not sure if the KStem filter can do that).
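
For what it's worth, the OpenNLP route does support a plain-text lemma
dictionary, which would cover the custom-terms requirement. A minimal
analyzer sketch along the lines of the 7.5 ref guide linked above (the model
and dictionary file names are placeholders for whatever you train or
download):

<analyzer>
  <tokenizer class="solr.OpenNLPTokenizerFactory"
             sentenceModel="en-sent.bin"
             tokenizerModel="en-token.bin"/>
  <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
  <filter class="solr.OpenNLPLemmatizerFilterFactory"
          dictionary="lemmas.txt"
          lemmatizerModel="en-lemmatizer.bin"/>
</analyzer>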

Any pointers will be appreciated.

Regards,
Shamik


Lemmatizer for indexing

2019-10-14 Thread Shamik Bandopadhyay
Hi,
  I'm trying to use a lemmatizer in my analysis chain. Just wondering what
is the recommended way of achieving this. I've come across a few different
implementations, which are listed below:

Open NLP -->
https://lucene.apache.org/solr/guide/7_5/language-analysis.html#opennlp-lemmatizer-filter

https://opennlp.apache.org/docs/1.8.0/manual/opennlp.html#tools.lemmatizer

KStem Filter -->
https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#kstem-filter

There are a couple of third-party libraries, but I'm not sure whether they
are still maintained or compatible with the Solr version I'm using (7.5).

https://github.com/nicholasding/solr-lemmatizer
https://github.com/bejean/solr-lemmatizer

Currently, I'm looking for English-only lemmatization. Also, I need the
ability to update the lemma dictionary to add custom terms specific to our
organization (not sure if the KStem filter can do that).

Any pointers will be appreciated.

Regards,
Shamik


Numeric value ignored by EdgeNGramFilterFactory

2019-07-04 Thread Shamik Bandopadhyay
Hi,

   I'm using EdgeNGramFilterFactory to support partial search. Here's my
field definition.

<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
[rest of the field definition stripped by the list archive]

I run into an issue when searching for numeric terms. For example, if I
search for *72 hours*, EdgeNGramFilterFactory ignores 72 and only stores
*hou* and *hour* in the index. Since I'm using the AND operator, the query
fails to match *72 hours*. I could enable EdgeNGramFilterFactory in the query
chain, but I thought that would be unnecessary overhead. Is there a reason
why 72 is ignored, and what's the best way to address this scenario?
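
My guess (an assumption on my part, not confirmed here) is that minGramSize
is 3, and tokens shorter than minGramSize emit no grams at all, so "72"
simply disappears. Two possible fixes, sketched with assumed gram sizes:

<!-- keep the original token alongside its grams (needs Solr 7.3+) -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"
        preserveOriginal="true"/>

<!-- or allow single-character grams -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>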

Any pointers will be appreciated.

Thanks,
Shamik


Re: Problem with white space or special characters in function queries

2019-03-28 Thread shamik
Thanks Jan, I was not aware of this, appreciate your help.





Re: Problem with white space or special characters in function queries

2019-03-28 Thread shamik
Ahemad, I don't think it's related to the field definition; it looks more
like an inherent bug. For the time being, I created a copyField which uses a
custom regex to remove whitespace and special characters, and I use that
field in the function. I'll debug the source code to confirm whether it's a
bug, and will raise a JIRA if needed.





Re: Problem with white space or special characters in function queries

2019-03-27 Thread shamik
I'm using Solr 7.5, here's the query:

q=line=language:"english"=Source2:("topicarticles"+OR+"sfdcarticles")=url,title=ADSKFeature:"CUI+(Command)"^7=recip(ms(NOW/DAY,PublishDate),3.16e-11,1,1)^2+if(termfreq(ADSKFeature,'CUI
(Command)'),log(CaseCount),sqrt(CaseCount))=10





Re: Problem with white space or special characters in function queries

2019-03-26 Thread shamik
Edwin,

   The field is a string type; here's the field definition.

[field definition stripped by the list archive]

-Shamik





Problem with white space or special characters in function queries

2019-03-25 Thread Shamik Bandopadhyay
Hi,

   I'm having trouble handling white space or special characters in
function queries. Here's a sample function:

if(termfreq(ADSKFeature,'CUI (Command)'),log(CaseCount),sqrt(CaseCount))

I tried escaping the space with "\", but that didn't work either. Here's
the exception being thrown:

org.apache.solr.search.SyntaxError: Expected identifier at pos 0
str='(Command)'),log(CaseCount),sqrt(CaseCount))'

This happens with other special characters like &, ?, etc.

Do function queries have any limitations? Or am I doing something wrong?
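
For the record, one workaround that gets suggested for this class of problem
(assuming the cause is that bf/boost values are split on whitespace before
the function parser ever sees them) is to dereference the whole function
through a separate parameter, so the spaces never hit the splitting. A
sketch, where caseboost is a hypothetical parameter name:

bf=$caseboost
caseboost=if(termfreq(ADSKFeature,'CUI (Command)'),log(CaseCount),sqrt(CaseCount))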

Any pointers will be highly appreciated.

Thanks,
Shamik


Re: Solr recovery issue in 7.5

2018-12-17 Thread shamik
I'm still pretty clueless trying to find the root cause of this behavior. One
thing is pretty consistent: whenever a node restarts and sends a recovery
command, the recipient shard/replica goes down due to a sudden surge in
old-gen heap space. Within minutes, it hits the ceiling and stalls the
server, and this keeps everyone going in circles. After moving to 7.5, we
decided to switch to G1 from CMS. We are using the recommended settings from
Shawn's blog.

GC_TUNE="-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
-XX:OnOutOfMemoryError=/mnt/ebs2/solrhome/bin/oom_solr.sh"

Can this be tuned better to avoid this?

Also, I'm curious to know if any 7.5 user has experienced a similar scenario.
Could there be some major change related to recovery that I might be missing
after porting from 6.6?





Re: Solr recovery issue in 7.5

2018-12-14 Thread shamik
Thanks Erick. I guess I was not clear when I mentioned that I had stopped the
indexing process; it was just a temporary step to make sure we were not
adding any new data while the nodes were in recovery mode. The 10-minute hard
commit was carried over from our 6.5 configuration, which followed the
guiding principle of "Index light - Query light/heavy" from the same document
you mentioned. I will surely try a 15 sec hard commit with openSearcher=false
and a 10 min soft commit and see if it makes a difference. I'm working on
getting a heap dump to see if it shows any red flags.





Re: Solr recovery issue in 7.5

2018-12-12 Thread shamik
Erick,

   Thanks for your input. All our fields (for facet, group & sort) have
docvalues enabled since 6.5. That includes the id field. Here's the field
cache entry:

CACHE.core.fieldCache.entries_count:0
CACHE.core.fieldCache.total_size:  0 bytes

Based on whatever I've seen so far, I think ZooKeeper is not the culprit
here. All the nodes, including ZooKeeper, were set up recently. They are all
inside the same VPC, within the same AZ. The instances talk to each other
over a dedicated network, and both the ZooKeeper and Solr instances have
SSDs.

Here's what's happening, based on my observation. Whenever an instance is
restarted, it initiates a PREPRECOVERY command to its leader or a different
node in the other shard. The node which receives the recovery request is the
one which goes down next. Within a few minutes, the heap usage (old gen)
reaches the max allocated heap, stalling the process. I guess that because of
this, it fails to send the credentials for a ZooKeeper session within the
stipulated timeframe, which is why ZooKeeper terminates the session. Here's
the startup log:

2018-12-13 04:02:34.910 INFO 
(recoveryExecutor-4-thread-1-processing-n:x.x.193.244:8983_solr
x:knowledge_shard2_replica_n4 c:knowledge s:shard2 r:core_node9)
[c:knowledge s:shard2 r:core_node9 x:knowledge_shard2_replica_n4]
o.a.s.c.RecoveryStrategy Sending prep recovery command to
[http://x.x.240.225:8983/solr]; [WaitForState:
action=PREPRECOVERY=knowledge_shard2_replica_n6=x.x.x.244:8983_solr=core_node9=recovering=true=true=true]

The node sends a recovery command to its replica, which immediately triggers
the G1 old gen JVM pool to reach the max heap size. Please see the screenshot
below, which shows the sudden jump in heap size. We've made sure that the
indexing process is completely switched off at this point, so there's no
commit happening.

JVM Pool --> https://www.dropbox.com/s/5s0igznhrol6c05/jvm_pool_1.png?dl=0

I'm totally puzzled by this weird behavior; I've never seen anything like it
before. Could the G1GC settings be contributing to this issue?

From the ZooKeeper log:

2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
Accepted socket connection from /10.0.0.160:58376
2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
Accepted socket connection from /10.0.0.160:58378
2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@938] - Client
attempting to establish new session at /10.0.0.160:58376
2018-12-13 03:47:27,905 [myid:1] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x0, likely client has closed
socket
2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed
socket connection for client /10.0.0.160:58378 (no session established for
client)
2018-12-13 03:47:27,907 [myid:1] - INFO 
[CommitProcessor:1:ZooKeeperServer@683] - Established session
0x100c46d01440072 with negotiated timeout 1 for client /10.0.0.160:58376
2018-12-13 03:47:39,386 [myid:1] - INFO 
[CommitProcessor:1:NIOServerCnxn@1040] - Closed socket connection for client
/10.0.0.160:58376 which had sessionid 0x100c46d01440072





Solr recovery issue in 7.5

2018-12-12 Thread Shamik Bandopadhyay
but we have noticed that filter cache utilization has
drastically reduced (0.17) while document cache has gone up (0.61). It used
to be 0.9 and 0.3 in Solr 6.5.

Not sure what we are missing here in terms of the Solr upgrade to 7.5. I can
provide other relevant information.

Thanks,
Shamik


Re: Does ConcurrentUpdateSolrClient apply for SolrCloud ?

2018-10-24 Thread shamik
Thanks Erick, appreciate your help





Re: Does ConcurrentUpdateSolrClient apply for SolrCloud ?

2018-10-24 Thread shamik
Thanks Erick, that's extremely insightful. I'm not using batching, and that's
the reason I was exploring ConcurrentUpdateSolrClient. Currently, N threads
are reusing the same CloudSolrClient to send data to Solr. Of course, the
single point of failure was my biggest concern with
ConcurrentUpdateSolrClient; thanks for clarifying my doubt.

"You also want to be a little careful how hard you drive Solr if you're also
serving queries at the same time, the more cycles you use for indexing the
fewer are available to serve queries."

Our Solr servers are also used to serve queries (50-100/minute). Our hard
commit is set at 10 minutes while soft commit is disabled. Are there any best
practices (I know it's too generic, but specifically around indexing) that I
should follow?
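
For context, the client-side batching being discussed would look roughly like
this with SolrJ. A sketch under assumptions (the ZooKeeper hosts, collection
name, and batch size are all placeholders):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

CloudSolrClient client = new CloudSolrClient.Builder(
        Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty())
    .build();
client.setDefaultCollection("mycollection");

// Send documents in batches instead of one add() per document.
List<SolrInputDocument> batch = new ArrayList<>();
for (SolrInputDocument doc : docs) {     // docs: documents produced upstream
  batch.add(doc);
  if (batch.size() >= 1000) {            // batch size is a tunable assumption
    client.add(batch);
    batch.clear();
  }
}
if (!batch.isEmpty()) {
  client.add(batch);
}
client.commit();                         // or rely on autoCommit settings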







Does ConcurrentUpdateSolrClient apply for SolrCloud ?

2018-10-24 Thread Shamik Bandopadhyay
Hi,

   I'm looking into the possibility of using ConcurrentUpdateSolrClient for
indexing a large volume of data instead of CloudSolrClient. Having an
async, batching API seems to be a better fit for us, since we tend to index a
lot of data periodically. As I look into the API, I'm wondering whether it
can be used with SolrCloud.

ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient.Builder(url)
    .withThreadCount(100)
    .withQueueSize(50)
    .build();

The Builder object only takes a single URL; I'm not sure what that would be
in the case of SolrCloud. For example, if I've got two shards, each with a
couple of replicas, then what should the server URL be?

I was not able to find any relevant documentation or example to clarify my
doubt. Any pointers will be appreciated.

Thanks


Re: Multiple Queries per request

2018-10-02 Thread Shamik Sinha
Solr is accessed through REST-style calls over HTTP or HTTPS, and a single
request cannot carry multiple independent queries. What you can do, however,
is return all the necessary data in one response and group it according to
your needs.
Thanks and regards,
Shamik


On 02-Oct-2018 8:11 PM, "Greenhorn Techie" 
wrote:

Hi,

We are building a mobile app which would display results from Solr. At the
moment, the idea is to have multiple widgets / areas on the mobile screen,
with each area being served by a distinct Solr query. For example, the first
widget would display the customer's aggregated product usage, and the second
widget the time windows during which they are more active on the app.

As these two widgets have different field lists and query parameters, I was
wondering whether I can make a single call into Solr which would then send
back results catering to each widget separately. I have gone through the mail
archive but could not determine whether this is possible in Solr.

Any thoughts from the awesome community?

Thanks


Re: Regarding pdf indexing issue

2018-07-11 Thread Shamik Sinha
You may try the Tesseract tool to check data extraction from PDFs or images
and then proceed accordingly. As far as I understand, this kind of PDF is an
image and not data. A searchable PDF actually overlays the selectable text as
hidden text over the PDF image; those PDFs can be indexed and extracted. This
is mostly supported for English and other Latin-derived scripts, and you may
face problems extracting/indexing text in other languages. Handwritten text
converted to PDFs is next to impossible to index/extract. Apache Tika may be
the solution you are looking for.
On Wed 11 Jul, 2018, 9:12 PM Walter Underwood, 
wrote:

> PDF is not a structured document format. It is a printer control format.
>
> PDF does not have a paragraph marker. Instead, it says to move
> to this spot on the page, choose this font, and print this letter. For a
> paragraph, it moves farther. For the next letter in a word, it moves a
> little bit. Extracting paragraphs from that is a difficult pattern
> recognition
> problem.
>
> I worked with a PDF of a two-column magazine article that printed
> the first line of column 1, then the first line of column 2, then the
> second line of column 1, and so on. If a line ended with a hyphenated
> word, too bad.
>
> Extracting structure from a PDF document is somewhere between
> very hard and impossible. Someone I worked with said that getting
> structured text from PDF was like turning hamburger back into a cow.
>
> Since Acrobat 5, there is “tagged PDF”. I’m not sure how widely that
> is used. It appears to be an accessibility feature, so it still might not
> be useful for search.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jul 11, 2018, at 8:07 AM, Erick Erickson 
> wrote:
> >
> > Solr will not do this automatically, the Extracting Request Handler
> > simply indexes the entire contents of the doc without regard to things
> > like paragraphs etc. Ditto with HTML. This is actually a task that
> > requires getting into Tika and using all the bells and whistles there.
> >
> > I'd recommend two things:
> >
> > 1> Take the PDF parsing offline, i.e. in a separate client. There are
> > many reasons for this, in particular you can attempt to do what you're
> > asking. See: https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > 2> Talk to the Tika folks about the best ways to make Tika return the
> > information such that you can index them and get what you'd like.
> >
> > Best,
> > Erick
> >
> > On Wed, Jul 11, 2018 at 6:35 AM, Rahul Prasad Dwivedi
> >  wrote:
> >> Hello Team,
> >>
> >> I am using the Solr for indexing and searching for pdf document
> >>
> >> I have gone through your website documentation and installed Solr, but
> >> I am unable to index and search the document.
> >>
> >> For example: Suppose we have a PDF file which has a number of paragraphs,
> >> each with a separate heading.
> >>
> >> So if I search for a heading in the indexed PDF, the result should contain
> >> the paragraph to which that heading belongs.
> >>
> >> I am unable to perform this task.
> >>
> >> I have run the below command for upload the pdf
> >>
> >> *bin/post -c gettingstarted pdf-sample.pdf*
> >>
> >> and for searching I am running the command
> >>
> >> *curl http://localhost:8983/solr/gettingstarted/select?q='*
> >>  >>
> >> Please suggest me anything and let me know if I am missing anything
> >>
> >> Thanks,
> >>
> >> Rahul
>
>


Error using multiple terms in function query

2018-05-15 Thread Shamik Bandopadhyay
Hi,

  I'm having issues using multiple terms in Solr function queries. For
example, I'm trying to use the following bf function using termfreq:

bf=if(termfreq(ProductLine,'Test Product'),5,0)

This throws  org.apache.solr.search.SyntaxError: Missing end to unquoted
value starting at 28 str='if(termfreq(ProductLine,Test'

If I use only Test or Product, i.e. a single term, the function works as
expected. I'm seeing a similar problem with other functions like docfreq,
ttf, tf, idf, etc. I'm using 6.6 but verified the same issue in 5.x as well.

Just wondering if this is an existing issue or something not supported by
Solr. Is there an alternate way to check multiple terms in a function?
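
One detail worth noting (my reading of the error, not confirmed in this
thread): the bf value appears to be split on whitespace before the function
parser runs, so the quoted string is cut at the space and the quote never
closes. A sketch of the parameter-dereferencing workaround, where pl is a
hypothetical parameter name:

bf=if(termfreq(ProductLine,$pl),5,0)
pl=Test Product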

Any pointers will be appreciated.


Re: Text in images are not extracted and indexed to content

2018-04-10 Thread Shamik Sinha
To index text in images, the image needs to be searchable, i.e. text needs
to be overlaid on the image, like a searchable PDF. You can do this using
OCR, but it is a bit unreliable if the images are scanned copies of written
text.

On 10-Apr-2018 4:12 PM, "Rahul Singh"  wrote:

May need to extract outside SolR and index pure text with an external
ingestion process. You have much more control over the Tika attributes and
behaviors.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation


On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo ,
wrote:
> Hi,
>
> Currently I am facing an issue whereby the text in image files like jpg and
> bmp is not being extracted and indexed. After the indexing, Tika did extract
> all the metadata and index it under the attr_* fields. However, the content
> field is always empty for image files. For other types of document files,
> like .doc, the content is extracted correctly.
>
> I have already updated tika-parsers-1.17.jar, under
> \prg\apache\tika\parser\pdf\, to set extractInlineImages to true.
>
>
> What could be the reason?
>
> I have just upgraded to Solr 7.3.0.
>
> Regards,
> Edwin


Re: Error when indexing with SolrJ HTTP ERROR 405

2018-03-19 Thread Shamik Sinha
You need to send binary content instead of HTML; at least that is what the
error shows.

I also think the URL is wrong. The correct URL should be
http://localhost:8983/solr/core/update
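
A minimal SolrJ sketch of the corrected client construction (assuming the
core is named corename; the '#' in the original URL is an admin UI fragment,
not part of the API path):

SolrClient client = new HttpSolrClient.Builder(
    "http://localhost:8983/solr/corename").build();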


First check whether indexing works on the same data using the browser-based
tools, and check the URL used there. Then, based on your requirements, decide
whether to use DIH or OOB indexing.
Thanks and regards,
Shamik

On Mon 19 Mar, 2018, 1:02 PM Khalid Moustapha Askia, <
m.askiakha...@gmail.com> wrote:

> Hi. I am trying to index some data with Solr by using SolrJ. But I have
> this error that I can't solve.
>
>
> -
> Exception in thread "main"
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> from server at http://localhost:8983/solr/#/corename: Expected mime type
> application/octet-stream but got text/html. 
> 
> 
> Error 405  HTTP POST method is not supported by this URL
> 
> HTTP ERROR 405
> Problem accessing /solr/index.html. Reason:
> Error 405  HTTP POST method is not supported by this
> URL
> 
> 
>
> at
>
> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:558)
> at
>
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:259)
> at
>
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
> at
> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85)
> at indexsolr.index(indexsolr.java:33)
> at LoadData.toIndex(LoadData.java:102)
> at LoadData.loadDocuments(LoadData.java:72)
> at IndexLaunch.main(IndexLaunch.java:12)
>
>
> --
>
> This is how I connect (I am in local):
>
> 
>
> SolrClient client = new HttpSolrClient.Builder("
> http://localhost:8983/solr/#/corename;).build();
>
> When I remove the "#" It throws a NullPointerException
>
> I have been struggling for a week with this indexing...
>


SolrCloud collection design considerations / best practice

2017-11-13 Thread Shamik Bandopadhyay
Hi,

I'm looking for some input on design considerations for defining
collections in a SolrCloud cluster. Right now, our cluster consists of two
collections in a 2 shard / 2 replica mode. Each collection has a dedicated
set of sources that don't overlap, which made it an easy decision.
Recently, we got a requirement to index a bunch of new sources that are
region-based. The search results for those regions need to come from their
region-specific source as well as from sources in one of our existing
collections. Here's an example of our existing collections and their
corresponding source(s).

Existing Collection:
--
Collection A --> Source_A, Source_B
Collection B --> Source_C, Source_D, Source_E

Proposed Collection:

Collection_Asia --> Source_Asia, Source_C, Source_D, Source_E
Collection_Europe --> Source_Europe, Source_C, Source_D, Source_E
Collection_Australia --> Source_Australia, Source_C, Source_D, Source_E

The proposed collections show that each geo has its dedicated source as
well as source(s) from the existing Collection B.

Just wondering if creating a dedicated collection for each geo is the right
approach here. The main motivation is to support a geo-specific relevancy
model which can easily be customized without stepping on the others. On
the downside, I'm not sure if it's a good idea to replicate data from the
same source across various collections. Moreover, the data within the
sources are not relational, so joining across collections might not be
an easy proposition.
The other consideration is the hardware design. Right now, both shards and
their replicas run on their own dedicated instances. With two collections, we
sometimes run into OOM scenarios, so I'm a little bit worried about adding
more collections. Does best practice (I know it's subjective) in
scenarios like this call for a dedicated Solr cluster per collection? From
an index size perspective, Source_C, Source_D and Source_E combine to close
to 10 million documents with a 60gb volume size. Each geo-based source is
small and won't exceed 500k documents.

Any pointers will be appreciated.

Thanks,
Shamik


Re: Solr nodes going into recovery mode and eventually failing

2017-10-23 Thread shamik
Thanks Emir and Zisis.

I added maxRamMB for the filterCache and reduced the size. I could see the
benefit immediately; the hit ratio went to 0.97. Here's the configuration:
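
(The actual entry was stripped by the list archive; a sketch of what it
presumably looked like, with the size values assumed from the figures
mentioned in this thread, i.e. a 500mb cap:)

<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"
             maxRamMB="500"/>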





It seemed to be stable for a few days; the cache hits and JVM pool
utilization stayed well within the expected range. But the OOM issue occurred
on one of the nodes as the heap size reached 30gb. The hit ratios for the
query result cache and document cache at that point were recorded as 0.18 and
0.65. I'm not sure the caches caused the memory spike at this point; with the
filter cache restricted to 500mb, its contribution should be negligible. One
thing I noticed is that the eviction rate now (with the addition of maxRamMB)
stays at 0. The index hard commit happens every 10 min, which is when the
cache gets flushed. Based on the monitoring log, the spike happened on the
indexing side, where almost 8k docs went into a pending state.

From the query performance standpoint, there have been occasional slow
queries (1 sec+), but nothing alarming so far. The same goes for deep paging;
I haven't seen any evidence pointing to that.

Based on the hit ratios, I can further scale down the query result and
document caches, change to FastLRUCache, and add maxRamMB. For the filter
cache, I think this setting should be optimal enough on a 30gb heap, unless
I'm wrong about the maxRamMB concept. I'll have to get a heap dump somehow;
unfortunately, the whole process (of the node going down) happens so quickly
that I hardly have any time to run a profiler.





Re: Solr nodes going into recovery mode and eventually failing

2017-10-20 Thread shamik
Zisis, thanks for chiming in. This is really interesting information and
probably in line with what I'm trying to fix. In my case, the facet fields
are certainly not high-cardinality ones. Most of them have a finite set of
values, the max being 200 (and that field has a low usage percentage).
Earlier I had facet.limit=-1, but then scaled down to 200 to eliminate any
performance overhead.

I was not aware of the maxRamMB parameter; it looks like it's only available
for queryResultCache. Is that what you are referring to? Can you please share
your cache configuration?





Re: Solr nodes going into recovery mode and eventually failing

2017-10-20 Thread shamik
Thanks Erick, in my case each replica is running on its own JVM, so even if
we consider 8gb of filter cache, it still has 27gb to play with. Isn't that a
decent amount of memory to handle the rest of the JVM operations?

Here's an example of the implicit filters that get applied to almost all the
queries. Except for Source2 and AccessMode, the fields all have docValues
enabled. Our sorting is done mostly on relevance, so there's little impact
there.

fq=language:("english")=ContentGroup:"Learn & Explore" OR "Getting
Started" OR "Troubleshooting" OR "Downloads")=Source2:("Help" OR
"documentation" OR "video" OR (+Source2:"discussion" +Solution:"yes") OR
"sfdcarticles" OR "downloads" OR "topicarticles" OR "screen" OR "blog" OR
"simplecontent" OR "auonline" OR "contributedlink" OR "collection") AND
-workflowparentid:[*+TO+*] AND -AccessMode:"internal" AND -AccessMode:"beta"
AND -DisplayName:Partner AND -GlobalDedup:true AND -Exclude:"knowledge" AND
-Exclude:"all" =recip(ms(NOW/DAY,PublishDate),3.16e-11,1,1)^1.0

As you can see, there's a bunch of them, so the filter cache is important
for us from a performance standpoint. The hit ratio of 25% is abysmal, and I
don't think there are too many unique queries contributing to it. As I
mentioned earlier, increasing the size parameter does improve the hit
count. Just wondering, what are the best practices around scenarios like
this? It looks like I've pretty much exhausted my options :-).





Re: Solr nodes going into recovery mode and eventually failing

2017-10-19 Thread shamik
Thanks Emir. The index is equally split between the two shards, each having
approx 35gb. The total number of documents is around 11 million, which should
be distributed equally between the two shards, so each core should take 3gb
of heap for a full cache. Not sure I get the "multiply it by the number of
replicas" part; shouldn't each replica have its own 3gb cache? Moreover,
based on the SPM graph, the max filter cache size during the outages has been
1.5 million entries.

The majority of our queries depend heavily on some implicit filters plus
user-selected ones. Reducing the filter cache size to the current 4096 has
taken a hit in performance. Earlier (in 5.5), I had a max cache size of
10,000 (running on a 15gb heap) which produced a 95% hit rate. With the
memory issues in 6.6, I started reducing it to the current value, which cut
the hit rate to 25%. I had also tried reducing the value further; it still
didn't help, which is when I decided to go for a higher-RAM machine.
What I've noticed is that the heap stays consistently around the 22-23gb
mark, of which G1 old gen takes close to 13gb and G1 eden space around 6gb,
with the rest shared by G1 survivor space, Metaspace, and the code cache.

This issue has been bothering me, as I seem to be running out of possible
tuning options. What I can see from the monitoring tool is that the surge
period saw around 400 requests/hr, with 40 docs/sec being indexed. Is that
really a high volume of load for a cluster of 6 nodes with 16 CPUs / 64gb
RAM? What other options should I be looking into?

The other thing I'm still confused about is why the recovery fails when the
memory has been freed up.





Solr nodes going into recovery mode and eventually failing

2017-10-18 Thread Shamik Bandopadhyay
role":[
"admin",
"dev",
"read"]}
INFO831841[qtp1389808948-125836] -
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
USER_REQUIRED auth header null context : userPrincipal: [null] type:
[READ], collections: [knowledge,], Path: [/select] path : /select params
:q=*:*=false=_docid_+asc=0=javabin=2
INFO831840[qtp1389808948-125855] -
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
USER_REQUIRED auth header null context : userPrincipal: [null] type:
[READ], collections: [knowledge,], Path: [/select] path : /select params
:q=*:*=false=_docid_+asc=0=javabin=2
ERROR831840[qtp1389808948-125740] -
org.apache.solr.common.SolrException.log(SolrException.java:148) -
org.apache.solr.common.SolrException: no servers hosting shard: shard1
at
org.apache.solr.handler.component.HttpShardHandler.prepDistributed(HttpShardHandler.java:413)
at
org.apache.solr.handler.component.SearchHandler.getAndPrepShardHandler(SearchHandler.java:227)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:265)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)



Not sure what the issue is here and how to address it. I've played around
with different memory settings but haven't been successful so far. Also, I'm
not sure why it affects the entire cluster. When I restart the instance,
it goes into recovery mode and updates its index with the delta, which is
understandable. But at the same time, the other replica in the same shard
stalls and goes offline. This starts a cascading effect, and I end up
restarting all the nodes.

Any pointers will be appreciated.

Thanks,
Shamik


Authentication error : request has come without principal. failed permission

2017-10-02 Thread Shamik Bandopadhyay
Hi,

  I'm seeing random authentication failures in our SolrCloud cluster
which eventually leave the nodes in the "down" state. This doesn't seem
to have a pattern; it just starts to happen out of the blue. I've got 2
shards, each having two replicas, and they use the Solr basic
authentication plugin.

Here's the error log:

org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147)
- request has come without principal. failed permission {
  "name":"select",
  "collection":"knowledge",
  "path":"/select",
  "role":[
"admin",
"dev",
"read"]}

org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
USER_REQUIRED auth header null context : userPrincipal: [null] type:
[READ], collections: [knowledge,], Path: [/select] path : /select params
:q=*:*=false=_docid_+asc=0=javabin=2

It eventually hits a ZooKeeper session timeout and the node disappears from
the cluster.

org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1156) - *Client
session timed out, have not heard from server in 229984ms for sessionid
0x35ec984bea00016, closing socket connection and attempting reconnect*

If I restart the node, it goes into recovery mode, but at the same time
the other healthy replica starts throwing the authentication error and
eventually spirals into the down state. This happens across all the nodes
until each one has gone through a restart cycle.

Here are a couple of other exceptions I've seen in the log:

org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1539)
- Exception while writing response for params:
generation=14327=/replication=_1cww.fdt=127926272=true=filestream=filecontent
java.io.IOException: java.util.concurrent.TimeoutException: *Idle timeout
expired: 50001/5 ms*
at
org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:219)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:220)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:491)
at
org.apache.commons.io.output.ProxyOutputStream.write(ProxyOutputStream.java:90)
at
org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:213)
at
org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:83)
at
org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1520)
at org.apache.solr.core.SolrCore$3.write(SolrCore.java:2601)
at
org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:49)
at
org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:809)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:538)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)


org.apache.solr.security.PKIAuthenticationPlugin.parseCipher(PKIAuthenticationPlugin.java:175)
- *Decryption failed , key must be wrong*
java.security.InvalidKeyException: No installed provider supports this key:
(null)
at javax.crypto.Cipher.chooseProvider(Cipher.java:893)
at javax.crypto.Cipher.init(Cipher.java:1249)
at javax.crypto.Cipher.init(Cipher.java:1186)
at org.apache.solr.util.CryptoKeys.decryptRSA(CryptoKeys.java:277)
at
org.apache.solr.security.PKIAuthenticationPlugin.parseCipher(PKIAuthenticationPlugin.java:173)
at
org.apache.solr.security.PKIAuthenticationPlugin.decipherHeader(PKIAuthenticationPlugin.java:160)
at
org.apache.solr.security.PKIAuthenticationPlugin.doAuthenticate(PKIAuthenticationPlugin.java:118)
at
org.apache.solr.servlet.SolrDispatchFilter.authenticateRequest(SolrDispatchFilter.java:430)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)


Re: Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-22 Thread shamik
Susheel, my inference was based on the QTime values from the Solr log, not
the application log. Before the CPU spike, the query times gave no indication
that queries were slow or in the process of slowing down. As the GC suddenly
triggers high CPU usage, query execution slows down or chokes, but that can
easily be attributed to the lack of available processing power.

I'm curious to know what the recommended hardware is for 6.6 with a 50gb
index and 15 million+ documents.





Re: Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-22 Thread shamik
I usually log queries that take more than 1 sec. Based on the logs, I haven't
seen anything alarming, or a surge in slow queries, especially around the
time the CPU spike happened.

I don't necessarily have data on deep paging, but the usage of the sort
parameter (date, in our case) has been typically low. We also restrict
results to 10 per page for pagination. Are there any recommendations around
this?

Again, I don't want to sound like a broken record, but I still don't get why
these issues crop up in 6.6 compared to 5.5.





Re: Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-22 Thread shamik
All the tuning and scaling down of memory seemed to be stable for a couple
of days, but then the cluster came down due to a huge spike in CPU usage,
contributed by G1 Old Generation GC. I'm really puzzled why the instances are
suddenly behaving like this. It's not that a sudden surge of load contributed
to it; the query and indexing load was comparable with the previous time
frame. I'm just wondering if the hardware itself is inadequate for 6.6.
The instances are all running on 8 CPU / 30gb m3.2xlarge EC2 instances.

Has anyone ever faced issues similar to this?





Re: Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-19 Thread shamik
Emir, after digging deeper into the logs (using New Relic and the Solr
admin) during the outage, it looks like a combination of query load and the
indexing process triggered it. Based on the earlier pattern, memory would
increase at a steady pace but then surge all of a sudden, triggering OOM.
After I scaled down the heap size per Walter's suggestion, the memory seems
to have been holding up. But there's a possibility the lower heap size might
have forced the GC to use more CPU. The cache sizes have been scaled down,
and I'm hoping they no longer add overhead after every commit.

I've facet.limit=-1 configured for a few search types, but facet.mincount is
always set to 1. I didn't know that's detrimental to docValues.





Re: Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-19 Thread shamik
Thanks, the change seems to have addressed the memory issue (so far), but on
the flip side the GC choked the CPUs: utilization across the cluster clocked
close to 400%, literally stalling everything. At first look, G1 Old
generation collection looks to be the culprit, taking up 80% of the CPU. I'm
not sure what really triggered it, as the GC seemed to have been stable until
then. The other thing I noticed was that the mlt queries (I'm using the mlt
query parser for cloud support) took a huge amount of time to respond (10
sec+) during the CPU spike compared to the rest. Then again, that might just
be due to the CPU.

The index might not be large enough to merit a couple of shards, but it has
never been an issue for the past couple of years on 5.5. We never had a
single outage related to memory or CPU. The query/indexing load has increased
over time, but it has been linear. I'm a little baffled why 6.6 would behave
so differently. Perhaps the hardware is not adequate? I'm running on 8
core / 30gb machines with SSDs.





Re: Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-18 Thread shamik
I agree, I should have made it clear in my initial post. The reason I
thought it was trivial is that the newly introduced collection has only a few
hundred documents and is not being used in search yet, nor is it being
indexed at a regular interval. Its cache parameters are kept to a minimum as
well. But there might be overheads to simply creating a collection that I'm
not aware of.

I did bring the heap size down to 8gb, changed to G1, and reduced the cache
params. The memory has been holding up so far, but I'll wait for a while
before passing judgment.

[cache settings stripped by the list archive]

The change seems to have increased the number of slow queries (>1000 ms),
but I'm willing to trade performance for addressing the OOM at this point.
One thing I realized is that I provided the wrong index size here: it's 49gb
instead of 25, which I had mistakenly picked from one shard. I hope the heap
size will continue to sustain that index size.





Re: Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-18 Thread shamik
Walter, thanks again. Here's some information on the index and the search
features.

The index size is close to 25gb, with 20 million documents. It has two
collections, one being introduced with the 6.6 upgrade. The primary
collection carries the bulk of the index; the newly formed one is aimed at
being populated going forward. Besides keyword search, we use a bunch of
facets, which are configured to use docValues. The notable search features
being used are the highlighter, query elevation, mlt, and the suggester. The
other change from 5.5 was replacing the Porter stemmer with a lemmatizer in
the analysis chain.





Re: Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-18 Thread shamik
Thanks for your suggestion, I'm going to tune it and bring it down. It just
happened to carry over from the 5.5 settings. Per Walter's suggestion, I'm
going to reduce the heap size and see if it addresses the problem.





Re: Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-18 Thread shamik
Apologies, 290gb was a typo on my end; it should read 29gb instead. I
started with my 5.5 configuration of limiting the heap to 15gb, but nodes
started going down once usage reached the 15gb ceiling. I tried bumping it up
to 29gb, since memory seemed to stabilize at 22gb after running for a few
hours; of course, it didn't help eventually. I did try the G1 collector.
Though garbage collection was happening more efficiently compared to CMS, it
still brought the nodes down after a while.

The part I'm trying to understand is whether the memory footprint is higher
in 6.6 and whether I need instances with more RAM (>30gb in my case). I
haven't added any post-5.5 features, which should rule out the possibility of
a memory leak.





Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-18 Thread Shamik Bandopadhyay
Hi,

   I recently upgraded to Solr 6.6 from 5.5. After running for a couple of
days, the entire Solr cluster suddenly came down with OOM exceptions. Once
the servers are restarted, the memory footprint stays stable for a while
before a sudden spike occurs. The heap surges quickly and hits the max,
causing the JVM to shut down with OOM. It starts with one server but
eventually trickles down to the rest of the nodes, bringing the entire
cluster down within a span of 10-15 mins.

The cluster consists of 6 nodes, with two shards having 2 replicas each.
There are two collections with a total index size close to 24gb. Each server
has 8 CPUs with 30gb memory. Solr is running on embedded Jetty on JDK 1.8.
The JVM parameters are identical to 5.5:

SOLR_JAVA_MEM="-Xms1000m -Xmx29g"

GC_LOG_OPTS="-verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime"

GC_TUNE="-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled"

I've tried G1GC based on Shawn's wiki, but it didn't make any difference.
Though G1GC seemed to do well with GC initially, it showed similar
behaviour during the spike, which prompted me to revert back to CMS.

I'm doing a hard commit every 5 mins.

SOLR_OPTS="$SOLR_OPTS -Xss256k"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=300000"
SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=12"

Other Solr configurations (XML element names stripped by the list archive):

${solr.autoSoftCommit.maxTime:-1}

Cache settings:

4096
1000
true
200
400

I'm not sure what has changed so drastically in 6.6 compared to 5.5. I
never had a single OOM in 5.5, which had been running for a couple of years.
Moreover, the memory footprint was much smaller with 15gb set as Xmx. All my
facet fields have docValues enabled, which should handle the memory part
efficiently.

I'm struggling to figure out the root cause. Does 6.6 demand more memory
than what is currently available on our servers (30gb)? What might be the
probable cause for this sort of scenario? What are the best practices for
troubleshooting such issues?

Any pointers will be appreciated.

Thanks,
Shamik


Help with Query/Function for conditional boost

2017-08-16 Thread Shamik Bandopadhyay
Hi,

   I'm trying to create a function to dynamically boost a field based
on specific values of another searchable field. Here's an example:

I've the following query fields with default boost.

qf=text^2 title^4 command^8

Also, there's a default boost on the source field.

bq=source:help^10 source:forum^5

Among the searchable fields, command gets the highest preference. On top of
that, I would like to boost results from source help further when a
query term exists in the command field. With my current settings, documents
from the forum appear at the top when a search term is found in the command
field; increasing the boost on source:help didn't make any difference.

Just wondering if it's possible to write a function which will
conditionally boost the command field for documents tagged with source=help:

if(termfreq(source,'help'), command^8, command^1)

The above function is just for reference, to show what I'm trying to achieve.
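
Since per-field boosts can't be switched on and off inside a function, one
rough approximation (an untested sketch of mine; cmdq is a hypothetical
parameter and 3 an arbitrary multiplier) is a multiplicative edismax boost
that only fires when both conditions hold:

boost=if(and(exists(query($cmdq)),termfreq(source,'help')),3,1)
cmdq={!edismax qf=command v=$q}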

Any pointers will be helpful.

Thanks,
Shamik


Re: Issues trying to boost phrase containing stop word

2017-07-20 Thread shamik
Any suggestion?





Re: Issues trying to boost phrase containing stop word

2017-07-19 Thread shamik
Hi Koji,

   I'm using a copy field to preserve the original term with stopwords. It's
mapped to titleExact.

[copyField definition stripped by the list archive]

textExact definition:

[field type definition stripped by the list archive]

I'm using minimal analysis to keep the original query terms in titleExact,
which is exactly what it is doing. I'm not sure how adding a shingle filter
would help here.

adsktext does all the heavy lifting of removing the stopwords and applying
the stemmers.





Re: Issues trying to boost phrase containing stop word

2017-07-19 Thread shamik
Thanks Koji, I've tried KeywordRepeatFilterFactory, which keeps the original
term, but the stopword filter in the analysis chain removes it nonetheless.
That's why I thought of creating a separate field devoid of
stopwords/stemmers. Let me know if I'm missing something here.





Issues trying to boost phrase containing stop word

2017-07-19 Thread Shamik Bandopadhyay
u must specifically save the block definition while you are
in the Block Editor.
Use the Block Editor to edit, correct, and save a
block definition.


SOLR1003
About Adding Parameters to Dynamic Blocks
Parameters determine the geometry that will be
affected by an action when you manipulate a block reference. When you add a
parameter to a dynamic block definition, grips are displayed at key points
of the parameter. Key points are the parts of a parameter that you use to
manipulate the block reference. For example, a linear parameter has key
points at its base point and end point. You can manipulate the parameter
distance from either key point. You can specify grip size and color for
display in the Block Editor. This setting does not affect the size and
color of the grips in a block reference.
Parameters determine the geometry that will be
affected by an action when you manipulate a block reference.


SOLR1004
About Adding Actions to Dynamic Blocks
Actions define how the geometry of a dynamic block
reference will move or change when its grips are manipulated. In general,
you associate an action with a parameter and the following: Key point . The
point on a parameter that drives the action. Selection set . The geometry
that will be affected by the action. When you move the grip in the example
above, only the geometry in the selection set is stretched. Specify
Distance and Angle Override Values Distance multiplier and angle offset
override properties allow you to specify a factor by which a parameter
value is increased or decreased. Action overrides are properties of actions
that have no effect on the block reference until it is manipulated in a
drawing. Use distance multiplier overrides with the following actions: Move
action Stretch action Polar Stretch action You can specify these action
override properties on the command line when you add an action to a dynamic
block definition. You can also specify these properties in the Properties
palette when you select an action in the Block Editor.
Actions define how the geometry of a dynamic
block reference will move or change when its grips are manipulated.


SOLR1005
Dynamic Block Grip Reference
This table describes the grips and how they're used.
Grip Type Grip Movement or Result Parameters: Associated Actions Standard
Within a plane in any direction Base: None Point: Move, Stretch, Polar:
Move, Scale, Stretch, Polar Stretch, Array XY: Move, Scale, Stretch, Array
Linear Back and forth in a defined direction or along an axis Linear: Move,
Scale, Stretch, Array Rotation Around an axis Rotation: Rotate Flip
Switches to a mirror image of the block geometry Flip: Flip Alignment
Within a plane in any direction; when moved over an object, triggers the
block reference to align with the object Alignment: None (action is
implied) Lookup Displays a list of values Visibility: None (action is
implied) Lookup: Lookup
This table describes the grips and how they're
used.


SOLR1006
To Open a Drawing Saved as a Dynamic Block (Block
Editor)
Click the Application button Open Drawing. Open the
drawing file that is saved as a block. An alert states that the drawing
contains authoring elements. In the alert dialog box, click Yes to open the
drawing in the Block Editor.
None




Here's my query:
http://localhost:8983/solr/techproducts/browse?q=About%20dynamic%20blocks=xml=title,subject,score=off=off=off=title
^5%20titleExact^1%20subject^1%20description^1=true=100=OR

It skips SOLR1004 and SOLR1005, since they don't have the term "about".
Adding q.op=OR didn't make any difference in the result. Here's the debug
output for both q.op=AND and OR:

AND:
(+(+DisjunctionMaxQuery((titleExact:about))
+DisjunctionMaxQuery((titleExact:dynamic | description:dynamic |
(title:dynamic)^5.0 | subject:dynamic))
+DisjunctionMaxQuery((titleExact:blocks | description:block |
(title:block)^5.0 | subject:block/no_coord

+(+(titleExact:about) +(titleExact:dynamic
| description:dynamic | (title:dynamic)^5.0 | subject:dynamic)
+(titleExact:blocks | description:block | (title:block)^5.0 |
subject:block))

OR:
(+(DisjunctionMaxQuery((titleExact:about))
DisjunctionMaxQuery((titleExact:dynamic | description:dynamic |
(title:dynamic)^5.0 | subject:dynamic))
DisjunctionMaxQuery((titleExact:blocks | description:block |
(title:block)^5.0 | subject:block)))~3)/no_coord

+(((titleExact:about) (titleExact:dynamic
| description:dynamic | (title:dynamic)^5.0 | subject:dynamic)
(titleExact:blocks | description:block | (title:block)^5.0 |
subject:block))~3)

Apologies for the long thread, but I'm not sure what I'm doing wrong. I'd
appreciate it if someone could provide pointers. If there's a different
approach to solving this issue, please let me know.

Thanks,
Shamik


Re: How to combine third party search data as top results ?

2017-02-06 Thread shamik
Charlie, this looks very close to what I'm looking for. Just wondering if
you've made this available as a jar, or whether it can be built from source?
Our Solr distribution is not built from source, so I can only use an
external jar. I'd appreciate it if you can let me know.





Re: How to combine third party search data as top results ?

2017-02-01 Thread shamik
Charlie, thanks for sharing the information. I'm going to take a look and get
back to you.





Re: How to combine third party search data as top results ?

2017-01-31 Thread shamik
Thanks, John.

The title is not unique, so I can't really rely on it. Also, keeping an
external mapping of URL to id might not be feasible, as we are talking about
possibly millions of documents.

URLs are unique in our case; unfortunately, they can't be used with the
Query Elevation Component, since it only accepts ids. As you've mentioned, I
can probably apply a huge boost factor to each of these URLs (through "bq")
and see if they appear at the top in order.

I was hoping for an elegant solution to this :-)






How to combine third party search data as top results ?

2017-01-31 Thread Shamik Bandopadhyay
Hi,

  I'm trying to integrate results from a third-party source with our
existing search. The idea is to include the top 5 results from this source
as the top results of our search. Though the external data is indexed in our
system, the use case dictates that we use their ranking (by taking their top
five results). The problem is, their results return only text, title, and
url. To construct the final response, I need to include a bunch of metadata
fields which are only available in our index. Here are the steps:
1. Query external source, get top five results.
2. Query our index based on url from each result, retrieve their
corresponding id.
3. Query our index and pass the ids as elevateIds (dynamic query elevation)

This probably isn't a clean solution as it adds the overhead of an
additional query to retrieve document ids. Just wondering if there's a
better way to handle this situation, perhaps a way to combine step 2 and 3
in a single query or a different approach altogether?
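
For what it's worth, steps 2 and 3 can at least be collapsed into two round
trips by letting the terms query parser do the url-to-id lookup in bulk. A
rough SolrJ sketch; the field names "url" and "id", the collection, and the
elevateIds wiring are assumptions, and it presumes a
QueryElevationComponent is configured on the handler:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ExternalTopResults {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zoohost1:2181");
    client.setDefaultCollection("knowledge");

    // step 1 happened elsewhere: the external engine returned these urls
    List<String> externalUrls = Arrays.asList("http://a.com/1", "http://a.com/2");

    // step 2: resolve all the urls to ids in a single terms query
    SolrQuery lookup = new SolrQuery("{!terms f=url}" + String.join(",", externalUrls));
    lookup.setFields("id");
    List<String> ids = new ArrayList<>();
    for (SolrDocument doc : client.query(lookup).getResults()) {
      ids.add((String) doc.getFieldValue("id"));
    }

    // step 3: main query with the external ids pinned to the top
    SolrQuery main = new SolrQuery("user query here");
    main.set("elevateIds", String.join(",", ids));
    main.set("forceElevation", true);
    QueryResponse rsp = client.query(main);
    System.out.println(rsp.getResults());
    client.close();
  }
}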

Any pointers will be appreciated.

-Thanks,
Shamik


Re: Information on classifier based key word suggestion

2017-01-23 Thread shamik
Anyone ?





Information on classifier based key word suggestion

2017-01-19 Thread Shamik Bandopadhyay
Hi,

  I'm exploring a way to suggest keywords/tags based on a text snippet. I
have a fairly small taxonomy of product, release, category, type, etc.
stored in an in-memory database. What I'm looking for is a tool which will
analyze a given text and suggest not only the fields associated with the
taxonomy but also keywords it deems relevant to the text. The keywords can
be leveraged as a mechanism for findability of the document. As a newbie in
this area, I'm a tad overwhelmed by the different options and struggling to
find the right approach. To start with, I tried GATE, but it seems to be
limited to only providing taxonomy data, which needs to be provided as flat
text. A few people suggested using classifiers like the Naive Bayes
classifier or other machine learning tools.

I'll appreciate if anyone can provide some direction in this regard.

Thanks,
Shamik


Re: How to support facet values in search term

2016-11-22 Thread shamik
Thanks for the pointer, Alex. I'll go through all four articles; Thanksgiving
will be fun :-)





How to support facet values in search term

2016-11-22 Thread Shamik Bandopadhyay
Hi,

  I'm looking for some suggestions on enabling search terms to include
facet fields as well. In my use case, we have a bunch of product and
corresponding release fields which are explicitly used as facets. But what
we are observing is that end users tend to use the product name as part of
the search term instead of filtering the product from the facet itself.
For e.g., we have "Product A" and "Product B", each having releases 2016
and 2017. A common user search appears to be "Product A service pack".
Since Product A is not part of the search fields (typically text, title,
keyword, etc.), it's not returning any data.

We have a large set of facet fields, and I would ideally like to avoid
adding them all to the searchable list. Just wondering if there's a better
way to handle this situation. Any pointers will be appreciated.
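
One low-cost middle ground, sketched below on hypothetical field names, is
to copy the facet fields into a dedicated catch-all field at index time so
product names become matchable without every facet landing in qf:

<!-- schema sketch; "facet_text", "product" and "release" are placeholders -->
<field name="facet_text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="product" dest="facet_text"/>
<copyField source="release" dest="facet_text"/>

Then add facet_text (possibly with a low weight, e.g. facet_text^0.5) to
qf; the facet fields themselves stay untouched.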

Thanks,
Shamik


Re: SolrJ doesn't work with Json facet api

2016-10-05 Thread shamik
You can try something like :

query.add("json.facet", your_json_facet_query); 
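
For instance, a minimal sketch (the facet body and the "cat" field are
placeholders, and client is an already-built SolrClient):

SolrQuery query = new SolrQuery("*:*");
query.setRows(0);
// any JSON Facet API request can be passed through as a raw parameter
query.add("json.facet", "{categories:{type:terms,field:cat,limit:10}}");
QueryResponse rsp = client.query(query);
// JSON facet output is not parsed into FacetField objects; it lives
// under the top-level "facets" entry of the raw response
NamedList<?> facets = (NamedList<?>) rsp.getResponse().get("facets");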





RE: how to remove duplicate from search result

2016-09-27 Thread shamik
Did you take a look at the Collapsing Query Parser?

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results





Re: How to retrieve parent documents without a nested structure (block-join)

2016-09-27 Thread shamik
Thanks again Alex.

I should have clarified the use of the browse request handler: I'm
simulating the request handler parameters of my production system using
browse. I used a separate request handler and stripped down all properties
to match "select". I finally narrowed the issue down to the Minimum Match
(mm) parameter. I had specified it as "mm=100%", and that was preventing
the query from returning any results.

Once again, appreciate all your help.







Re: How to retrieve parent documents without a nested structure (block-join)

2016-09-27 Thread shamik
Sorry to bump this up, but can someone please explain the parsing behaviour
of a join query (shown above) with respect to different request handlers?





Re: How to retrieve parent documents without a nested structure (block-join)

2016-09-26 Thread shamik
Thanks Alex, this has been extremely helpful. There's one doubt though.

The query returns expected result if I use "select" or "query" request
handler, but fails for others. Here's the debug output from "/select" using
edismax.

http://localhost:8983/solr/techproducts/query?q=({!join%20from=manu_id_s%20to=id}ipod)(name:GB18030%20-manu_id_s:*)=id,title=query=xml=false=false=edismax

*(+(JoinQuery({!join from=manu_id_s to=id}text:ipod)
(name:gb18030 -manu_id_s:*)))/no_coord

+({!join from=manu_id_s to=id}text:ipod
(name:gb18030 -manu_id_s:*))
*

Now, if I use "/browse", I don't get any results back. Here's a snippet from
browse request handler config.


 
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="v.layout">layout</str>
    <str name="title">Solritas</str>

    <str name="defType">edismax</str>
    <str name="qf">
      text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
      title^10.0 description^5.0 keywords^5.0 author^2.0
      resourcename^1.0 subject^0.5
    </str>
    <str name="mm">100%</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>

As you can see, I've defined the "qf" fields with defType as edismax.

Here's the query:

http://localhost:8983/solr/techproducts/browse?q=({!join%20from=manu_id_s%20to=id}ipod)(name:GB18030%20-manu_id_s:*)=query=xml=false=false

Output:

*(+((JoinQuery({!join from=manu_id_s
to=id}text:ipod) (DisjunctionMaxQuery((keywords:name:gb18030^5.0 |
author:name:gb18030^2.0 | ((subject:name subject:gb subject:18030)~3)^0.5 |
manu:name:gb18030^1.1 | ((description:name description:gb
description:18030)~3)^5.0 | ((title:name title:gb title:18030)~3)^10.0 |
features:name:gb18030 | cat:name:GB18030^1.4 | name:name:gb18030^1.2 |
text:name:gb18030^0.5 | id:name:GB18030^10.0 | resourcename:name:gb18030 |
sku:"namegb 18030"^1.5)) -manu_id_s:*))~2))/no_coord

+(({!join from=manu_id_s to=id}text:ipod
((keywords:name:gb18030^5.0 | author:name:gb18030^2.0 | ((subject:name
subject:gb subject:18030)~3)^0.5 | manu:name:gb18030^1.1 |
((description:name description:gb description:18030)~3)^5.0 | ((title:name
title:gb title:18030)~3)^10.0 | features:name:gb18030 | cat:name:GB18030^1.4
| name:name:gb18030^1.2 | text:name:gb18030^0.5 | id:name:GB18030^10.0 |
resourcename:name:gb18030 | sku:"namegb 18030"^1.5) -manu_id_s:*))~2)*

If I remove the join query condition ({!join from=manu_id_s to=id}ipod) ,
the query returns the result based on the second condition. 

The other doubt I have is why "text" is getting picked as the default field
in the join condition. I've defined the "df" fields in "browse", and they
are being used in the second condition. Do I need to explicitly set the
default field inside the join condition?
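
One likely explanation: the query embedded inside {!join ...} is parsed by
the plain lucene parser, not by the handler's edismax settings, so it falls
back to the schema default field ("text" in techproducts) no matter what qf
/browse defines. Qualifying the field inside the join, e.g.

q=({!join from=manu_id_s to=id}name:ipod) (name:GB18030 -manu_id_s:*)

sidesteps the surprise.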

The other thing I've noticed is a difference in the parsed query if I add a
space between the two clauses. For e.g. *q=({!join from=manu_id_s
to=id}ipod) (name:GB18030 -manu_id_s:*)* results in

*(+((JoinQuery({!join from=manu_id_s
to=id}text:ipod) (name:gb18030 -manu_id_s:*))~2))/no_coord

+(({!join from=manu_id_s to=id}text:ipod
(name:gb18030 -manu_id_s:*))~2)*






Re: How to retrieve parent documents without a nested structure (block-join)

2016-09-25 Thread shamik
Thanks for getting back on this. I was trying to formulate a query along
similar lines but haven't been able to construct it (multiple clauses)
correctly so far. That can be attributed to my inexperience with Solr
queries as well. Can you please point to any documentation / example for my
reference?

Appreciate your help.





Re: How to retrieve parent documents without a nested structure (block-join)

2016-09-25 Thread shamik
Thanks Alex. With the conventional join query I'm able to return the parent
document based on a query match on the child. But, it filters out any other
documents which are outside the scope of join condition. For e.g. in my
case, I would expect the query to return :

 
<doc>
  <field name="id">1</field>
  <field name="title">Parent title</field>
  <field name="doc_id">123</field>
</doc>
<doc>
  <field name="id">4</field>
  <field name="title">Misc title2</field>
</doc>


I'm only getting back id=1 with the following join query :

*{!join from=parent_doc_id to=doc_id}title2*

Is there a way to get the document with id=4 ?
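
For the archive: one combination that returns both documents is to OR the
join with a clause that matches stand-alone docs, mirroring the
techproducts experiment elsewhere in this thread. Assuming only child
documents carry parent_doc_id:

q=({!join from=parent_doc_id to=doc_id}title:title2) (title:title2 -parent_doc_id:*)

The first clause maps matching children to their parent (id=1); the second
picks up matching docs that have no parent reference (id=4).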





How to retrieve parent documents without a nested structure (block-join)

2016-09-22 Thread Shamik Bandopadhyay
Hi,

  I have a set of documents indexed which have a pseudo parent-child
relationship. Each child document has a reference to the parent document
through an ID. As the documents are not available to the crawler in order,
I'm not able to index them in a nested structure to support block-join.
Here's an example of a dataset in the index right now.


<doc>
  <field name="id">1</field>
  <field name="title">Parent title</field>
  <field name="doc_id">123</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="title">Child title1</field>
  <field name="parent_doc_id">123</field>
</doc>
<doc>
  <field name="id">3</field>
  <field name="title">Child title2</field>
  <field name="parent_doc_id">123</field>
</doc>
<doc>
  <field name="id">4</field>
  <field name="title">Misc title2</field>
</doc>


As per my requirement, if I search on "title2", the result should bring
back the following: the parent document (id=1) and the non-related document
(id=4).


<doc>
  <field name="id">1</field>
  <field name="title">Parent title</field>
  <field name="doc_id">123</field>
</doc>
<doc>
  <field name="id">4</field>
  <field name="title">Misc title2</field>
</doc>


This is similar to the Block Join Parent Query Parser, where I could have
fired a query like: q={!parent
which="content_type:parentDocument"}title:title2

Not sure if the Graph Query Parser can be a relevant solution in this
regard. The problem I see there is that I'm running on 5.5 with 2 shards and
n replicas, and the graph query parser seems to be designed for a
single node/single shard.

This is a tad urgent for me as I'm trying to come up with an approach to
deal with this. Any pointers will be highly appreciated.

Thanks,
Shamik


Re: Inventor-template vs Inventor template - issue with hyphen

2016-08-26 Thread shamik
Anyone ?





Re: Inventor-template vs Inventor template - issue with hyphen

2016-08-25 Thread shamik
Thanks Erick. I did look into the analysis tool and debug query and posted
the results in my post. WDF is correctly stripping the "-" from
Inventor-template; both terms get broken down to "inventor templat".
But I'm not sure why the query construct differs at query time. Here's the
parsed query:

*Inventor-template*


(+DisjunctionMaxQuery(((+CommandSrch:inventor +CommandSrch:templat) |
text:"inventor templat"^1.5 | Description:"inventor templat"^2.0 |
title:"inventor templat"^3.5 | keywords:"inventor templat"^1.2)~0.01)
Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(147216960),date(PublishDate)))+1.0)))/no_coord



+((+CommandSrch:inventor +CommandSrch:templat) | text:"inventor templat"^1.5
| Description:"inventor templat"^2.0 | title:"inventor templat"^3.5 |
keywords:"inventor templat"^1.2)~0.01 Source2:sfdcarticles^9.0
Source2:downloads^5.0 
1.0/(3.16E-11*float(ms(const(147216960),date(PublishDate)))+1.0)


*Inventor template*


(+(+DisjunctionMaxQuery((CommandSrch:inventor | text:inventor^1.5 |
Description:inventor^2.0 | title:inventor^3.5 | keywords:inventor^1.2)~0.01)
+DisjunctionMaxQuery((CommandSrch:templat | text:templat^1.5 |
Description:templat^2.0 | title:templat^3.5 | keywords:templat^1.2)~0.01))
Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(147216960),date(PublishDate)))+1.0)))/no_coord



+(+(CommandSrch:inventor | text:inventor^1.5 | Description:inventor^2.0 |
title:inventor^3.5 | keywords:inventor^1.2)~0.01 +(CommandSrch:templat |
text:templat^1.5 | Description:templat^2.0 | title:templat^3.5 |
keywords:templat^1.2)~0.01) Source2:sfdcarticles^9.0 Source2:downloads^5.0 
1.0/(3.16E-11*float(ms(const(147216960),date(PublishDate)))+1.0)


The part I'm confused about is why the two queries are being interpreted
differently.

Thanks,
Shamik





Inventor-template vs Inventor template - issue with hyphen

2016-08-25 Thread Shamik Bandopadhyay
Hi,

  I'm trying to figure out search behaviour related to similar terms, one
with and without the hyphen. The two generate different result sets; the
search without the hyphen brings back more results than the other. (The
fieldType definition was stripped by the archive; per the analysis below,
its chain includes WordDelimiterFilter, lowercasing, and a stemmer.)
If I run the search terms through the analyzer, the final indexed data for
both terms (with and without the hyphen) comes out as --> *inventor templat*

I was under the impression that, given my analyzers, both search terms
would produce the same result.

Here's the output from debug and splainer.

*Inventor-template*
*-*

(+DisjunctionMaxQuery(((+CommandSrch:inventor
+CommandSrch:templat) | text:"inventor templat"^1.5 | Description:"inventor
templat"^2.0 | title:"inventor templat"^3.5 | keywords:"inventor
templat"^1.2)~0.01) Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)))/no_coord

+((+CommandSrch:inventor
+CommandSrch:templat) | text:"inventor templat"^1.5 | Description:"inventor
templat"^2.0 | title:"inventor templat"^3.5 | keywords:"inventor
templat"^1.2)~0.01
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)

From Splainer:

10.974786 Sum of the following:
 9.203462 Dismax (max plus:0.01 times others)
   9.198681 title:"inventor templat"

   0.4781131 text:"inventor templat"

 1.7644342 Source2:sfdcarticles

 0.006889837 
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)


*Inventor template*
*--*

(+(+DisjunctionMaxQuery((CommandSrch:inventor |
text:inventor^1.5 | Description:inventor^2.0 | title:inventor^3.5 |
keywords:inventor^1.2)~0.01) +DisjunctionMaxQuery((CommandSrch:templat |
text:templat^1.5 | Description:templat^2.0 | title:templat^3.5 |
keywords:templat^1.2)~0.01)) Source2:sfdcarticles^9.0 Source2:downloads^5.0
FunctionQuery(1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)))/no_coord

+(+(CommandSrch:inventor |
text:inventor^1.5 | Description:inventor^2.0 | title:inventor^3.5 |
keywords:inventor^1.2)~0.01 +(CommandSrch:templat | text:templat^1.5 |
Description:templat^2.0 | title:templat^3.5 | keywords:templat^1.2)~0.01)
Source2:sfdcarticles^9.0 Source2:downloads^5.0
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)

From Splainer:

9.915069 Sum of the following:
 5.03947 Dismax (max plus:0.01 times others)
   5.038846 title:templat

   0.062400598 text:templat

 4.767776 Dismax (max plus:0.01 times others)
   4.7674117 title:inventor

   0.03642158 text:inventor

 0.098686054 Source2:CloudHelp

 0.009136423
1.0/(3.16E-11*float(ms(const(147208320),date(PublishDate)))+1.0)


I'm using edismax.


Just wondering what I'm missing here. Any help will be appreciated.

Regards,
Shamik


Re: [ANN] Relevant Search by Manning out! (Thanks Solr community!)

2016-06-23 Thread shamik
Thanks for all the pointers. With 50% discount, picking a copy is a
no-brainer 





Re: [ANN] Relevant Search by Manning out! (Thanks Solr community!)

2016-06-23 Thread shamik
Hi Doug,

Congratulations on the release; I guess a lot of us have been eagerly
waiting for this. Just one quick clarification. You mentioned that the
examples in your book are executed against elasticsearch. For someone
familiar with Solr, will it be an issue to run those examples in a Solr
instance instead ?

-Thanks





Re: Multiple context field / filters in Solr suggester

2016-06-22 Thread shamik
Anyone ?





Multiple context field / filters in Solr suggester

2016-06-21 Thread Shamik Bandopadhyay
Hi,

  Just trying to understand if Solr suggester supports multiple filtering
through the "contextField" option. As shown in the config below, is it
possible to have two contextFields defined where I can use  "cat" and
"manu" as filtering criteria on the suggested result ?


  
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="weightField">price</str>
    <str name="contextField">cat</str>
    <str name="contextField">manu</str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
  


The only reference around this seemed to be SOLR-7888, but based on my
understanding, it talks about boolean support for a given context.

Any pointers will be appreciated.

Thanks,
Shamik


Re: Solrj Basic Authentication randomly failing - "request has come without principal"

2016-05-18 Thread shamik
anyone ?





Solrj Basic Authentication randomly failing - request has come without principal

2016-05-17 Thread Shamik Bandopadhyay
Hi,

  I'm facing an issue where SolrJ calls are randomly failing on basic
authentication. Here's the exception:

ERROR923629[qtp466002798-20] -
org.apache.solr.security.PKIAuthenticationPlugin.doAuthenticate(PKIAuthenticationPlugin.java:125)
- Invalid key
 INFO923630[qtp466002798-20] -
org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:144)
- request has come without principal. failed permission
org.apache.solr.security.RuleBasedAuthorizationPlugin$Permission@1a343033
INFO923630[qtp466002798-20] -
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:429) -
USER_REQUIRED auth header null context : userPrincipal: [null] type:
[READ], collections: [knowledge,], Path: [/select] path : /select params
:df=text=false=/select=false=id=score=4=0=true=
http://xx.xxx.x.222:8983/solr/knowledge/|http://xx.xxx.xxx.246:8983/solr/knowledge/=3=2=*:*=1463512962899=true=javabin

Here's my security.json. I've protected "browse" and "select" request
handler for my queries.

{
  "authentication": {
"blockUnknown": false,
"class": "solr.BasicAuthPlugin",
"credentials": {
  "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
}
  },
  "authorization": {
"class": "solr.RuleBasedAuthorizationPlugin",
"user-role": {
  "solr": "admin",
  "solradmin": "admin",
  "beehive": "dev",
  "readuser": "read"
},
"permissions": [
  {
"name": "security-edit",
"role": "admin"
  },
  {
"name": "browse",
"collection": "knowledge",
"path": "/browse",
"role": [
  "admin",
  "dev",
  "read"
]
  },
  {
"name": "select",
"collection": "knowledge",
"path": "/select",
"role": [
  "admin",
  "dev",
  "read"
]
  },
  {
"name": "admin-ui",
"path": "/",
"role": [
  "admin",
  "dev"
]
  },
  {
"name": "update",
"role": [
  "admin",
  "dev"
]
  },
  {
"name": "collection-admin-edit",
"role": [
  "admin"
]
  },
  {
"name": "schema-edit",
"role": [
  "admin"
]
  },
  {
"name": "config-edit",
"role": [
  "admin"
]
  }
]
  }
}

Here's my sample code:

SolrClient client = new
    CloudSolrClient("zoohost1:2181,zoohost2:2181,zoohost3:2181");
((CloudSolrClient) client).setDefaultCollection(DEFAULT_COLLECTION);
ModifiableSolrParams param = getSearchSolrQuery();
SolrRequest solrRequest = new QueryRequest(param);
solrRequest.setBasicAuthCredentials(USER, PASSWORD);
try {
    for (int j = 0; j < 20; j++) {
        // the same request intermittently fails here with a 401
        NamedList<Object> results = client.request(solrRequest);
    }
} catch (Exception ex) {
}

private static ModifiableSolrParams getSearchSolrQuery() {
    ModifiableSolrParams solrParams = new ModifiableSolrParams();
    solrParams.set("q", "*:*");
    solrParams.set("qt", "/select");
    solrParams.set("rows", "3");
    return solrParams;
}

The query sometimes returns results, but fails probably half of the time;
there's no pattern though. This applies to any request handler specified in
security.json.

Looks like the SolrRequest loses the user/password in flight.

Here's the exception received at the SolrJ client:

org.apache.solr.common.SolrException.log(SolrException.java:148) -
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://xx.xxx.xxx.134:8983/solr/knowledge: Expected mime
type application/octet-stream but got text/html. 


Error 401 Unauthorized request, Response code: 401

HTTP ERROR 401
Problem accessing /solr/knowledge/select. Reason:
Unauthorized request, Response code: 401
Powered by Jetty://




at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
at
org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:372)
at
org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:325)
at
org.apache.solr.handler.component.HttpShardHandlerFactory.makeLoadBalancedRequest(HttpShardHandlerFactory.java:246)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:201)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:163)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at

set-property API doesn't work for security.json authentication

2016-05-12 Thread Shamik Bandopadhyay
Hi,

  I'm trying to update the set-property option in security.json
authentication section. As per the documentation,

"Set  arbitrary properties for authentication plugin. The only supported
property is 'blockUnknown'"

https://cwiki.apache.org/confluence/display/solr/Basic+Authentication+Plugin

Calling the service throws the following exception:
"errorMessages": [
{
  "set-property": {
"blockUnknown": true
  },
  "errorMessages": [
"Unknown operation 'set-property' "
  ]
}
  ]

Is this a bug or the API is not supported?


Re: Solrj API with Basic Authentication

2016-05-11 Thread shamik
Ok, I found another way of doing it which will preserve the QueryResponse
object. I've used DefaultHttpClient, set the credentials and finally passed
it as a constructor to the CloudSolrClient.

*DefaultHttpClient httpclient = new DefaultHttpClient();
UsernamePasswordCredentials defaultcreds =
    new UsernamePasswordCredentials(USER, PASSWORD);
httpclient.getCredentialsProvider().setCredentials(AuthScope.ANY,
    defaultcreds);
SolrClient client = new CloudSolrClient("127.0.0.1:9983", httpclient);*
((CloudSolrClient) client).setDefaultCollection("gettingstarted");
ModifiableSolrParams param = getSearchSolrQuery();
try {
    QueryResponse res = client.query(param);
    // facets
    List<FacetField> fieldFacets = res.getFacetFields();
    // results
    SolrDocumentList docs = res.getResults();
    // spelling
    SpellCheckResponse spellCheckResponse = res.getSpellCheckResponse();
} catch (Exception ex) {
    ex.printStackTrace();
} finally {
    try {
        client.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Just wanted to know if this is recommended ?
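
Passing a pre-built HttpClient this way does work; a variant that avoids
the deprecated DefaultHttpClient, sketched on the assumption of a 5.x SolrJ
where these property names exist, lets HttpClientUtil assemble the client:

import org.apache.http.client.HttpClient;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.HttpClientUtil;
import org.apache.solr.common.params.ModifiableSolrParams;

ModifiableSolrParams httpParams = new ModifiableSolrParams();
httpParams.set(HttpClientUtil.PROP_BASIC_AUTH_USER, "solr");
httpParams.set(HttpClientUtil.PROP_BASIC_AUTH_PASS, "SolrRocks");
HttpClient httpClient = HttpClientUtil.createClient(httpParams);
SolrClient client = new CloudSolrClient("127.0.0.1:9983", httpClient);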





Solrj API with Basic Authentication

2016-05-11 Thread Shamik Bandopadhyay
Hi,

  I'm looking into the option of adding basic authentication using Solrj
API. Currently, I'm using the following code for querying Solr.

SolrClient client = new CloudSolrClient("127.0.0.1:9983");
((CloudSolrClient) client).setDefaultCollection("gettingstarted");
ModifiableSolrParams param = getSearchSolrQuery();
try {
    QueryResponse res = client.query(param);
    // facets
    List<FacetField> fieldFacets = res.getFacetFields();
    // results
    SolrDocumentList docs = res.getResults();
    // spelling
    SpellCheckResponse spellCheckResponse = res.getSpellCheckResponse();
} catch (Exception ex) {
    ex.printStackTrace();
} finally {
    try {
        client.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

The QueryReponse object is well-constructed and provides clean APIs to
parse the result.

Now, to use the basic authentication, we need to use a SolrRequest object
instead.

SolrClient client = new CloudSolrClient("127.0.0.1:9983");
((CloudSolrClient) client).setDefaultCollection("gettingstarted");
ModifiableSolrParams param = getSearchSolrQuery();
SolrRequest solrRequest = new QueryRequest(param);
solrRequest.setBasicAuthCredentials(USER, PASSWORD);
try {
    NamedList<Object> results = client.request(solrRequest);
    for (int i = 0; i < results.size(); i++) {
        System.out.println("RESULTS: " + i + " " + results.getName(i)
            + " : " + results.getVal(i));
    }
} catch (Exception ex) {
    ex.printStackTrace();
} finally {
    try {
        client.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Since my existing APIs use QueryResponse res = client.query(param), moving
to NamedList<Object> results = client.request(solrRequest) translates to a
bunch of code changes. Moreover, by using SolrRequest, I lose the delete
and getById methods, which don't accept a SolrRequest object.

Just wondering if there's another way to use Basic Authentication , perhaps
setting at the ModifiableSolrParams level. Ideally, i would like to retain
QueryResponse or UpdateResponse objects instead.

Any pointers will be appreciated.

-Thanks,
Shamik


Re: Issues with Authentication / Role based authorization

2016-05-11 Thread shamik
Brian,

  Thanks for your reply. My first post was a bit convoluted; I tried to
explain the issue in the subsequent post. Here's a security JSON. I've
assigned solr and beehive the admin role, which allows them access to
"update" and "read". This works as expected. I added a new role
"browseRole" in order to restrict certain users to only have access to
browse on the gettingstarted collection.

  "authorization.enabled": true,
  "authorization": {
"class": "solr.RuleBasedAuthorizationPlugin",
"user-role": {
  "solr": "admin",
  "beehive": [
"admin"
  ],
  "dev": [
"browseRole"
  ]
},
"permissions": [
  {
"name": "update",
"role": "admin"
  },
  {
"name": "read",
"role": "admin"
  },
  {
"name": "browse",
"collection": "gettingstarted",
"path": "/browse",
"role": "browseRole"
  }
],
"": {
  "v": 6
}
  }
}

But when I log in as "dev", I seem to have similar access to "solr" and
"beehive": "dev" can add/delete data, create collections, etc. Will the
order of the permissions matter here even though "dev" is assigned to a
specific role?







Re: Issues with Authentication / Role based authorization

2016-05-11 Thread shamik
Anyone ?





Re: Issues with Authentication / Role based authorization

2016-05-11 Thread shamik
pl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:372)
at
org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:325)
at
org.apache.solr.handler.component.HttpShardHandlerFactory.makeLoadBalancedRequest(HttpShardHandlerFactory.java:246)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:201)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:163)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

This changed the authorization realm for some reason. If I log back in as
"solr" or "superuser", I can no longer access the request handlers, which
was possible before adding the two new roles, i.e. "browseRole" and
"selectRole". I went back and assigned "superuser" to these roles; only
after that was it able to access the request handlers, though with the
above exceptions.

Here's authentication :

{
  "responseHeader": {
"status": 0,
"QTime": 0
  },
  "authentication.enabled": true,
  "authentication": {
"blockUnknown": true,
"class": "solr.BasicAuthPlugin",
"credentials": {
  "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c=",
  "superuser": "SOkYlwKY6aW0Tr31o9xE3etyR6XHNtxw2fSY80s1CZs=
LFOQr7kQefru9L/F/l3ORPiJNzMGmS5xzVcxcYE5GL0=",
  "beehive": "NRWjSrEYDEh3ZrIVKV/3GvVT46rMxRLXI0cmyAD132E=
vUg7DcwOj4hMGRi8Fjya4guhuz7L1dM8HvvXKzVHI8M="
},
"": {
  "v": 2
}
  }
}

And authorization:
{
  "responseHeader": {
"status": 0,
"QTime": 0
  },
  "authorization.enabled": true,
  "authorization": {
"class": "solr.RuleBasedAuthorizationPlugin",
"user-role": {
  "solr": "admin",
  "superuser": [
"browseRole",
"selectRole"
  ],
  "beehive": [
"browseRole",
"selectRole"
  ]
},
"permissions": [
  {
"name": "security-edit",
"role": "admin"
  },
  {
"name": "select",
"collection": "gettingstarted",
"path": "/select/*",
"role": "selectRole"
  },
  {
"name": "browse",
"collection": "gettingstarted",
"path": "/browse",
"role": "browseRole"
  }
],
"": {
  "v": 7
}
  }
}

I was under the impression that these roles are independent of each other
and that, based on the assignment, individual users should be able to
access their respective areas. On a related note, I was not able to make
roles like "all" and "read" work.

Not sure what I'm doing wrong here. Any feedback will be appreciated.

Thanks,
Shamik






How do we generate SHA256 password for Authentication

2016-05-10 Thread Shamik Bandopadhyay
Hi,

  I'm trying to setup Authentication and Role-based authorization in Solr
5.5. Beside "Solr" user from example, I've created another user "dev". I've
used the following website to generate sha256 encoded password.

http://www.lorem-ipsum.co.uk/hasher.php

I've used password as "password" .

Here's my security.json

{
  "authentication": {
"blockUnknown": false,
"class": "solr.BasicAuthPlugin",
"credentials": {
  "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c=",
  "dev":"
5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8",
}
  },
  "authorization": {
"class": "solr.RuleBasedAuthorizationPlugin",
"permissions": [
  {
"name": "security-edit",
"role": "admin"
  },
  {
"name": "schema-edit",
"role": "admin"
  },
  {
"name": "config-edit",
"role": "admin"
  },
  {
"name": "collection-admin-edit",
"role": "admin"
  },
  {
"name": "all-admin",
"collection": null,
"path": "/*",
"role": "adminAllRole"
  },
  {
"name": "all-core-handlers",
"path": "/*",
"role": "adminAllHandler"
  },
  {
"name": "update",
"role": "updateRole"
  },
  {
"name": "read",
"role": "readRole"
  },
  {
"name": "browse",
"collection": "gettingstarted",
"path": "/browse",
"role": "browseRole"
  },
  {
"name": "select",
"collection": "gettingstarted",
"path": "/select/*",
"role": "selectRole"
  }
],
"user-role": {
  "solr": [
"admin",
"adminAllRole",
"adminAllHandler",
"updateRole"
  ],
  "dev": [
"readRole"
  ]
}
  }
}

Here's what I'm doing.
1. I started Solr in Cloud mode "solr start -e cloud -noprompt"
2. zkcli.bat -zkhost localhost:9983 -cmd putfile /security.json
security.json
3. tried http://localhost:8983/solr/gettingstarted/browse , provided
dev/password but I'm getting the following exception:

[c:gettingstarted s:shard2 r:core_node3 x:gettingstarted_shard2_replica2]
org.apache.solr.servlet.HttpSolrCall; USER_REQUIRED auth header Basic
c29scjpTb2xyUm9ja3M= context : userPrincipal: [[principal: solr]] type:
[UNKNOWN], collections: [gettingstarted,], Path: [/browse] path : /browse
params :

Looks like I'm using the wrong way of generating the password.
solr/SolrRocks works as expected.
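
For reference, Solr's BasicAuthPlugin does not store a plain hex sha256 of
the password (which is what most online hashers emit); my reading of
Sha256AuthenticationProvider is that it stores
base64(sha256(sha256(salt + password))) followed by base64(salt), so treat
the sketch below as that assumption:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

public class SolrCredential {
  public static void main(String[] args) throws Exception {
    String password = "password";
    byte[] salt = new byte[32];
    new SecureRandom().nextBytes(salt);

    MessageDigest md = MessageDigest.getInstance("SHA-256");
    md.update(salt);
    byte[] hash = md.digest(password.getBytes(StandardCharsets.UTF_8)); // sha256(salt + password)
    md.reset();
    hash = md.digest(hash);                                             // hashed a second time

    // paste the output into security.json as "dev": "<hash> <salt>"
    System.out.println(Base64.getEncoder().encodeToString(hash) + " "
        + Base64.getEncoder().encodeToString(salt));
  }
}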

Also, not sure what's wrong with the "readRole". It doesn't seem to work
when I try with user "solr".

Any pointers will be appreciated.

-Thanks,
Shamik


Return only parent on child query match (w/o block-join)

2016-04-19 Thread Shamik Bandopadhyay
Hi,

   I have a set of documents indexed which have a pseudo parent-child
relationship. Each child document has a reference to the parent document.
Due to document availability complexity (and the condition of updating both
parent and child documents at indexing time), I'm not able to use explicit
block-join. Instead of a nested structure, they are all flat. Here's an
example:


<doc>
  <field name="id">1</field>
  <field name="title">Parent title</field>
  <field name="doc_id">123</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="title">Child title1</field>
  <field name="parent_doc_id">123</field>
</doc>
<doc>
  <field name="id">3</field>
  <field name="title">Child title2</field>
  <field name="parent_doc_id">123</field>
</doc>
<doc>
  <field name="id">4</field>
  <field name="title">Misc title2</field>
</doc>


What I'm looking for is: if I search "title2", the result should bring back
the following two docs, one matching the parent and one based on a regular
match.


<doc>
  <field name="id">1</field>
  <field name="title">Parent title</field>
  <field name="doc_id">123</field>
</doc>
<doc>
  <field name="id">4</field>
  <field name="title">Misc title2</field>
</doc>


With block-join support, I could have used the Block Join Parent Query
Parser: q={!parent which="content_type:parentDocument"}title:title2

Transforming result documents is an alternative, but ChildDocTransformerFactory
only supports the reverse direction (returning children with their parents).

Just wondering if there's a way to address this query differently. Any
pointers will be appreciated.

-Thanks,
Shamik


Re: MLT Query Parser

2016-04-07 Thread shamik
Thanks Shawn and Alessandro. I get why the id is needed; I was trying to
compare with the "mlt" request handler, which doesn't enforce such a
constraint. My previous example of title/keyword was not the right one, but
I do have fields which are unique to each document and can be used as a key
to extract similar content. I don't think we can always have a handle to
the document id in every scenario. In my case, it's a composite id and I
don't pass it back and forth as part of the search results. For e.g., when
I'm trying to get similar content for a specific forum thread, I could very
well use the threadId field stored in Solr (unique to each document) to
generate similar content. This works great using the "mlt" request handler.
I was expecting the query parser to have a similar capability.
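
For the archive, the constraint can at least be worked around with one
cheap extra lookup. A SolrJ sketch; threadId, keywords, and the uniqueKey
field id are assumptions, and client is an already-built SolrClient:

// step 1: resolve the business key to the document id
SolrQuery lookup = new SolrQuery("threadId:12345");
lookup.setFields("id");
String docId = (String) client.query(lookup).getResults().get(0).getFieldValue("id");

// step 2: hand that id to the MLT query parser
SolrQuery mlt = new SolrQuery("{!mlt qf=keywords}" + docId);
QueryResponse similar = client.query(mlt);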






Re: MLT Query Parser

2016-04-06 Thread shamik
Thanks Alessandro, that answers my doubt. In a nutshell, to make the MLT
query parser work, you need to know the document id. I'm just curious why
this constraint was added; it will not work for a bulk of use cases. For
e.g., if we are trying to generate MLT based on a text or a keyword, how
would I ever use this API? My initial impression was that it was designed
to work in a distributed mode.

Now, this raises a follow-up question: which one is the right approach in
SolrCloud mode? The "mlt" request handler is out of the equation since it's
not supported. That leaves the MoreLikeThisComponent, which has a known
performance issue. Is that the only available solution then?





MLT Query Parser

2016-04-05 Thread Shamik Bandopadhyay
Hi,

  I'm trying to use the new MLT query parser in a SolrCloud mode. As per
the documentation, here's the syntax,

{!mlt qf=name}1

where "1" is the id.

What I'm trying to understand is whether "id" is a mandatory field for
making this work. Right now, I'm getting MLT documents based on a "keyword"
field. With the new query parser, I'm not able to see a way to use another
field except for id. Is this a constraint, or is there a different syntax?

Any pointers will be appreciated.

Thanks,
Shamik


Solr 5.5 error at startup - ClassNotFoundException: org.simpleframework.xml.core.Persister

2016-03-19 Thread Shamik Bandopadhyay
rCore.java:800)
... 10 more
Caused by: java.lang.ClassNotFoundException:
org.simpleframework.xml.core.Persister
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at
org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:450)
at
org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:403)
... 15 more

Looks like it's missing a reference to the simple-xml jar. I've been using
Solr 5.0 till now and never saw this error on startup.

Do I need to copy this jar in lib?

Any pointers will be appreciated.

-Thanks,
Shamik


Error starting solr 5.5 - Cannot open solr.log:No such file or directory

2016-03-19 Thread Shamik Bandopadhyay
Hi,

  I'm trying to upgrade from Solr 5.0 to 5.5. I'm getting the following
error:

tail: cannot open `/mnt/ebs2/solrhome/logs/solr.log' for reading: No such
file or directory

I'm running on CentOS 6.7. The same startup script has been working fine
for 5.0 till now. I'm executing as user "solr". In the logs directory, it's
able to create solr-8983-console.log and solr_gc.log, so it's not an issue
with permission. Does it have to do with JDK version? I'm running the
latest oracle java.

java version "1.8.0_73"
Java(TM) SE Runtime Environment (build 1.8.0_73-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode)

I earlier tried 1.8.0_31 but with no luck. Apparently, Solr 5.0 was running
on 1.8.0_31 without any issue.

Here's my startup parameter:

server -Xss256k -Xms512m -Xmx6144m -XX:NewRatio=3 -XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ConcGCThreads=4
-XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark
-XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
-XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -verbose:gc
-XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/mnt/ebs2/solrhome/logs/solr_gc.log -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.port=18983
-Dcom.sun.management.jmxremote.rmi.port=18983
-Djava.rmi.server.hostname=54.176.219.134  -DzkClientTimeout=45000
-DzkHost=zoohost1:2181,zoohost2:2181,zoohost3:2181
-Dbootstrap_confdir=./solr/knowledge/conf -Dcollection.configName=myconf
-DnumShards=2 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Dhost=54.176.219.134
-Djetty.port=8983 -Dsolr.solr.home=/mnt/ebs2/solrhome/solr
-Dsolr.install.dir=/mnt/ebs2/solrhome -Duser.timezone=UTC
-Djava.net.preferIPv4Stack=true
-Dlog4j.configuration=file:/mnt/ebs2/solrhome/log4j.properties
 -Dsolr.autoCommit.maxTime=6 -Dsolr.clustering.enabled=true

Not sure what's going wrong. Any pointers will be appreciated.

-Thanks,
Shamik


Re: Solr Cloud sharding strategy

2016-03-07 Thread shamik
Thanks Erick and Walter, this is extremely insightful. One last follow-up
question on composite routing. I'm trying to get a better understanding of
index distribution. If I use language as a prefix, SolrCloud guarantees that
same-language content will be routed to the same shard. What I'm curious to
know is how the rest of the data is distributed across the remaining shards.
For e.g., I have the following composite keys:

enu!doc1
enu!doc2
deu!doc3
deu!doc4
esp!doc5
chs!doc6

If I have 2 shards in the cluster, will SolrCloud try to distribute the
above data evenly? Is it possible that enu will be routed to shard1 while
deu goes to shard2, with esp and chs indexed in either of them? Or could
all of them potentially end up in the same shard, either 1 or 2, leaving
one shard under-utilized?





Re: Solr Cloud sharding strategy

2016-03-07 Thread shamik
Thanks a lot, Erick. You are right, it's a tad small with around 20 million
documents, but the growth projection is around 50 million in the next 6-8
months. It'll continue to grow after that, though maybe not at the same
rate. From the index size point of view, it can grow up to half a TB from
its current state. Honestly, my perception of a "big" index is still vague
:-) . All I'm trying to make sure is that the decision I take is scalable
in the long term and will sustain the growth without compromising
performance.





Solr Cloud sharding strategy

2016-03-07 Thread Shamik Bandopadhyay
Hi,

  I'm trying to figure the best way to design/allocate shards for our Solr
Cloud environment.Our current index has around 20 million documents, in 10
languages. Around 25-30% of the content is in English. Rest are almost
equally distributed among the remaining 13 languages. Till now, we had to
deal with query time deduplication using collapsing parser  for which we
used multi-level composite routing. But due to that, documents were
disproportionately distributed across 3 shards. The shard containing the
duplicate data ended up hosting 80% of the index. For e.g. Shard1 had a
30gb index while Shard2 and Shard3 10gb each. The composite key is
currently made of "language!dedup_id!url" . At query time, we are using
shard.keys=language/8! for three level routing.

Due to performance overhead, we decided to move the de-duplication logic to
index time, which made the composite routing redundant. We are not
discarding the duplicate content, so there's no change in index size.
Before I update the routing key, I just wanted to check what the best
approach to the sharding architecture would be so that we get optimal
performance.
We've currently have 3 shards wth 2 replicas each. The entire index resides
in one single collection. What I'm trying to understand is whether:

1. We let Solr use simple document routing based on id and route the
documents to any of the 3 shards
2. We create a composite id using language, e.g. language!unique_id, and
make sure that the same language content will always be in the same shard.
What I'm not sure of is whether the index will be equally distributed
across the three shards.
3. Index English only content to a dedicated shard, rest equally
distributed to the remaining two. I'm not sure if that's possible.
4. Create a dedicated collection for English and one for rest of the
languages.

Any pointers on this will be highly appreciated.

Regards,
Shamik


Re: understand scoring

2016-03-01 Thread shamik
Doug, do we have a date for the hard-copy launch?





Re: docValues error

2016-02-28 Thread shamik
David, this is a tad weird. I've seen this error if you turn on docValues
for an existing field. You can try running an "optimize" on your index and
see if it helps.





Re: Query time de-boost

2016-02-28 Thread shamik
I tried the function query route, but getting a weird exception.

*bf=if(termfreq(ContentGroup,'Developer Doc'),-20,0)* throws an exception
*org.apache.solr.search.SyntaxError: Missing end quote for string at pos 29
str='if(termfreq(ContentGroup,'Developer'*. Does it only accept a single
word, or is there something wrong with the syntax? It seems to work if I
only use 'Developer' as a single term.

I was trying to explore the negative boost route. From the documentation, 

"Negative query boosts have been supported at the "Query" object level for a
long time (resulting in negative scores for matching documents). Now the
QueryParsers have been updated to handle this too." 

I'm struggling to figure out the usage for this. To me, it seems to have
the same effect as a boost query if I use either

*(*:* -ContentGroup:"Developer")^99*  or  *ContentGroup-local:Developer^-99*

But both cannot be used in conjunction with other bq parameters.

*bq=Source:simplecontent^10 Source:Help^20 ContentGroup-local:Developer^-99*
doesn't work, which I thought it would.
doesn't work, which I thought would be.







Re: Query time de-boost

2016-02-26 Thread shamik
Thanks Walter, I've tried this earlier and it works. But the problem in my
case is that I have boosting on a few Source parameters as well. My ideal
"bq" would look like this:

 *bq=Source:simplecontent^10 Source:Help^20 (*:*
-ContentGroup-local:("Developer"))^99*

But this is not going to work.

I'm working on the functional query side to see if this can be done.





Re: Query time de-boost

2016-02-25 Thread shamik
Emir, I don't think Solr supports a negative boosting *^-99* syntax like
this. I can certainly do something like:

bq=(*:* -ContentGroup:"Developer's Documentation")^99 , but then I can't
have my other bq parameters.

This doesn't work --> bq=Source:simplecontent^10 Source:Help^20 (*:*
-ContentGroup:"Developer's Documentation")^99

Are you sure something like *bq=ContentGroup-local:Developer^-99* worked
for you?





Re: Query time de-boost

2016-02-24 Thread shamik
Binoy, 0.1 is still a positive boost. With title getting the highest weight,
this won't make any difference. I've tried this as well.





Re: Query time de-boost

2016-02-24 Thread shamik
Hi Emir,

I have a bunch of contentgroup values, so boosting them individually is
cumbersome. I have boosting on the query fields

qf=text^6 title^15 IndexTerm^8

and 

bq=Source:simplecontent^10 Source:Help^20
(-ContentGroup-local:("Developer"))^99

I was hoping *(-ContentGroup-local:("Developer"))^99* would implicitly
boost the rest, but that didn't happen.

I'm using edismax.







Query time de-boost

2016-02-23 Thread Shamik Bandopadhyay
Hi,

  I'm looking into the possibility of de-boosting a set of documents at
query time. In my application, when I search for e.g. "preferences", I want
to de-boost content tagged with ContentGroup:"Developer", or in other
words, push that content down in the ranking. Here's the catch: I have the
following weights.

text^1.5 title^4 IndexTerm^2

As you can see, Title has a higher weight.

Now, a bunch of content tagged with ContentGroup:"Developer" consists of
titles like "Preferences.material", "Preferences Property", or
"Preferences.graphics". The boost on title pushes these documents to the
top.

What I'm looking to see is whether there's a way to de-boost all documents
tagged with ContentGroup:"Developer", irrespective of whether the term
occurs in text or title.
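
One edismax option that sidesteps additive negative boosts entirely is a
multiplicative boost below 1. A sketch, assuming ContentGroup is a
single-token indexed field (termfreq matches one raw indexed term, which is
also why the multi-word argument elsewhere in this thread trips the parser):

boost=if(termfreq(ContentGroup,'Developer'),0.2,1)

Documents tagged Developer get their whole score scaled down by 0.2;
everything else is untouched, so the other bq/bf clauses need no change.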

Any pointers will be appreciated.

Thanks,
Shamik


Re: Question on index time de-duplication

2015-11-01 Thread shamik
That's what I observed as well. Perhaps there's a way to customize
SignatureUpdateProcessorFactory to support my use case. I'll look into the
source code and figure out if there's a way to do it.





RE: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks Markus. I've been using field collapsing till now, but the
performance constraint is forcing me to think about index-time
de-duplication. I've been using a composite router to make sure that
duplicate documents are routed to the same shard. Won't that work for
SignatureUpdateProcessorFactory?





Re: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks Scott. I could directly use field collapsing on the adskdedup field
without the signature field. The problem with field collapsing is the
performance overhead: it slows the query down nearly tenfold.
CollapsingQParserPlugin is a better option; unfortunately, it doesn't
support an ngroups equivalent, which is a requirement for me.





Re: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks for your reply. Have you customized SignatureUpdateProcessorFactory,
or are you using the configuration out of the box? I know it works for
simple dedup, but my requirement is a tad different, as I need to tag an
identifier to the latest document. My goal is to understand if that's
possible using SignatureUpdateProcessorFactory.





Question on index time de-duplication

2015-10-29 Thread Shamik Bandopadhyay
Hi,

  I'm looking at customizing index-time de-duplication. Here's my use case
and what I'm trying to achieve.

I have identical documents coming from different release years of a given
product. I need to index them all in Solr, as they are required in
individual year contexts. But there's a generic search which spans all the
years and hence brings back duplicate/identical content. My goal is to
return only the latest document and filter out the rest. For e.g., if
product A has identical documents for 2015, 2014 and 2013, search should
only return 2015 (the latest document) and filter out the rest.

What I'm thinking (if possible) during index time:

Index all documents, but add a special tag (e.g. dedup=true) to the 2013
and 2014 content, keeping 2015 (the latest release) untouched. At query
time, I'll add a filter which excludes content tagged with "dedup".
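
The query-time side would then be a one-line filter ("dedup" being the
hypothetical flag field):

fq=-dedup:true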

Just wondering if this is achievable by perhaps extending
UpdateRequestProcessorFactory or
customizing SignatureUpdateProcessorFactory ?

Any pointers will be appreciated.

Regards,
Shamik


Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-01 Thread shamik
Hi Kevin,

  Were you able to get a workaround / fix for your problem ? I'm also
looking to secure Collection and Update APIs by upgrading to 5.3. Just
wondering if it's worth the upgrade or should I wait for the next version,
which will probably address this.

Regards,
Shamik




