Re: Urgent- General Question about document Indexing frequency in solr

2021-02-04 Thread Scott Stults
Manisha,

The most general recommendation around commits is to not explicitly commit
after every update. There are settings that will let Solr automatically
commit after some threshold is met, and by delegating commits to that
mechanism you can generally ingest faster.

See this blog post that goes into detail about how to set that up for your
situation:

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
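For example, a minimal autocommit setup in solrconfig.xml might look like
this (the interval values below are illustrative, not a recommendation for
your workload):

<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit once a minute... -->
  <openSearcher>false</openSearcher>  <!-- ...without opening a new searcher -->
</autoCommit>

<autoSoftCommit>
  <maxTime>5000</maxTime>             <!-- make updates visible every 5 seconds -->
</autoSoftCommit>

With that in place the client just sends updates and never calls commit.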


Kind regards,
Scott


On Wed, Feb 3, 2021 at 5:44 PM Manisha Rahatadkar <
manisha.rahatad...@anjusoftware.com> wrote:

> Hi All
>
> Looking for some help on document indexing frequency. I am using Apache
> Solr 7.7 and the SolrNet library to commit documents to Solr. The summary
> for this function is:
> // Summary:
> // Commits posted documents, blocking until index changes are flushed
> to disk and
> // blocking until a new searcher is opened and registered as the main
> query searcher,
> // making the changes visible.
>
> I understand that the document gets reindexed after every commit. I have
> noticed that as the number of documents increases, the reindexing takes
> more time, and sometimes I get a Solr connection timeout error.
> I have following questions:
>
>   1.  Is there any frequency suggested by Solr for document insert/update
> and reindex? Is there any standard recommendation?
>   2.  If I remove the copy fields from managed-schema.xml, do I need to
> delete the existing indexed data from solr core and then insert data and
> reindex it again?
>
> Thanks in advance.
>
> Regards
> Manisha
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Distributing and scaling Lucene Monitor?

2021-02-03 Thread Scott Stults
Has anyone built scaling around Lucene Monitor? I worked with it when it
was Luwak, but I haven't had to scale it beyond a single node. There's all
of the cluster-ish framework in Solr, but Lucene Monitor is fairly
disconnected from that. I've seen the URP someone built around it, but that
doesn't seem to deal with CRUD operations on the monitor queries
themselves.

So has anyone built this or given some thought about how to incorporate the
monitor index into SolrCloud?


Thank you,
Scott

-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Solr Cloud on Docker?

2020-01-29 Thread Scott Stults
xample Docker configurations from command line
> > parameters to docker-compose files running multiple instances and
> > zookeeper quorums.
> > - The Docker extra hosts parameter is useful for adding extra hosts to
> > your container's hosts file, particularly if you have multiple NIC cards
> > with internal and external interfaces and you want to force communication
> > over a specific one.
> > - We use the Solr Prometheus exporter to collect node metrics. I've found
> > I've needed to reduce the metrics collected, as having this many nodes
> > overwhelmed it occasionally. From memory it had something to do with
> > concurrent modification of Future objects the collector uses, and it
> > sometimes misses collection cycles. This is not Docker related but Solr
> > size related and the exporter's ability to handle it.
> > - We use the zkCli script a lot for updating configsets. As I did not
> > want to have to copy them into a container to update them, I just
> > download a copy of the Solr binaries and use it entirely for this
> > zookeeper script. It's not elegant, but a number of our devs are not
> > familiar with Docker and this was a nice compromise. Another alternative
> > is to just use the REST API to do any configset manipulation.
> > - We load balance all of these nodes to external clients using an
> > haproxy Docker image. This, combined with the Docker restart policy and
> > Solr replication and autoscaling capabilities, provides a very stable
> > environment for us.
> >
> > All in all, migrating and running Solr on Docker has been brilliant. It
> > was primarily driven by a need to scale our environment vertically on
> > large hardware instances, as running 100 nodes on bare metal was too big
> > a maintenance and administrative burden for us with a small dev and
> > support team. To date it's been very stable and reliable, so I would
> > recommend the approach if you are in a similar situation.
> >
> > Thanks,
> >
> > Dwane
> >
> >
> >
> >
> >
> >
> > 
> > From: Walter Underwood 
> > Sent: Saturday, 14 December 2019 6:04 PM
> > To: solr-user@lucene.apache.org 
> > Subject: Solr Cloud on Docker?
> >
> > Does anyone have experience running a big Solr Cloud cluster on Docker
> > containers? By “big”, I mean 35 million docs, 40 nodes, 8 shards, with 36
> > CPU instances. We are running version 6.6.2 right now, but could upgrade.
> >
> > If people have specific things to do or avoid, I’d really appreciate it.
> >
> > I got a couple of responses on the Slack channel, but I’d love more
> > stories from the trenches. This is a direction for our company
> architecture.
> >
> > We have a master/slave cluster (Solr 4.10.4) that is awesome. I can
> > absolutely see running the slaves as containers. For Solr Cloud? Makes me
> > nervous.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Query terms and the match state

2019-09-09 Thread Scott Stults
Lucene has a SynonymQuery and a BlendedTermQuery that do something like you
want in different ways. However, if you want to keep your existing schema
and do this through Solr you can use the constant score syntax in edismax
on each term:

q=name:(corsair)^=1.0 name:(ddr)^=1.0 manu:(corsair)^=1.0 manu:(ddr)^=1.0

The resulting score will be the total number of times each term matched in
either field. (Note, if you group the terms together in the parentheses
like "name:(corsair ddr)^=1.0" you'll only know if either term matched --
the whole clause gets a score of 1.0). For the techproducts example corpus:

[
  {
"name":"CORSAIR  XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM
Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail",
"manu":"Corsair Microsystems Inc.",
"score":3.0},
  {
"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered
DDR 400 (PC 3200) System Memory - Retail",
"manu":"Corsair Microsystems Inc.",
"score":3.0},
  {
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR
400 (PC 3200) System Memory - OEM",
"manu":"A-DATA Technology Inc.",
"score":1.0}]


You could use this as the basis for a function query to gain more control
over your scoring.

Hope that helps!

-Scott


On Tue, Sep 3, 2019 at 1:35 PM Kumaresh AK  wrote:

> Hello Solr Community!
>
> *Problem*: I wish to know if the result document matched all the terms in
> the query. The ranking used in Solr works most of the time. For some cases
> where one of the terms is rare and occurs in a couple of fields, such
> documents trump a document which matches all the terms. Ideally I wish
> such a document (that matches all terms) to trump a document that
> matches only 9/10 terms but matches one of the rare terms twice.
> eg:
> *query1*
> field1:(a b c d) field2:(a b c d)
> Results of the above query looks good.
>
> *query2*
> field1:(a b c 5) field2:(a b c 5)
> result:
> doc1: {field1: b c 5 field2: b c 5}
> 
> doc21: {field1: a b c 5 field: null}
>
> Results are almost good except that doc21 is trailing doc1. There are a few
> documents similar to doc1, and they push doc21 to the next page (I use the
> default page size of 10).
>
> I understand that this is how tf-idf works. I tried to boost certain fields
> to solve this problem. But that breaks normal cases (query1). So, I set out
> to just solve the case where I wish to boost (or) augment a field with that
> information (as ratio of matched-terms/total-terms)
>
> *Ask:* Is it possible to get back the terms of the query and the matched
> state ?
>
> I tried
>
>- debug=query option (with the default select handler)
>- with terms in the debug response I could write a function query to
>know its match state
>
> Is this approach safe/performant for production use ? Is there a better
> approach to solve this problem ?
>
> Regards,
> Kumaresh
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: BBox question

2019-02-04 Thread Scott Stults
Hi Fernando,

Solr (Lucene) uses a tree-based index structure called a BKD tree. There's a good
write-up of the approach over on the Elasticsearch blog:

https://www.elastic.co/blog/lucene-points-6.0

and a cool animation of it in action on Youtube:

https://www.youtube.com/watch?v=x9WnzOvsGKs

The blog write-up and Jira issue talk about performance vs other approaches.


k/r,
Scott

On Mon, Feb 4, 2019 at 1:17 PM Fernando Otero 
wrote:

> Hey guys,
>   I was wondering if BBoxes use filters (i.e. go through all
> documents) or use the index to do a range filter?
> It's clear in the doc that the performance is better than geodist, but I
> couldn't find implementation details. I'm not sure if the performance comes
> from doing fewer comparisons, simpler calculations, or both (which I assume
> is the case).
>
> Thanks!
>
> --
>
> Fernando Otero
>
> Sr Engineering Manager, Panamera
>
> Buenos Aires - Argentina
>
> Email:  fernando.ot...@olx.com
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Query over nested documents with an AND Operator

2019-02-01 Thread Scott Stults
Hi Julia,

Keep in mind that in order to facet on child document fields you'll need to
use the block join facet component:
https://lucene.apache.org/solr/guide/7_4/blockjoin-faceting.html

For the query itself you probably need to specify each required attribute
value, but it looks like you're already heading down that path with the
facets. Add required local queries wrapped in the default query parser. The
local queries themselves would be block joins similar to this:

"+{!parent which=contenttype_s:parentDocument}attributevalue_s:brass
+{!parent which=contenttype_s:parentDocument}attributevalue_s:plastic"

That requires that a parent document satisfies both child document
constraints.

Also, if you want to return the child documents you'll need to use the
ChildDocTransformerFactory:
"fl=id,[child parentFilter=contenttype_s:parentDocument]"
(I'm not sure if that's required if you just want to facet on the child doc
values and not display the other fields.)

Hope that helps!

-Scott


On Fri, Feb 1, 2019 at 8:51 AM Mikhail Khludnev  wrote:

> Whats' your current query? It's probably a question of building boolean
> query by combining Solr queries.
> Note, this datamodel might be a little bit overwhelming, So, if number of
> distinct attributename values is around a thousand, just handle it via
> dynamic field without nesting docs:
>
>
>   brass
>
> 1
> >
> >   4711
> >
> >   here is a short text dealing with plastic and
> > brass
> >
> >   here is a detailed description
> >
> >   parentDocument
> >
> >   
> >
> > 
> >
> >   2
> >
> >   4811
> >
> >   here is a shorttext
> >
> >   here you will find a detailed
> description
> >
> >   parentDocument
> >
> >   
> >
> > 
> >
> >   2_1
> >
> >   material 
> >
> >   brass
> >
> >   
> >
> >   
> >
> >   2_2
> >
> >   material quality
> >
> >   plastic
> >
> >   
> >
> > 
> >
> > 
> >
> > I need an AND operator between my queries because I want to get as
> > accurate hits as possible. I managed to search all Parent and Child
> > Documents with one search term and get the right result.
> >
> > But if I want to search for example for plastic and brass (that means 2
> > or more search terms), I want to get both the parent document for the
> > respective child document as a result (article 4811), as well as article
> > 4711, because in that article the two words appear in the description.
> > But the result of my query is always only article 4711. I know that I
> > could also write the attributes in one field. However, I want to have a
> > facet on the attribute name.
> >
> >
> >
> > I hope you can help me with this problem.
> >
> >
> >
> > Thank you very much,
> >
> >
> >
> > Mit freundlichen Grüßen / Kind regards
> >
> >
> > *Julia Gelszus *
> > Bachelor of Science
> > Consultant SAP Development Workbench
> >
> >
> > *FIS Informationssysteme und Consulting GmbH *Röthleiner Weg 1
> > 97506 Grafenrheinfeld
> >
> > P +49 (9723) 9188-667
> > F +49 (9723) 9188-200
> > E j.gels...@fis-gmbh.de
> > www.fis-gmbh.de
> >
> > Managing Directors:
> > Ralf Bernhardt, Wolfgang Ebner, Frank Schöngarth
> >
> > Registration Office Schweinfurt HRB 2209
> >
> > <https://www.fis-gmbh.de/>  <https://de-de.facebook.com/FISgmbh>
> > <https://www.xing.com/companies/fisinformationssystemeundconsultinggmbh>
> > <http://www.kununu.com/de/all/de/it/fis-informationssysteme-consulting>
> > <https://www.youtube.com/channel/UC49711WwZ_tSIp_QnAWdeQA>
> >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Need to perfom search and group the record on basis of domain,subject,from address and display the count of label i.e inbox,spam

2019-02-01 Thread Scott Stults
Hi Swapnil,

There wasn't a question in your post, so I'm guessing you're having
trouble getting started. Take a look at the JSON Facet API. That should get
you most of the way there.

https://lucene.apache.org/solr/guide/7_5/json-facet-api.html
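As a rough sketch, nested terms and query facets can produce the per-group
counts you need. The collection name here is a placeholder, and turning the
bucket counts into percentages would happen on the client side:

curl http://localhost:8983/solr/emails/query -d '
{
  "query": "*:*",
  "limit": 0,
  "facet": {
    "domains": {
      "type": "terms",
      "field": "domain",
      "facet": {
        "inbox":  { "query": "label:inbox" },
        "spam":   { "query": "label:spam" },
        "read":   { "query": "label_status:read" },
        "unread": { "query": "label_status:unread" }
      }
    }
  }
}'

To group on domain, subject, and from address at once, the usual trick is to
index a single combined key field and run the terms facet on that; terms
facets also support offset and limit for pagination.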

k/r,
Scott

On Fri, Feb 1, 2019 at 7:36 AM swap  wrote:

> Need to perform a search and group the records on the basis of domain,
> subject, and from address, and display the count of label (i.e. inbox,
> spam) and label status (i.e. read and unread) with it. The label and label
> status should be displayed as percentages.
>
> Scenario 1
> The document structure indexed in Solr is as mentioned below. message_id is
> the unique field in Solr.
>   {
> "email_date_time": 1548922689,
> "subject": "abcdef",
> "created": 1548932108,
> "domain": ".com",
> "message_id": "123456789ui",
> "label": "inbox",
> "from_address": xxxbc.com",
> "email": "g...@gmail.com",
> "label_status": "unread"
>   }
>
>   {
> "email_date_time": 1548922689,
> "subject": "abcdef",
> "created": 1548932108,
> "domain": ".com",
> "message_id": "zxiu22",
> "label": "inbox",
> "from_address": xxxbc.com",
> "email": "g...@gmail.com",
> "label_status": "unread"
>   }
>
>   {
> "email_date_time": 1548922689,
> "subject": "defg",
> "created": 1548932108,
> "domain": ".com",
> "message_id": "ftyuiooo899",
> "label": "inbox",
> "from_address": xxxbc.com",
> "email": "f...@gmail.com",
> "label_status": "unread"
>   }
>
> I have the below-mentioned points to be implemented:
>
> 1. Need to perform a search and group the records on the basis of domain,
> subject, and from address, and display the count of label (i.e. inbox,
> spam) and label status (i.e. read and unread) with it. The label and label
> status should be displayed as percentages.
>
>
> 2. Need to paginate the records along with implementation 1.
>
>
> Display will be as mentioned below
>
>
> 1. domain name : @ subject:hello from address: abcd@i
>
> inbox percentage : 20% spam percentage : 80%
> read percentage  : 30%  unread percentage : 70%
>
> 2. domain name : @ subject:hi from address: abcd@i
>
> inbox percentage : 20% spam percentage : 80%
> read percentage  : 30%  unread percentage : 70%
>
>
> 3. domain name : @ subject:where from address: abcd@i
>
> inbox percentage : 20% spam percentage : 80%
> read percentage  : 30%  unread percentage : 70%
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: PatternReplaceFilterFactory problem

2019-01-28 Thread Scott Stults
Hi Chris,

You've included the field definition of type text_en, but in your queries
you're searching the field "text", which is of type text_general. That may
be the source of your problem, but if looking into that doesn't help, send
the definition of text_general as well.

Hope that helps!

-Scott

On Mon, Jan 28, 2019 at 6:02 AM Chris Wareham <
chris.ware...@graduate-jobs.com> wrote:

> I'm trying to index some data which often includes domain names. I'd
> like to remove the .com TLD, so I have modified the text_en field type
> by adding a PatternReplaceFilterFactory filter. However, it doesn't
> appear to be working as a search for "text:(mydomain.com)" matches
> records but "text:(mydomain)" does not.
>
> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([-a-z])\.com" replacement="$1"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([-a-z])\.com" replacement="$1"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> The actual field definitions are as follows:
>
> stored="true"  required="true" />
> stored="true"  required="true" />
> stored="false" />
>
>
>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Aggregate functions

2019-01-28 Thread Scott Stults
Yes. Have a look at the Facet API:
https://lucene.apache.org/solr/guide/7_5/json-facet-api.html
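For example, a sum aggregation with the JSON Facet API looks roughly like
this (the collection name is a placeholder; "duration" is the numeric field
from your query):

curl http://localhost:8983/solr/mycollection/query -d '
{
  "query": "appName:\"test\"",
  "limit": 0,
  "facet": {
    "total_duration": "sum(duration)"
  }
}'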


On Mon, Jan 28, 2019 at 6:07 AM naga pradeep dhulipalla <
naga.prade...@gmail.com> wrote:

> Hi Team,
>
>
>
> Can we use SUM aggregate function in our SOLR queries. If not is there an
> alternative to achieve this.
>
> My sample query looks like this as mentioned below.
>
>
>
> Select duration from tableName where
> solr_query='{"q":"(appName:\"test\")"}'
>
>
>
> I need the aggregate SUM value of duration column. Thanks for your quick
> help.
>
>
>
> Regards
>
> Pradeep
>
> +917204007740
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Region wise query routing with solr

2019-01-28 Thread Scott Stults
Hi Shruti,

Solr clusters should NOT span regions, so when a query hits a particular
cluster in a region that query should be handled by nodes in that region
and not forwarded to another. My recommendation is to check out
cross-datacenter replication and route requests to the correct region (with
a load balancer or DNS tricks) rather than routing queries to the correct
cluster.

https://lucene.apache.org/solr/guide/7_6/cdcr-architecture.html


k/r,
Scott



On Mon, Jan 28, 2019 at 2:24 AM shruti suri  wrote:

> Hi,
>
> I want to configure region-wise query routing with Solr. Suppose I have
> data centers in Singapore and India; if a user issues a query from India,
> the query should go to the Indian data center, and likewise for Singapore.
> How can I achieve this? Is there any such functionality in Solr or
> SolrCloud?
>
> Thanks
>
>
>
> -
> Regards
> Shruti
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Active node "kicked out" when starting a new node

2019-01-28 Thread Scott Stults
Hi Teddie,

Take a look at the core.properties file on the cloned node. I suspect
there's info in it that describes which collection and shard that node is
responsible for. Zookeeper maintains a mapping of node addresses to cores
and you can lock a node out of the cluster if you're not careful.

This used to be a common mistake with naive autoscaling where a "new" node
would spin up with the same IP as an old node before the old one was
properly removed from the cluster. Solr 7 has better autoscaling
capabilities now:
https://lucene.apache.org/solr/guide/7_6/solrcloud-autoscaling-overview.html
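For reference, core.properties is a small Java properties file in the core's
instance directory. On your setup it would look something like this (the
values are illustrative, based on the names in your message):

#Written by CorePropertiesLocator
name=items_shard1_replica_n1
collection=items
shard=shard1
coreNodeName=core_node3

If a cloned machine carries this file (plus the old Zookeeper address in its
startup config), it will register itself as that same replica, which matches
the takeover behavior you saw.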


k/r,
Scott

On Mon, Jan 28, 2019 at 1:44 AM teddie_lee  wrote:

> Hi,
>
> I have a SolrCloud cluster with 3 nodes running on AWS. My collection is
> created with numShards=1 and replicationFactor=3. Recently, due to the need
> to run a stress test, our ops cloned a new machine with exactly the same
> configuration as one of the nodes in the existing cluster (let's say the
> new machine is node4 and the node being cloned is node1).
>
>
> However, after I started node4 mistakenly (node4 is supposed to start in
> standalone mode, I just forgot to remove the configuration regards to
> zookeeper), I could see that node4 took the place of node1 in Admin UI.
> Then
> I found that directory 'items_shard1_replica_n1' under path
> '../solr/server/solr/' no longer exists on node1. Instead, the directory
> was copied to node4.
>
>
> I tried to stop Solr on node4 and restarted Solr on node1 but to no avail.
> It seems like node1 can't rejoin the cluster automatically. Then I found
> that even if I started Solr on node4, its status was still 'Down' and never
> became 'Recovering' while the rest of the nodes in the cluster were
> 'Active'.
>
> So the final solution was to copy directory 'items_shard1_replica_n1' from
> node4 back to node1 and restart Solr on node1. Then node1 joined the
> cluster automatically and everything seems fine.
>
>
> My question is why this would happen? Or are there any documents about how
> SolrCloud manages the cluster behind the scenes?
>
>
> Thanks,
> Teddie
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Log Statements: Collection, Shard, Replica, and Core Info Missing

2019-01-28 Thread Scott Stults
Hi Alicia,

You've probably already tried this but just to check all the basics, verify
that each log4j2.xml file is the same on all of your servers. Then go to
the logging config admin page on each machine and verify that none of the
overrides have been enabled. The overrides there are temporary, so you can
either reset them if they've been changed or restart the instance to get
back to default.

If none of that helps let us know how many nodes you're running, and
double-check the file permissions on log4j2.xml. You could also make a
slight modification to the format string just to verify that it's indeed
being read.

Hope that helps!
Scott

On Fri, Jan 25, 2019 at 6:35 PM Alicia Broederdorf 
wrote:

> I’m using the SLF4J Reporter for logging metrics (
> https://lucene.apache.org/solr/guide/7_5/metrics-reporting.html#slf4j-reporter).
> I have two collections with 5 shards each. Only 3 shards of one collection
> are printing collection, shard, replica, and core data in the log
> statements, the others do not. For the same metric log statement this data
> is only present for 3 of the 10 shards.
>
> The three shards will have something like: 2019-01-25 21:41:05.297 INFO
> (metrics-org.apache.solr.metrics.reporters.SolrSlf4jReporter-6-thread-1)
> [c:coll_1 s:shard2 r:core_node13 x:coll_1_shard2_replica_n10] type=GAUGE,
> name=SEARCHER.searcher.numDocs, value=236140
>
> Others will have: 2019-01-25 21:41:07.125 INFO
> (metrics-org.apache.solr.metrics.reporters.SolrSlf4jReporter-8-thread-1) [
> ] type=GAUGE, name=SEARCHER.searcher.numDocs, value=899794
>
>
>
> Here is the config for my metrics log in log4j2.xml:
> <RollingFile name="MetricsFile"
>              fileName="<%= @solr_logs %>/solr_metrics.log"
>              filePattern="<%= @solr_logs %>/solr_metrics.log.%i" >
>   <PatternLayout>
>     <Pattern>
>       %d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{collection} %X{shard}
> %X{replica} %X{core}] %c{1.} %m%n
>     </Pattern>
>   </PatternLayout>
>   <Policies>
>     <SizeBasedTriggeringPolicy size="32 MB"/>
>   </Policies>
>   <DefaultRolloverStrategy max="10"/>
> </RollingFile>
>
> Any thoughts on how to get the collection, shard, replica, and core data
> printed in every log statement?
>
> Thanks for the help!
> Alicia
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Need help on Solr authorization

2019-01-18 Thread Scott Stults
nnection.java:283)\n\tat
> > org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)\n\tat
> > org.eclipse.jetty.io
> .SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat
> >
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\tat
> >
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\tat
> >
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\tat
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\tat
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tat
> > java.lang.Thread.run(Thread.java:748)\nCaused by:
> > javax.net.ssl.SSLHandshakeException:
> > sun.security.validator.ValidatorException: PKIX path building failed:
> > sun.security.provider.certpath.SunCertPathBuilderException: unable to
> find
> > valid certification path to requested target\n\tat
> > sun.security.ssl.Alerts.getSSLException(Alerts.java:192)\n\tat
> > sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1959)\n\tat
> > sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)\n\tat
> > sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)\n\tat
> >
> sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)\n\tat
> >
> sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)\n\tat
> > sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)\n\tat
> > sun.security.ssl.Handshaker.process_record(Handshaker.java:961)\n\tat
> > sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1072)\n\tat
> >
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)\n\tat
> >
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1413)\n\tat
> >
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1397)\n\tat
> >
> org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)\n\tat
> >
> org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)\n\tat
> >
> org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)\n\tat
> >
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)\n\tat
> >
> org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)\n\tat
> >
> org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)\n\tat
> >
> org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)\n\tat
> > org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)\n\tat
> >
> org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)\n\tat
> >
> org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)\n\tat
> >
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)\n\tat
> >
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)\n\tat
> >
> org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:618)\n\t...
> > 33 more\nCaused by: sun.security.validator.ValidatorException: PKIX path
> > building failed:
> > sun.security.provider.certpath.SunCertPathBuilderException: unable to
> find
> > valid certification path to requested target\n\tat
> >
> sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:397)\n\tat
> >
> sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:302)\n\tat
> > sun.security.validator.Validator.validate(Validator.java:260)\n\tat
> >
> sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)\n\tat
> >
> sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)\n\tat
> >
> sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)\n\tat
> >
> sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1496)\n\t...
> > 53 more\nCaused by:
> > sun.security.provider.certpath.SunCertPathBuilderException: unable to
> find
> > valid certification path to requested target\n\tat
> >
> sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)\n\tat
> >
> sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)\n\tat
> > java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)\n\tat
> >
> sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:392)\n\t...
> > 59 more\n",
> >
> > "code":500}}
> >
> >
> >
> >
> >
> > Regards,
> >
> > Sathish.
> >
> >
> >
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: [QA-search] About field setting

2019-01-18 Thread Scott Stults
No, you have to tokenize before you filter, but the Keyword tokenizer
outputs the whole input text as a single token.
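So a "synonyms before tokenizing" setup isn't possible, but you can get
whole-string synonym matching with something like this (a minimal sketch;
the field type name is made up):

<fieldType name="text_keyword_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- emits the entire input as a single token... -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- ...so synonyms are matched against the whole string -->
    <filter class="solr.SynonymGraphFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
  </analyzer>
</fieldType>

As for OR-ing several query analyzers: a field has exactly one query
analyzer, so the usual approach is to copyField the text into several fields
with different analysis and query them together (e.g. with edismax qf).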

On Thu, Jan 17, 2019 at 11:36 PM 유정인  wrote:

> hi
> Can you use multiple query analyzers and search with OR across them?
>
> Ex)
>
>  positionIncrementGap="100" multiValued="true">
>
> 
>
>   
>
>ignoreCase="true"/>
>
>   
>
> 
>
> 
>
>   
>
>ignoreCase="true"/>
>
>   
>
> 
>
> 
>
>   
>
>ignoreCase="true"/>
>
>ignoreCase="true" synonyms="synonyms.txt"/>
>
>   
>
> 
>
> 
>
>
>
> Can you get synonyms to run before the tokenizer?
>
> Ex)
>
>  positionIncrementGap="100" multiValued="true">
>
> 
>
>   
>
>ignoreCase="true"/>
>
>   
>
> 
>
> 
>
> ignoreCase="true" synonyms="synonyms.txt"/>
>
>   
>
>ignoreCase="true"/>
>
> 
>
> 
>
> 
>
>
>
>
>
> thanks
>
>

-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: regarding debugging solr in eclipse

2019-01-18 Thread Scott Stults
This blog article might help:
https://opensourceconnections.com/blog/2013/04/13/how-to-debug-solr-with-eclipse/
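The short version: start Solr with the JDWP agent enabled and attach a
remote debugger. A sketch (the debug port is arbitrary):

bin/solr start -f -a "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=18983"

Then in Eclipse create a "Remote Java Application" debug configuration
pointing at localhost:18983, with the Solr source checkout attached.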



On Fri, Jan 18, 2019 at 6:53 AM SAGAR INGALE 
wrote:

> Can anybody tell me how to debug Solr in Eclipse? If possible, how can I
> build a Maven project and launch the Jetty server in debug mode?
> Thanks. Regards
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: So Many Zookeeper Warnings--There Must Be a Problem

2019-01-03 Thread Scott Stults
Good! Hopefully that's your smoking gun.

The port settings are fine, but since you're deploying to separate servers
you don't need different ports in the "server.x=" section. This section of
the docs explains it better:

http://zookeeper.apache.org/doc/r3.4.7/zookeeperAdmin.html#sc_zkMulitServerSetup
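In other words, with one Zookeeper per server the conventional form is
simply:

server.1=host1:2888:3888
server.2=host2:2888:3888
server.3=host3:2888:3888

Distinct port pairs per server are only needed when several Zookeeper
instances share a single machine, as in a local test ensemble.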


On Thu, Jan 3, 2019 at 3:49 PM Joe Lerner  wrote:

> Hi Scott,
>
> First, we are definitely misconfigured for the myid thing. Basically two of
> them were identifying as ID #2, and they are the two ZKs claiming to be the
> leader. Definitely something to straighten out!
>
> Our 3 lines in zoo.cfg look correct. Except they look like this:
>
> clientPort=2181
>
> server.1=host1:2190:2195
> server.2=host2:2191:2196
> server.3=host3:2192:2197
>
> Notice the port range, and overlap...
>
> Is that... copacetic?
>
> Thanks!
>
> Joe
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: So Many Zookeeper Warnings--There Must Be a Problem

2019-01-03 Thread Scott Stults
Hi Joe,

Yeah, two leaders is definitely a problem. I'd fix that before wading
through the error logs.

Check out zoo.cfg on each server. You should have three lines at the end
similar to this:

server.1=host1:2181:2281
server.2=host2:2182:2282
server.3=host3:2183:2283

(substitute "host*" with the right IP or address of your servers)

Also on each server, check the file "myid". It should have a single number
that maps to the list above. For example, on host1 your myid file should
contain a single value of "1" in it. On host2 the file should contain "2".

You'll probably have to delete the contents of the zk data directory and
rebuild your collections.



On Thu, Jan 3, 2019 at 2:47 PM Joe Lerner  wrote:

> Hi,
>
> We have a simple architecture: 2 SOLR Cloud servers (on servers #1 and #2),
> and 3 zookeeper instances (on servers #1, #2, and #3). Things work fine
> (although we had a couple of brief unexplained outages), but:
>
> One worrisome thing is that when I status zookeeper on #1 and #2, I get
> Mode=Leader on both--#3 shows follower. This seems to be a pretty permanent
> condition, at least right now as I look at it. And there isn't any big
> maintenance or anything going on.
>
> Also, we are getting *TONS* of continuous log warnings from our client
> applications. From one server it shows this:
>
>
>
> And from another server we get this:
>
>
> These are making our logs impossible to read, but worse, I assume indicate
> that something is wrong.
>
> Thanks for any help!
>
> Joe Lerner
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Excessive resources consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozen times more resources)

2017-08-28 Thread Scott Stults
Dani,

It might be time to attach some instrumentation to one of your nodes.
Finding out which classes are occupying the memory will help narrow the
issue.

Are you using a lot of facets, grouping, or stats during your queries?
Also, when you were doing Master/Slave, was that on the same version of
Solr as you're using now in SolrCloud mode?


-Scott

On Mon, Aug 28, 2017 at 4:50 AM, Daniel Ortega <danielortegauf...@gmail.com>
wrote:

> Hi Scott,
>
> Yes, we think that our usage scenario falls into Index-Heavy/Query-Heavy
> too. We have tested several softcommit/hardcommit values
> (from a few seconds to minutes) with no appreciable improvements :(
>
> Thanks for your reply!
>
> - Daniel
>
> 2017-08-25 6:45 GMT+02:00 Scott Stults <sstu...@opensourceconnections.com
> >:
>
> > Hi Dani,
> >
> > It seems like your use case falls into the Index-Heavy / Query-Heavy
> > category, so you might try increasing your hard commit frequency to 15
> > seconds rather than 15 minutes:
> >
> > https://lucidworks.com/2013/08/23/understanding-
> > transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> >
> > -Scott
> >
> > On Thu, Aug 24, 2017 at 10:03 AM, Daniel Ortega <
> > danielortegauf...@gmail.com
> > > wrote:
> >
> > > Hi Scott,
> > >
> > > In our indexing service we are using that client too
> > > (org.apache.solr.client.solrj.impl.CloudSolrClient) :)
> > >
> > > This is our Update Request Processor chain configuration:
> > >
> > > <updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="signature">
> > >   <bool name="enabled">true</bool>
> > >   <str name="signatureField">hash</str>
> > >   <bool name="overwriteDupes">false</bool>
> > >   <str name="signatureClass">solr.processor.Lookup3Signature</str>
> > > </updateProcessor>
> > >
> > > <updateRequestProcessorChain processor="signature" name="dedupe">
> > >   <processor class="solr.LogUpdateProcessorFactory" />
> > >   <processor class="solr.RunUpdateProcessorFactory" />
> > > </updateRequestProcessorChain>
> > >
> > > <requestHandler name="/update" class="solr.UpdateRequestHandler">
> > >   <lst name="defaults">
> > >     <str name="update.chain">dedupe</str>
> > >   </lst>
> > > </requestHandler>
> > >
> > > Thanks for your reply :)
> > >
> > > - Dani
> > >
> > > 2017-08-24 14:49 GMT+02:00 Scott Stults <sstults@
> > opensourceconnections.com
> > > >:
> > >
> > > > Hi Daniel,
> > > >
> > > > SolrJ has a few client implementations to choose from:
> CloudSolrClient,
> > > > ConcurrentUpdateSolrClient, HttpSolrClient, LBHttpSolrClient. You
> said
> > > your
> > > > query service uses CloudSolrClient, but it would be good to verify
> > which
> > > > implementation your indexing service uses.
> > > >
> > > > One of the problems you might be having is with your deduplication
> > step.
> > > > Can you post your Update Request Processor Chain?
> > > >
> > > >
> > > > -Scott
> > > >
> > > >
> > > > On Wed, Aug 23, 2017 at 4:13 PM, Daniel Ortega <
> > > > danielortegauf...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Scott,
> > > > >
> > > > > - *Can you describe the process that queries the DB and sends
> records
> > > to
> > > > *
> > > > > *Solr?*
> > > > >
> > > > > We are enqueueing ids during every ORACLE transaction (in
> > > > insert/updates).
> > > > >
> > > > > An application dequeues every id and perform queries against dozen
> of
> > > > > tables in the relational model to retrieve the fields to build the
> > > > > document.  As we know that we are modifying the same ORACLE row in
> > > > > different (but consecutive) transactions, we store only the last
> > > version
> > > > of
> > > > > the modified documents in a map data structure.
> > > > >
> > > > > The application has a configurable interval to send the documents
> > > stored
> > > > in
> > > > > the map to the update handler (we have tested different intervals
> > from
> > > > few
> > > > > milliseconds to several seconds) using the SolrJ client. Actually
> we
> > > are
> > > > > sending all the documents every 15 seconds.
> > > > >
> > > > > This application is developed using Java, Spring and Maven and we
> >

Re: Excessive resources consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozen times more resources)

2017-08-24 Thread Scott Stults
Hi Dani,

It seems like your use case falls into the Index-Heavy / Query-Heavy
category, so you might try increasing your hard commit frequency to 15
seconds rather than 15 minutes:

https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/


-Scott

On Thu, Aug 24, 2017 at 10:03 AM, Daniel Ortega <danielortegauf...@gmail.com
> wrote:

> Hi Scott,
>
> In our indexing service we are using that client too
> (org.apache.solr.client.solrj.impl.CloudSolrClient) :)
>
> This is our Update Request Processor chain configuration:
>
> <updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="signature">
>   <bool name="enabled">true</bool>
>   <str name="signatureField">hash</str>
>   <bool name="overwriteDupes">false</bool>
>   <str name="signatureClass">solr.processor.Lookup3Signature</str>
> </updateProcessor>
>
> <updateRequestProcessorChain processor="signature" name="dedupe">
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> <requestHandler name="/update" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">dedupe</str>
>   </lst>
> </requestHandler>
>
> Thanks for your reply :)
>
> - Dani
>
> 2017-08-24 14:49 GMT+02:00 Scott Stults <sstu...@opensourceconnections.com
> >:
>
> > Hi Daniel,
> >
> > SolrJ has a few client implementations to choose from: CloudSolrClient,
> > ConcurrentUpdateSolrClient, HttpSolrClient, LBHttpSolrClient. You said
> your
> > query service uses CloudSolrClient, but it would be good to verify which
> > implementation your indexing service uses.
> >
> > One of the problems you might be having is with your deduplication step.
> > Can you post your Update Request Processor Chain?
> >
> >
> > -Scott
> >
> >
> > On Wed, Aug 23, 2017 at 4:13 PM, Daniel Ortega <
> > danielortegauf...@gmail.com>
> > wrote:
> >
> > > Hi Scott,
> > >
> > > - *Can you describe the process that queries the DB and sends records
> to
> > *
> > > *Solr?*
> > >
> > > We are enqueueing ids during every ORACLE transaction (in
> > insert/updates).
> > >
> > > An application dequeues every id and perform queries against dozen of
> > > tables in the relational model to retrieve the fields to build the
> > > document.  As we know that we are modifying the same ORACLE row in
> > > different (but consecutive) transactions, we store only the last
> version
> > of
> > > the modified documents in a map data structure.
> > >
> > > The application has a configurable interval to send the documents
> stored
> > in
> > > the map to the update handler (we have tested different intervals from
> > few
> > > milliseconds to several seconds) using the SolrJ client. Actually we
> are
> > > sending all the documents every 15 seconds.
> > >
> > > This application is developed using Java, Spring and Maven and we have
> > > several instances.
> > >
> > > -* Is it a SolrJ-based application?*
> > >
> > > Yes, it is. We aren't using the latest version of the SolrJ client (we are
> > > currently using SolrJ v6.3.0).
> > >
> > > - *If it is, which client package are you using?*
> > >
> > > I don't know exactly what you mean by 'client package' :)
> > >
> > > - *How many documents do you send at once?*
> > >
> > > It depends on the defined interval described before and the number of
> > > transactions executed in our relational database. From dozens to few
> > > hundreds (and even thousands).
> > >
> > > - *Are you sending your indexing or query traffic through a load
> > balancer?*
> > >
> > > We aren't using a load balancer for indexing, but we have all our Rest
> > > Query services through an HAProxy (using 'leastconn' algorithm). The
> Rest
> > > Query Services performs queries using the CloudSolrClient.
> > >
> > > Thanks for your reply,
> > > if you need any further information don't hesitate to ask
> > >
> > > Daniel
> > >
> > > 2017-08-23 14:57 GMT+02:00 Scott Stults <sstults@
> > opensourceconnections.com
> > > >:
> > >
> > > > Hi Daniel,
> > > >
> > > > Great background information about your setup! I've got just a few
> more
> > > > questions:
> > > >
> > > > - Can you describe the process that queries the DB and sends records
> to
> > > > Solr?
> > > > - Is it a SolrJ-based application?
> > > > - If it is, which client package are you using?
> > 

Re: Excessive resources consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozen times more resources)

2017-08-24 Thread Scott Stults
Hi Daniel,

SolrJ has a few client implementations to choose from: CloudSolrClient,
ConcurrentUpdateSolrClient, HttpSolrClient, LBHttpSolrClient. You said your
query service uses CloudSolrClient, but it would be good to verify which
implementation your indexing service uses.

One of the problems you might be having is with your deduplication step.
Can you post your Update Request Processor Chain?


-Scott


On Wed, Aug 23, 2017 at 4:13 PM, Daniel Ortega <danielortegauf...@gmail.com>
wrote:

> Hi Scott,
>
> - *Can you describe the process that queries the DB and sends records to *
> *Solr?*
>
> We are enqueueing ids during every ORACLE transaction (in insert/updates).
>
> An application dequeues every id and performs queries against dozens of
> tables in the relational model to retrieve the fields to build the
> document.  As we know that we are modifying the same ORACLE row in
> different (but consecutive) transactions, we store only the last version of
> the modified documents in a map data structure.
>
> The application has a configurable interval to send the documents stored in
> the map to the update handler (we have tested different intervals from few
> milliseconds to several seconds) using the SolrJ client. Actually we are
> sending all the documents every 15 seconds.
>
> This application is developed using Java, Spring and Maven and we have
> several instances.
>
> -* Is it a SolrJ-based application?*
>
> Yes, it is. We aren't using the latest version of the SolrJ client (we are
> currently using SolrJ v6.3.0).
>
> - *If it is, which client package are you using?*
>
> I don't know exactly what you mean by 'client package' :)
>
> - *How many documents do you send at once?*
>
> It depends on the defined interval described before and the number of
> transactions executed in our relational database. From dozens to few
> hundreds (and even thousands).
>
> - *Are you sending your indexing or query traffic through a load balancer?*
>
> We aren't using a load balancer for indexing, but we have all our Rest
> Query services through an HAProxy (using 'leastconn' algorithm). The Rest
> Query Services performs queries using the CloudSolrClient.
>
> Thanks for your reply,
> if you need any further information don't hesitate to ask
>
> Daniel
>
> 2017-08-23 14:57 GMT+02:00 Scott Stults <sstu...@opensourceconnections.com
> >:
>
> > Hi Daniel,
> >
> > Great background information about your setup! I've got just a few more
> > questions:
> >
> > - Can you describe the process that queries the DB and sends records to
> > Solr?
> > - Is it a SolrJ-based application?
> > - If it is, which client package are you using?
> > - How many documents do you send at once?
> > - Are you sending your indexing or query traffic through a load balancer?
> >
> > If you're sending documents to each replica as fast as they can take
> them,
> > you might be seeing a bottleneck at the shard leaders. The SolrJ
> > CloudSolrClient finds out from Zookeeper which nodes are the shard
> leaders
> > and sends docs directly to them.
> >
> >
> > -Scott
> >
> > On Tue, Aug 22, 2017 at 2:16 PM, Daniel Ortega <
> > danielortegauf...@gmail.com>
> > wrote:
> >
> > > *Main Problems*
> > >
> > >
> > > We are involved in a migration from Solr Master/Slave infrastructure to
> > > SolrCloud infrastructure.
> > >
> > >
> > >
> > > The main problems that we have now are:
> > >
> > >
> > >
> > >- Excessive resources consumption: Currently we have 5 instances
> with
> > 80
> > >processors/768 GB RAM each instance using SSD Hard Disk Drives that
> > > doesn't
> > >support the load that we have in the other architecture. In our
> > >Master-Slave architecture we have only 7 Virtual Machines with lower
> > > specs
> > >(4 processors and 16 GB each instance using SSD Hard Disk Drives
> too).
> > > So,
> > >at the moment our SolrCloud infrastructure is wasting several dozen
> > > times
> > >more resources than our Solr Master/Slave infrastructure.
> > >- Despite spending more resources we have worst query times
> (compared
> > to
> > >Solr in master/slave architecture)
> > >
> > >
> > > *Search infrastructure (SolrCloud infrastructure)*
> > >
> > >
> > >
> > > As we cannot use DIH Handler (which is what we use in Solr
> Master/Slave),
> > > we
> > > have developed an application whi

Re: Excessive resources consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozen times more resources)

2017-08-23 Thread Scott Stults
Hi Daniel,

Great background information about your setup! I've got just a few more
questions:

- Can you describe the process that queries the DB and sends records to
Solr?
- Is it a SolrJ-based application?
- If it is, which client package are you using?
- How many documents do you send at once?
- Are you sending your indexing or query traffic through a load balancer?

If you're sending documents to each replica as fast as they can take them,
you might be seeing a bottleneck at the shard leaders. The SolrJ
CloudSolrClient finds out from Zookeeper which nodes are the shard leaders
and sends docs directly to them.
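A minimal SolrJ sketch of that (the zkHost string and collection name are
placeholders for your environment):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LeaderAwareIndexer {
  public static void main(String[] args) throws Exception {
    // CloudSolrClient discovers shard leaders from Zookeeper and routes
    // each document straight to its leader -- no load balancer needed.
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")  // assumed ensemble
        .build()) {
      client.setDefaultCollection("adverts");       // assumed collection
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "advert-1");
      client.add(doc);
      // No explicit commit: let autoCommit/autoSoftCommit handle visibility.
    }
  }
}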


-Scott

On Tue, Aug 22, 2017 at 2:16 PM, Daniel Ortega <danielortegauf...@gmail.com>
wrote:

> *Main Problems*
>
>
> We are involved in a migration from Solr Master/Slave infrastructure to
> SolrCloud infrastructure.
>
>
>
> The main problems that we have now are:
>
>
>
>- Excessive resources consumption: Currently we have 5 instances with 80
>processors/768 GB RAM each instance using SSD Hard Disk Drives that don't
>support the load that we have in the other architecture. In our
>Master-Slave architecture we have only 7 Virtual Machines with lower
> specs
>(4 processors and 16 GB each instance using SSD Hard Disk Drives too).
> So,
>at the moment our SolrCloud infrastructure is wasting several dozen
> times
>more resources than our Solr Master/Slave infrastructure.
>- Despite spending more resources we have worst query times (compared to
>Solr in master/slave architecture)
>
>
> *Search infrastructure (SolrCloud infrastructure)*
>
>
>
> As we cannot use DIH Handler (which is what we use in Solr Master/Slave),
> we
> have developed an application which reads every transaction from Oracle,
> builds a document collection searching in the database and sends the result
> to the */update* handler every 200 milliseconds using SolrJ client. This
> application tries to delete the possible duplicates in each update window,
> but we are using solr’s de-duplication techniques
> <https://cwiki.apache.org/confluence/display/solr/De-Duplication>
>  too.
>
>
>
> We are indexing ~100 documents per second (with peaks of ~1000 documents
> per second).
>
>
>
> Every search query is centralized in other application which exposes a DSL
> behind a REST API and uses SolrJ client too to perform queries. We have
> peaks of 2000 QPS.
>
> *Cluster structure **(SolrCloud infrastructure)*
>
>
>
> At the moment, the cluster has 30 SolrCloud instances with the same specs
> (Same physical hosts, same JVM Settings, etc.).
>
>
>
> *Main collection*
>
>
>
> In our use case we are using this collection as a NoSQL database basically.
> Our document is composed of about 300 fields that represents an advert, and
> is a denormalization of its relational representation in Oracle.
>
>
> We are using all our nodes to store the  collection in 3 shards. So, each
> shard has 10 replicas.
>
>
> At the moment, we are only indexing a subset of the adverts stored in
> Oracle, but our goal is to store all the ads that we have in the DB (a few
> tens of millions of documents). We have NRT requirements, so we need to
> index every document as soon as posible once it’s changed in Oracle.
>
>
>
> We have defined the properties of each field (if it’s stored/indexed or
> not, if should be defined as DocValue, etc…) considering the use of that
> field.
>
>
>
> *Index size **(SolrCloud infrastructure)*
>
>
>
> The index size is currently above 6 GB, storing 1.300.000 documents in each
> shard. So, we are storing 3.900.000 documents and the total index size is
> 18 GB.
>
>
>
> *Indexation **(SolrCloud infrastructure)*
>
>
>
> The commits *aren’t* triggered by the application described before. The
> hardcommit/softcommit interval are configured in Solr:
>
>
>
>- *HardCommit:* every 15 minutes (with opensearcher = false)
>- *SoftCommit:* every 5 seconds
>
>
>
> *Apache Solr Version*
>
>
>
> We are currently using the last version of Solr (6.6.0) under an Oracle VM
> (Java(TM) SE Runtime Environment (build 1.8.0_131-b11) Oracle (64 bits)) in
> both deployments.
>
>
> The question is... What is wrong here?!?!?!
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: solr jetty based auth and distributed solr requests

2017-08-23 Thread Scott Stults
Radhakrishnan,

I'm not sure offhand whether or not that's possible. It sounds like you've
done enough analysis to write a good Jira ticket, so if nobody speaks up on
the mailing list, go ahead and create one.


Cheers,
Scott

On Tue, Aug 22, 2017 at 7:15 PM, radha krishnan <dradhakrishna...@gmail.com>
wrote:

> Hi,
>
> I enabled jetty basic auth for Solr by making changes to jetty.xml and
> adding a 'realm.properties' file.
>
> While basic queries are working, queries involving more than one shard are
> not working. I went through the code and figured out that in
> HttpShardHandler, there is no provision to specify a username:password.
>
> I went through a lot of JIRAs/posts and was not able to figure out whether
> it is really possible to do.
>
> Can we do a distributed operation with jetty-based basic auth? Can you
> please share the relevant links so that I can try it out?
>
>
> Thanks,
> Radhakrishnan
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Facet date Range without start and and date

2017-01-12 Thread Scott Stults
No it's not. Use something like facet.date.start=0000-00-00T00:00:00Z
and facet.date.end=3000-00-00T00:00:00Z.


k/r,
Scott

On Mon, Jan 9, 2017 at 10:46 AM, nabil Kouici <koui...@yahoo.fr.invalid>
wrote:

> Hi All,
> Is it possible to have a facet date range without specifying the start and
> end of the range?
> Otherwise, is it possible in the same request to set start to the min
> value and end to the max value?
> Thank you.
> Regards,NKI.
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: regarding extending classes in org.apache.solr.client.solrj.io.stream.metrics package

2017-01-12 Thread Scott Stults
Radhakrishnan,

That would be an appropriate Jira ticket. You can submit it here:

https://issues.apache.org/jira/browse/solr

Also, if you want to submit a patch, check out the guidelines (it's pretty
easy):

https://wiki.apache.org/solr/HowToContribute


k/r,
Scott


On Tue, Jan 10, 2017 at 7:12 PM, radha krishnan <dradhakrishna...@gmail.com>
wrote:

>  Hi,
>
> I want to extend the update(Tuple tuple) method in the MaxMetric,
> MinMetric, SumMetric, and MeanMetric classes.
>
> Can you please make the below-mentioned variables and methods in the above
> classes protected so that it will be easy to extend them?
>
> variables
> ---
>
> longMax
>
> doubleMax
>
> columnName
>
>
> and
>
> methods
>
> ---
>
> init
>
>
>
> Thanks,
>
> Radhakrishnan D
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Max length of solr query

2017-01-12 Thread Scott Stults
That doesn't seem like an efficient use of a search engine. Maybe what you
want to do is use streaming expressions to process some data:

https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
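For example, rather than a million-clause Boolean query you could stream the
field out and intersect on the client side, or load the ID list into its own
collection and join. A sketch (collection and field names are placeholders;
the /export handler requires the sort field to have docValues):

search(items,
       q="*:*",
       fl="item_id",
       sort="item_id asc",
       qt="/export")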


k/r,
Scott

On Thu, Jan 12, 2017 at 11:36 AM, 武井宜行 <nta...@sios.com> wrote:

> Hi,all
>
> My application sends very large queries to the Solr server with the SolrJ
> client. (The HTTP method is POST.)
>
> I have two questions.
>
> First, I would like to know the limit on the number of clauses in a
> Boolean query. I know the number is restricted to 1024 by default, and I
> can increase the limit with setMaxClauseCount, but how far can it be
> increased?
>
> Next, if there is no limit on the number of clauses, is there a limit on
> query length? My application sends large queries like this with the SolrJ
> client:
>
> item_id: OR item_id: OR item_id: ...
> (The number of item_id clauses may be over one million)
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: [More Like This] Query building

2016-04-12 Thread Scott Stults
 as the field with the highest document
> >>>>> frequency
> >>>>> > for the term t .
> >>>>> > Then we build the termQuery :
> >>>>> >
> >>>>> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq,
> tf));
> >>>>> >
> >>>>> > In this way we lose a lot of precision.
> >>>>> > Not sure why we do that.
> >>>>> > I would prefer to keep the relation between terms and fields.
> >>>>> > The MLT query can improve a lot the quality.
> >>>>> > If i run the MLT on 2 fields : *description* and *facilities* for
> >>>>> example.
> >>>>> > It is likely I want to find documents with similar terms in the
> >>>>> > description and similar terms in the facilities, without mixing up
> >>>>> the
> >>>>> > things and loosing the semantic of the terms.
> >>>>> >
> >>>>> > Let me know your opinion,
> >>>>> >
> >>>>> > Cheers
> >>>>> >
> >>>>> >
> >>>>> > --
> >>>>> > --
> >>>>> >
> >>>>> > Benedetti Alessandro
> >>>>> > Visiting card : http://about.me/alessandro_benedetti
> >>>>> >
> >>>>> > "Tyger, tyger burning bright
> >>>>> > In the forests of the night,
> >>>>> > What immortal hand or eye
> >>>>> > Could frame thy fearful symmetry?"
> >>>>> >
> >>>>> > William Blake - Songs of Experience -1794 England
> >>>>> >
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Anshum Gupta
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> --
> >>>>
> >>>> Benedetti Alessandro
> >>>> Visiting card : http://about.me/alessandro_benedetti
> >>>>
> >>>> "Tyger, tyger burning bright
> >>>> In the forests of the night,
> >>>> What immortal hand or eye
> >>>> Could frame thy fearful symmetry?"
> >>>>
> >>>> William Blake - Songs of Experience -1794 England
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> --
> >>>
> >>> Benedetti Alessandro
> >>> Visiting card : http://about.me/alessandro_benedetti
> >>>
> >>> "Tyger, tyger burning bright
> >>> In the forests of the night,
> >>> What immortal hand or eye
> >>> Could frame thy fearful symmetry?"
> >>>
> >>> William Blake - Songs of Experience -1794 England
> >>>
> >>
> >>
> >>
> >> --
> >> --
> >>
> >> Benedetti Alessandro
> >> Visiting card : http://about.me/alessandro_benedetti
> >>
> >> "Tyger, tyger burning bright
> >> In the forests of the night,
> >> What immortal hand or eye
> >> Could frame thy fearful symmetry?"
> >>
> >> William Blake - Songs of Experience -1794 England
> >>
> >
> >
> >
> > --
> > --
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Boosts for relevancy (shopping products)

2016-03-19 Thread Scott Stults
You're not going to be able to look at field boosts by themselves to judge
relevancy because it's very much a data-driven optimization problem. For
example, if you only sell iPhone cases but no iPhones, a search for "black
iphone" should show a bunch of black iPhone cases at the top of the
results. But if you do sell iPhones themselves, you'll likely see them rank
low in the results because they typically have names like "Apple iPhone 6s
Plus 64 GB - Black" and your cases just have "iPhone Case - Black". More of
the search terms match the shorter field value and so it scores better.

Approach the problem methodically and collect data. There are several
evaluation metrics that will not only help you quantify the problem but
also gauge how much your tuning efforts have improved things. MRR and DCGS
are good places to start.

https://en.wikipedia.org/wiki/Category:Information_retrieval_evaluation
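
If you want to compute one of those yourself, MRR is only a few lines once you
have judgement lists. A sketch in Java (the ranked results and judged-relevant
sets would come from your own test harness):

import java.util.*;

public class MrrExample {
  // Mean Reciprocal Rank: for each query take 1 / (rank of the first
  // relevant document), then average over all queries.
  static double meanReciprocalRank(List<List<String>> rankedResults,
                                   List<Set<String>> judgedRelevant) {
    double sum = 0.0;
    for (int q = 0; q < rankedResults.size(); q++) {
      Set<String> relevant = judgedRelevant.get(q);
      List<String> results = rankedResults.get(q);
      for (int rank = 0; rank < results.size(); rank++) {
        if (relevant.contains(results.get(rank))) {
          sum += 1.0 / (rank + 1); // ranks are 1-based
          break;
        }
      }
    }
    return sum / rankedResults.size();
  }

  public static void main(String[] args) {
    List<List<String>> results = Arrays.asList(
        Arrays.asList("d3", "d1", "d9"),  // first relevant hit at rank 2
        Arrays.asList("d2", "d4", "d6")); // first relevant hit at rank 1
    List<Set<String>> judged = Arrays.asList(
        new HashSet<>(Arrays.asList("d1")),
        new HashSet<>(Arrays.asList("d2", "d6")));
    System.out.println(meanReciprocalRank(results, judged)); // 0.75
  }
}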

Also take a look at Quepid (full disclosure: my company makes it). It'll
let the business folks rank the results for searches and you'll be able to
do search regression tests against those judgement lists as you tweak
things.


k/r,
Scott

On Thu, Mar 17, 2016 at 4:36 AM, Robert Brown <r...@intelcompute.com> wrote:

> Hi,
>
> I currently have an index of ~50m docs representing shopping products:
> name, description, brand, category, etc.
>
> Our "qf" is currently setup as:
>
> name^5
> brand^2
> category^3
> merchant^2
> description^1
>
> mm: 100%
> ps: 5
>
> I'm getting complaints from the business concerning relevancy, and was
> hoping to get some constructive ideas/thoughts on whether these boosts look
> semi-sensible or not, I think they were put in place pretty much at random.
>
> I know it's going to be a case of rounds upon rounds of testing, but maybe
> there's a good starting point that will save me some time?
>
> My initial thoughts right now are to actually just search on the name
> field, and maybe the brand (for things like "Apple Ipod").
>
> Has anyone got a similar setup that could share some direction?
>
> Many Thanks,
> Rob
>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Multi-lingual search

2016-02-02 Thread Scott Stults
The IndicNormalizationFilter appears to work with Tamil. Is it not working
for you?
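
For example, a bare-bones field type using it might look like this (a sketch --
pick the tokenizer and any extra filters that suit your content):

<fieldType name="text_ta" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.IndicNormalizationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>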


k/r,
Scott

On Mon, Feb 1, 2016 at 8:34 AM, vidya <vidya.nade...@tcs.com> wrote:

> Hi
>
>  My use case is to index and able to query different languages in solr
> which
> are not in-built languages supported by solr. How can i implement this ?
>
> My input document consists of different languages in a field. I came across
> "Solr in action" book with searching content in multiple languages i.e.,
> chapter 14. For built in languages i have implemented this approach. But
> for
> languages like Tamil, how to implement? Do i need to find for filter
> classes
> of that particular language or any libraries in specific.
>
> Please help me on this.
>
> Thanks in advance.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Multi-lingual-search-tp4254398.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Solr+HDFS

2016-02-02 Thread Scott Stults
le=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 4181ms
> INFO  - 2016-01-28 22:16:32.971; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=2 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 65331ms
> INFO  - 2016-01-28 22:17:34.638; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=3 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 126998ms
> INFO  - 2016-01-28 22:18:35.764; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=4 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 188124ms
> INFO  - 2016-01-28 22:19:37.114; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=5 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 249474ms
> INFO  - 2016-01-28 22:20:38.629; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=6 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 310989ms
> INFO  - 2016-01-28 22:21:39.751; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=7 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 372111ms
> INFO  - 2016-01-28 22:22:40.854; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=8 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 433214ms
> INFO  - 2016-01-28 22:23:41.981; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=9 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 494341ms
> INFO  - 2016-01-28 22:24:43.088; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=10 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 555448ms
> INFO  - 2016-01-28 22:25:44.808; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=11 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 617168ms
> INFO  - 2016-01-28 22:26:45.934; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=12 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 678294ms
> INFO  - 2016-01-28 22:27:47.036; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=13 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 739396ms
> INFO  - 2016-01-28 22:28:48.504; [   UNCLASS]
> org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=14 on
>
> file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0282933
> after 800864ms
>
> Some shards in the cluster can take hours to come back up.  Any ideas? It
> appears to wait 900 seconds for each of the tlog files.  When there are 60+
> files - this takes a long time!
> Thank you!
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: upgrade SolrCloud

2016-02-02 Thread Scott Stults
That appears to be the case. If you're apprehensive because you had trouble
upgrading to 5.4.0, there was a bug in that release (fixed in 5.4.1) that
could've bitten you:

https://issues.apache.org/jira/browse/SOLR-8561


k/r,
Scott

On Thu, Jan 28, 2016 at 1:36 PM, Oakley, Craig (NIH/NLM/NCBI) [C] <
craig.oak...@nih.gov> wrote:

> I'm planning to upgrade (from 5.4.0 to 5.4.1) a SolrCloud with two
> replicas (one shard).
>
> Am I correct in thinking I should be able simply to shutdown one node,
> change it to using 5.4.1, restart the upgraded node, shutdown the other
> node and upgrade it? Or are there caveats to consider?
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Scripting server side

2016-02-02 Thread Scott Stults
Are you trying to manipulate the query with a script, or just the response?
If it's the response you want to work with, I think your only options are
using Velocity templates or XSLT. For working with the query you'll either
have to make your own QueryParserPlugin or intercept the request before it
gets to Solr.
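
If you go the QueryParserPlugin route, the skeleton is small. A rough,
untested sketch that lets you inspect or rewrite the raw query string before
delegating to the stock lucene parser:

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class RewritingQParserPlugin extends QParserPlugin {

  public void init(NamedList args) { }

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        // Inspect or rewrite the raw query string here...
        String rewritten = qstr.trim();
        // ...then delegate to the stock lucene parser.
        return subQuery(rewritten, "lucene").getQuery();
      }
    };
  }
}

You'd register it in solrconfig.xml with something like
<queryParser name="myparser" class="com.example.RewritingQParserPlugin"/>
and invoke it with {!myparser} in the query.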


k/r,
Scott

On Sun, Jan 24, 2016 at 6:22 PM, Vincenzo D'Amore <v.dam...@gmail.com>
wrote:

> Hi,
>
> looking at Solr documentation I found a pretty interesting processor which
> is able to execute scripting languages server side.
>
>
> http://lucene.apache.org/solr/5_4_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
>
> As far as I understood, this is useful only during document update.
> I'm just curious to know if there is something else that I can use before
> or during the query execution.
>
> Best regards,
> Vincenzo
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: plugging an analyzer

2016-02-02 Thread Scott Stults
There are a lot of things that can go wrong when you're wiring up a custom
analyzer. I'd first check the simple things:

* Custom jar not in Solr's classpath
* Not using the custom factory in a field type's analysis chain
* Not declaring a field with that type
* Not using that field in a document
* Assuming the tokenizer/filter will be instantiated directly rather than
through the factory interfaces.
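
On those last points: since Lucene 4.x the analysis factories take their
arguments in the constructor rather than through an init() method, so an
old-style init() will never be called. A minimal sketch of a modern factory
(MyFilter and the maxLen parameter are placeholders):

import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class MyFilterFactory extends TokenFilterFactory {
  private final int maxLen;

  public MyFilterFactory(Map<String, String> args) {
    super(args);                               // consumes luceneMatchVersion etc.
    this.maxLen = getInt(args, "maxLen", 255); // your custom parameter
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public TokenStream create(TokenStream input) {
    return new MyFilter(input, maxLen); // MyFilter is your own TokenFilter
  }
}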

Hope that helps!

k/r,
Scott


On Tue, Feb 2, 2016 at 3:04 AM, Roxana Danger <
roxana.dan...@reedonline.co.uk> wrote:

> Hello,
> I would like to use some code embedded on an analyser. The problem is that
> I need to pass some parameters for initializing it. My though was to create
> a plugin and initialize the parameters with the init( Map<String,String>
> args ) or init( NamedList args ) methods as explained in
> http://wiki.apache.org/solr/SolrPlugins.
> But none of these methods are called when the schema is read and the
> analyser constructed. I have also tried implementing the
> ResourceLoaderAware interface, but the inform() method is not called
> either.
> Am I missing something to get my analyser running? When are these init methods
> called, and how can I trigger them? Any suggestion that does not imply
> dividing the code into Tokenizer/Filters?
>
> Thank you very much in advance,
> Roxana
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: How to achieve exact string match query which includes spaces and quotes

2016-01-13 Thread Scott Stults
This might be a good case for the Raw query parser (I haven't used it
myself).

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-RawQueryParser
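
For example (assuming the field is a string type, so values were indexed
verbatim):

q={!raw f=stringField}abc "i am not"

Everything after the closing brace is treated as one term with no analysis and
no escaping needed -- just remember to URL-encode it in a real HTTP request.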


k/r,
Scott

On Wed, Jan 13, 2016 at 12:05 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> what _does_ matter is getting all that through the parser which means
> you have to enclose things in quotes and escape them.
>
> For instance, consider this query  stringFIeld:abc "i am not"
>
> this will get parsed as
> stringField:abc defaultTextField:"i am not".
>
> To get around this you need to make sure the entire search gets
> through the parser as a _single_ token by enclosing in quotes. But
> then of course you have confusion because you have quotes in your
> search term so you need to escape those, something like
> stringField:"abc \"i am not\""
>
> Here's a list for Lucene 5
>
> https://lucene.apache.org/core/5_1_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Escaping_Special_Characters
>
> Best,
> Erick
>
> On Wed, Jan 13, 2016 at 3:39 AM, Binoy Dalal <binoydala...@gmail.com>
> wrote:
> > No.
> >
> > On Wed, 13 Jan 2016, 16:58 Alok Bhandari <
> alokomprakashbhand...@gmail.com>
> > wrote:
> >
> >> Hi Binoy thanks.
> >>
> >> But does it matter which query-parser I use , shall I use "lucene"
> parser
> >> or
> >> "edismax" parser.
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/How-to-achieve-exact-string-match-query-which-includes-spaces-and-quotes-tp4250402p4250405.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> > --
> > Regards,
> > Binoy Dalal
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Permutations of entries in a multivalued field

2015-12-18 Thread Scott Stults
Johannes,

I think your best bet is to create a QParserPlugin that orders the terms of
the incoming query. It sounds like you have control over the way that field
is indexed, so you could enforce the same ordering when the document comes
into Solr. If that's not the case then you'll also want to write an
UpdateRequestProcessor:

https://wiki.apache.org/solr/UpdateRequestProcessor

Using a phrase query is probably not an option since you're probably
working with > 3 terms and phrase slop wouldn't be able to extend past that.
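
On the URP side the processor itself is only a few lines. A rough, untested
sketch -- "path" is a placeholder for your multivalued field, and you'd wrap
this in an UpdateRequestProcessorFactory and add it to your update chain:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class SortPathTermsProcessor extends UpdateRequestProcessor {

  public SortPathTermsProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Collection<Object> values = doc.getFieldValues("path");
    if (values != null) {
      List<String> sorted = new ArrayList<>();
      for (Object v : values) {
        String[] parts = v.toString().split("/");
        Arrays.sort(parts);                   // enforce alphabetical order
        sorted.add(String.join("/", parts));
      }
      doc.setField("path", sorted);
    }
    super.processAdd(cmd);
  }
}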


Hope that helps!
-Scott


On Wed, Dec 16, 2015 at 8:38 AM, Johannes Riedl <
johannes.ri...@uni-tuebingen.de> wrote:

> Hello all,
>
> we are facing the following problem: we use a multivalued string field
> that contains entries of the kind A/B/C/, where A,B,C are terms.
> We are now looking for a simple way to also find all permutations of
> A/B/C, so e.g. B/A/C. As a workaround we added a new field that contains
> all entries alphabetically sorted and guarantee sorting on the user side.
> However - since this is limited in some ways - is there a simple way to
> either index in a way such that solely A/B/C and all permutations are found
> (using e.g. type=text is not an option since a term could occur in a
> different entry of the multivalued field) or trigger an alphabetical
> sorting of incoming queries.
>
> Thanks a lot for your feedback, best regards
>
> Johannes
>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: query to get parents without childs

2015-12-16 Thread Scott Stults
Hi Novin,

How are you associating parents with children? Is it a "children"
multivalued field in the parent record? If so you could query for records
that don't have a value in that field like "-children:[* TO *]"

k/r,
Scott

On Wed, Dec 16, 2015 at 7:29 AM, Novin Novin <toe.al...@gmail.com> wrote:

> Hi guys,
>
> I have a few parent documents indexed without children; what would be the
> query to get those?
>
> Thanks,
> Novin
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Highlighting large documents

2015-12-08 Thread Scott Stults
There are two things going on that you should be aware of. The first is that
Solr highlighting is mainly concerned with putting a representative
snippet in a results listing. There are a couple of configuration changes
you need to make if you want to highlight a whole document, like setting the
fragListBuilder to SingleFragListBuilder and raising the maxAnalyzedChars
setting you've already mentioned:

https://wiki.apache.org/solr/HighlightingParameters#hl.fragsize
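
As a sketch (untested, and the handler and field names are made up), the
defaults for a full-document handler could look like this -- note the
FastVectorHighlighter also needs termVectors, termPositions, and termOffsets
enabled on the field:

<requestHandler name="/fulldoc" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">on</str>
    <str name="hl.fl">content</str>
    <str name="hl.useFastVectorHighlighter">true</str>
    <str name="hl.fragListBuilder">single</str>
    <int name="hl.maxAnalyzedChars">1000000</int>
  </lst>
</requestHandler>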

Because full document highlighting is so different from highlighting
snippets in a result list you'll want to configure two different
highlighters: One for snippets and one for the full document.

The other thing you need to know is that performance in highlighting is an
active area of development. Right now the top docs in the current result
list are calculated completely separately from the snippets (highlighting),
which can lead to problems when the most relevant snippets are later in the
document.

What most people do is compromise by making the result list fast but
inaccurate, and having the full-document highlight be accurate but slower.


Hope that helps,
-Scott


On Fri, Dec 4, 2015 at 11:12 AM, Andrea Gazzarini <a.gazzar...@gmail.com>
wrote:

> No no, sorry, the project is not yet started so I didn't experience your
> issue, but I'll be a careful listener of this thread
>
> Best,
> Andrea
>
> 2015-12-04 17:04 GMT+01:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
>
> > Hi Andrea,
> >
> > I'm using the original highlighter.
> >
> > Below is my configuration for the highlighter in solrconfig.xml
> >
> >   
> >
> >explicit
> >10
> >json
> >true
> >   text
> >   id, title, content_type, last_modified, url, score
> 
> >
> >   on
> >id, title, content, author 
> >   true
> >true
> >html
> >   200
> >   100
> >
> > true
> > signature
> > true
> > 100
> >   
> >   
> >
> >
> > Have you managed to solve the problem?
> >
> > Regards,
> > Edwin
> >
> >
> > On 4 December 2015 at 23:54, Andrea Gazzarini <a.gazzar...@gmail.com>
> > wrote:
> >
> > > Hi Zheng,
> > > just curiousity, because shortly I will have to deal with a similar
> > > scenario (Solr 5.3.1 + large documents + highlighting).
> > > Which highlighter are you using?
> > >
> > > Andrea
> > >
> > > 2015-12-04 16:51 GMT+01:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
> > >
> > > > Hi,
> > > >
> > > > I'm using Solr 5.3.0
> > > >
> > > > I found that in large documents, sometimes I face situation that
> when I
> > > do
> > > > a highlight query, the resultset that is returned does not contain
> the
> > > > highlighted query. There are actually matches in the documents, but
> > just
> > > > that they located further back in the documents.
> > > >
> > > > I have tried to increase the value of the hl.maxAnalyzedChars, as the
> > > > default value is 51200, and I have documents that are much larger
> than
> > > > 51200 characters. Although this method works, but, when I increase
> this
> > > > value, the performance of the search and highlight drops. It can drop
> > > from
> > > > less than 0.5 seconds to more than 10 seconds.
> > > >
> > > > Would like to check, is this method of increasing the value of the
> > > > hl.maxAnalyzedChars the best method to use, or is there other ways
> > which
> > > > can solve the same purpose, but without affecting the performance
> much?
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > >
> >
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Highlighting tag problem

2015-12-07 Thread Scott Stults
I see. There appears to be a gap in what you can match on and what will get
highlighted:

fl: id, title, content_type, last_modified, url, score
hl.fl: id, title, content, author, tag

Unless you override fl or hl.fl in url parameters you can get a hit in
content_type, last_modified, url, or score and those fields will not get
highlighted. Try adding those fields to hl.fl.


k/r,
Scott

On Fri, Dec 4, 2015 at 12:59 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Hi Scott,
>
> No, what's describe in SOLR-8334 is the tag appearing at the result, but at
> the wrong position.
>
> For this problem, the situation is that when I do a highlight query, some
> of the results in the resultset does not contain the search word in  title,
> content_type, last_modified and  url, as specified in my solrconfig.xml
> which I'm posted earlier on, and there is no  tag in those results. So
> I'm not sure why those results are returned.
>
> Regards,
> Edwin
>
>
> On 4 December 2015 at 01:03, Scott Stults <
> sstu...@opensourceconnections.com
> > wrote:
>
> > Edwin,
> >
> > Is this related to what's described in SOLR-8334?
> >
> >
> > k/r,
> > Scott
> >
> > On Thu, Dec 3, 2015 at 5:07 AM, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I'm using Solr 5.3.0.
> > > Would like to find out, during a search, sometimes there is a match in
> > > content, but it is not highlighted (the word is not in the stopword
> > list)?
> > > Did I make any mistakes in my configuration?
> > >
> > > This is my highlighting request handler from solrconfig.xml.
> > >
> > > 
> > > 
> > > explicit
> > > 10
> > > json
> > > true
> > > text
> > > id, title, content_type, last_modified, url, score
> 
> > >
> > > on
> > > id, title, content, author, tag
> > >true
> > > true
> > > html
> > > 200
> > >
> > > true
> > > signature
> > > true
> > > 100
> > > 
> > > 
> > >
> > >
> > > This is my pipeline for the field.
> > >
> > >   > > positionIncrementGap="100">
> > >
> > >
> > >
> > > class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > > segMode="SEARCH"/>
> > >
> > >
> > >
> > >
> > >
> > > > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > >
> > > > > words="stopwords.txt" />
> > >
> > > > > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > >
> > > > > synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
> > >
> > >
> > >
> > > > > maxGramSize="15"/>
> > >
> > >
> > >
> > >
> > >
> > > class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > > segMode="SEARCH"/>
> > >
> > >
> > >
> > >
> > >
> > > > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > >
> > > > > words="stopwords.txt" />
> > >
> > > > > generateWordParts="0" generateNumberParts="0" catenateWords="0"
> > > catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
> > >
> > > > > synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
> > >
> > >
> > >
> > > 
> > >
> > >  
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> >
> >
> >
> > --
> > Scott Stults | Founder & Solutions Architect | OpenSource Connections,
> LLC
> > | 434.409.2780
> > http://www.opensourceconnections.com
> >
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Highlighting tag problem

2015-12-03 Thread Scott Stults
Edwin,

Is this related to what's described in SOLR-8334?


k/r,
Scott

On Thu, Dec 3, 2015 at 5:07 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Hi,
>
> I'm using Solr 5.3.0.
> Would like to find out, during a search, sometimes there is a match in
> content, but it is not highlighted (the word is not in the stopword list)?
> Did I make any mistakes in my configuration?
>
> This is my highlighting request handler from solrconfig.xml.
>
> 
> 
> explicit
> 10
> json
> true
> text
> id, title, content_type, last_modified, url, score 
>
> on
> id, title, content, author, tag
>true
> true
> html
> 200
>
> true
> signature
> true
> 100
> 
> 
>
>
> This is my pipeline for the field.
>
>   positionIncrementGap="100">
>
>
>
> segMode="SEARCH"/>
>
>
>
>
>
> words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>
> words="stopwords.txt" />
>
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>
> synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
>
>
>
> maxGramSize="15"/>
>
>
>
>
>
> segMode="SEARCH"/>
>
>
>
>
>
> words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>
> words="stopwords.txt" />
>
> generateWordParts="0" generateNumberParts="0" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>
> synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
>
>
>
> 
>
>  
>
>
> Regards,
> Edwin
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Different Similarities for the same field

2015-12-02 Thread Scott Stults
I haven't tried this before (overriding default similarity in a custom
SearchComponent), but it looks like it should be possible. In
QueryComponent.process() you can get a hold of the SolrIndexSearcher and
call setSimilarity(). It also looks like this is set only once by default
when the searcher is created, so you may need to set it back to the default
similarity when you're done.
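
A rough, untested sketch of that idea -- MyCustomSimilarity stands in for your
class, and keep in mind the searcher is shared across requests, so this isn't
thread-safe without more care:

import java.io.IOException;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class SwapSimilarityComponent extends SearchComponent {
  // MyCustomSimilarity is assumed to be your own class
  private final Similarity custom = new MyCustomSimilarity();

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // All prepare() calls run before any process(), so this takes effect
    // before QueryComponent runs the search.
    rb.req.getSearcher().setSimilarity(custom);
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Nothing to do here; QueryComponent does the scoring.
  }

  @Override
  public String getDescription() {
    return "Swaps in a custom Similarity for the request";
  }

  public String getSource() {
    return null;
  }
}

Register it as a searchComponent and list it in the handler's first-components
so it sits ahead of the query component.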


k/r,
Scott

On Tue, Nov 24, 2015 at 10:25 AM, Markus, Sascha <sas...@uberresearch.com>
wrote:

> Hi,
> I implemented a Similarity which is based on the DefaultSimilarity changing
> the calculation for the idf.
> To work with this CustomSimilarity and the DefaultSimilarity from our
> application I have one field with the default and a copyfield with my
> similarity.
> Concerning the extra space needed for this field I wonder if there is a way
> to have my similarity or the default one on the SAME field. Because there
> are no differences for the index. E.g. by creating a SearchComponent to
> have something like solr/mySelect for queries with my similarity and the
> usual solr/select for the default similarity?
> How could I achive this, has anybody a hint?
>
> Cheers,
>  Sascha
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-11-23 Thread Scott Stults
quot;true"/>
> >>> > >
> >>> > >
> >>> > >  >>> > > positionIncrementGap="100">
> >>> > > 
> >>> > >  >>> > > segMode="SEARCH"/>
> >>> > > 
> >>> > > 
> >>> > >  >>> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> >>> > >  >>> > > maxGramSize="15"/>
> >>> > > 
> >>> > > 
> >>> > > 
> >>> > >  >>> > > segMode="SEARCH"/>
> >>> > > 
> >>> > > 
> >>> > >  >>> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> >>> > > 
> >>> > > 
> >>> > > 
> >>> > >
> >>> > >
> >>> > > Here's my solrconfig.xml on the highlighting portion:
> >>> > >
> >>> > > 
> >>> > > 
> >>> > > explicit
> >>> > > 10
> >>> > > json
> >>> > > true
> >>> > > text
> >>> > > id, title, content_type, last_modified, url, score
> >>> 
> >>> > >
> >>> > > on
> >>> > > id, title, content, author, tag
> >>> > > true
> >>> > > true
> >>> > > html
> >>> > > 200
> >>> > > true
> >>> > > signature
> >>> > > true
> >>> > > 100
> >>> > > 
> >>> > > 
> >>> > >
> >>> > >  >>> > > class="solr.highlight.BreakIteratorBoundaryScanner">
> >>> > > 
> >>> > > WORD
> >>> > > en
> >>> > > SG
> >>> > > 
> >>> > > 
> >>> > >
> >>> > >
> >>> > > Meanwhile, I'll take a look at the articles too.
> >>> > >
> >>> > > Thank you.
> >>> > >
> >>> > > Regards,
> >>> > > Edwin
> >>> > >
> >>> > >
> >>> > > On 20 October 2015 at 11:32, Scott Chu <scott@udngroup.com
> >>> <+scott@udngroup.com>
> >>> > <+scott@udngroup.com <+scott@udngroup.com>>
> >>> > > <+scott@udngroup.com <+scott@udngroup.com> <+
> >>> scott@udngroup.com <+scott@udngroup.com>>>> wrote:
> >>> > >
> >>> > > > Hi Edwin,
> >>> > > >
> >>> > > > I didn't use Jieba on Chinese (I use only CJK, very fundamental, I
> >>> > > > know) so I didn't experience this problem.
> >>> > > >
> >>> > > > I'd suggest you post your schema.xml so we can see how you define
> >>> your
> >>> >
> >>> > > > content field and the field type it uses?
> >>> > > >
> >>> > > > In the mean time, refer to these articles, maybe the answer or
> >>> > workaround
> >>> > > > can be deducted from them.
> >>> > > >
> >>> > > > https://issues.apache.org/jira/browse/SOLR-3390
> >>> > > >
> >>> > > >
> >>> http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> >>>
> >>> > > >
> >>> > > >
> http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> >>> > > >
> >>> > > > Good luck!
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > > Scott Chu,scott@udngroup.com <+scott@udngroup.com> <+
> >>> scott@udngroup.com <+scott@udngroup.com>> <+
> >>> > scott@udngroup.com <+scott@udngroup.com> <+
> >>> scott@udngroup.com <+scott@udngroup.com>>>
> >>> > > > 2015/10/20
> >>> > > >
> >>> > > > - Original Message -
> >>> > > > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com
> >>> <+edwinye...@gmail.com>
> >>> > <+edwinye...@gmail.com <+edwinye...@gmail.com>>
> >>> > > <+edwinye...@gmail.com <+edwinye...@gmail.com> <+
> >>> edwinye...@gmail.com <+edwinye...@gmail.com>>>>
> >>> > > > *To: *solr-user <solr-user@lucene.apache.org
> >>> <+solr-user@lucene.apache.org>
> >>> > <+solr-user@lucene.apache.org <+solr-user@lucene.apache.org>>
> >>> > > <+solr-user@lucene.apache.org <+solr-user@lucene.apache.org> <+
> >>> solr-user@lucene.apache.org <+solr-user@lucene.apache.org>>>>
> >>> >
> >>> > > > *Date: *2015-10-13, 17:04:29
> >>> > > > *Subject: *Highlighting content field problem when using
> >>> > > > JiebaTokenizerFactory
> >>> > > >
> >>> > > > Hi,
> >>> > > >
> >>> > > > I'm trying to use the JiebaTokenizerFactory to index Chinese
> >>> characters
> >>> > > in
> >>> > > >
> >>> > > > Solr. It works fine with the segmentation when I'm using
> >>> > > > the Analysis function on the Solr Admin UI.
> >>> > > >
> >>> > > > However, when I tried to do the highlighting in Solr, it is not
> >>> > > > highlighting in the correct place. For example, when I search of
> >>> > > 自然環境与企業本身,
> >>> > > > it highlight 認為自然環境与企業本身的
> >>> > > >
> >>> > > > Even when I search for English character like responsibility, it
> >>> > > highlight
> >>> > > >  *responsibilit*y.
> >>> > > >
> >>> > > > Basically, the highlighting goes off by 1 character/space
> >>> consistently.
> >>> > > >
> >>> > > > This problem only happens in content field, and not in any other
> >>> > fields.
> >>> > >
> >>> > > > Does anyone knows what could be causing the issue?
> >>> > > >
> >>> > > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> >>> > > >
> >>> > > >
> >>> > > > Regards,
> >>> > > > Edwin
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Number of fields in qf & fq

2015-11-20 Thread Scott Stults
Steve,

Another thing debugQuery will give you is a breakdown of how much each
field contributed to the final score of each hit. That's going to give you
a nice shopping list of qf fields to weed out.
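
For example (collection and field names are placeholders), the "explain"
section of the response to:

http://localhost:8983/solr/products/select?q=black+iphone&defType=edismax&qf=name^5+description&debugQuery=true

shows exactly how much each matching field added to each document's score.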


k/r,
Scott

On Fri, Nov 20, 2015 at 9:26 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello Steve,
>
> debugQuery=true shows whether it's facets or query, whether it's query
> parsing or searching (prepare vs process), cache statistics can tell about
> its' efficiency; sometimes a problem is obvious from request parameters.
> Simple sampling with jconsole or even by jstack can point on a smoking
> gun.
>
> On Fri, Nov 20, 2015 at 4:08 PM, Steven White <swhite4...@gmail.com>
> wrote:
>
> > Thanks Erick.
> >
> > The 1500 fields is a design that I inherited.  I'm trying to figure out
> why
> > it was done as such and what it will take to fix it.
> >
> > What about my other question: how does one go about debugging performance
> > issues in Solr to find out where time is mostly spent?  How do I know my
> > Solr parameters, such as cache and what have you are set right?  From
> what
> > I see, we are using the defaults off solrconfig.xml.
> >
> > I'm on Solr 5.2
> >
> > Steve
> >
> >
> > On Thu, Nov 19, 2015 at 11:36 PM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> > > An fq is still a single entry in your filterCache so from that
> > > perspective it's the same.
> > >
> > > And to create that entry, you're still using all the underlying fields
> > > to search, so they have to be loaded just like they would be in a q
> > > clause.
> > >
> > > But really, the fundamental question here is why your design even has
> > > 1,500 fields and, more specifically, why you would want to search them
> > > all at once. From a 10,000 ft. view, that's a very suspect design.
> > >
> > > Best,
> > > Erick
> > >
> > > On Thu, Nov 19, 2015 at 4:06 PM, Walter Underwood <
> wun...@wunderwood.org
> > >
> > > wrote:
> > > > The implementation for fq has changed from 4.x to 5.x, so I’ll let
> > > someone else answer that in detail.
> > > >
> > > > In 4.x, the result of each filter query can be cached. After that,
> they
> > > are quite fast.
> > > >
> > > > wunder
> > > > Walter Underwood
> > > > wun...@wunderwood.org
> > > > http://observer.wunderwood.org/  (my blog)
> > > >
> > > >
> > > >> On Nov 19, 2015, at 3:59 PM, Steven White <swhite4...@gmail.com>
> > wrote:
> > > >>
> > > >> Thanks Walter.  I see your point.  Does this apply to fq as will?
> > > >>
> > > >> Also, how does one go about debugging performance issues in Solr to
> > find
> > > >> out where time is mostly spent?
> > > >>
> > > >> Steve
> > > >>
> > > >> On Thu, Nov 19, 2015 at 6:54 PM, Walter Underwood <
> > > wun...@wunderwood.org>
> > > >> wrote:
> > > >>
> > > >>> With one field in qf for a single-term query, Solr is fetching one
> > > posting
> > > >>> list. With 1500 fields, it is fetching 1500 posting lists. It could
> > > easily
> > > >>> be 1500 times slower.
> > > >>>
> > > >>> It might be even slower than that, because we can’t guarantee that:
> > a)
> > > >>> every algorithm in Solr is linear, b) that all those lists will fit
> > in
> > > >>> memory.
> > > >>>
> > > >>> wunder
> > > >>> Walter Underwood
> > > >>> wun...@wunderwood.org
> > > >>> http://observer.wunderwood.org/  (my blog)
> > > >>>
> > > >>>
> > > >>>> On Nov 19, 2015, at 3:46 PM, Steven White <swhite4...@gmail.com>
> > > wrote:
> > > >>>>
> > > >>>> Hi everyone
> > > >>>>
> > > >>>> What is considered too many fields for qf and fq?  On average I
> will
> > > have
> > > >>>> 1500 fields in qf and 100 in fq (all of which are OR'ed).
> Assuming
> > I
> > > can
> > > >>>> (I have to check with the design) for qf, if I cut it down to 1
> > field,
> > > >>> will
> > > >>>> I see noticeable performance improvement?  It will take a lot of
> > > effort
> > > >>> to
> > > >>>> test this which is why I'm asking first.
> > > >>>>
> > > >>>> As is, I'm seeing 2-5 sec response time for searches on an index
> of
> > 1
> > > >>>> million records with total index size (on disk) of 4 GB.  I gave
> > Solr
> > > 2
> > > >>> GB
> > > >>>> of RAM (also tested at 4 GB) in both cases Solr didn't use more
> > then 1
> > > >>> GB.
> > > >>>>
> > > >>>> Thanks in advanced
> > > >>>>
> > > >>>> Steve
> > > >>>
> > > >>>
> > > >
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mkhlud...@griddynamics.com>
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: StringIndexOutOfBoundsException using spellcheck and synonyms

2015-11-16 Thread Scott Stults
ng.Thread.run(Thread.java:722)
>
> Derek
>




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Stopping Solr on Linux when run as a service

2015-11-10 Thread Scott Stults
Steve,

In short, don't worry: it all gets taken care of.

The way services work on Linux is, when the system shuts down it will
basically call "service (servicname) stop" on each service. That calls the
bin/init.d/solr script with a "stop" argument, which in turn calls the
bin/solr script with a "stop" argument (I'm referring to where the files
are in the distribution, not where they get installed).
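
You can try it yourself (assuming you kept the install script's default
service name of "solr"):

sudo service solr status
sudo service solr stop
sudo service solr start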

k/r,
Scott


On Tue, Nov 10, 2015 at 9:40 AM, Steven White <swhite4...@gmail.com> wrote:

> Hi folks,
>
> This question maybe more of a Linux one vs. Solr, but I have to start
> someplace.
>
> I'm reading this link
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production
> to get Solr on Linux (I'm more of a Windows guy).
>
> The page provides good intro on how to setup Solr to start as a service on
> Linux.  Now what I don't get is this: what happens when the system is
> shutting down?  How does Solr knows to shutdown gracefully when there is
> noting on that page talks about issuing a "stop" command on system
> shutdown?  Can someone shed some light on this?  Like I said, I'm more of a
> "Windows" guy.
>
> Thanks in advanced!!
>
> Steve
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Solr Search: Access Control / Role based security

2015-11-09 Thread Scott Stults
Susheel,

This is perfectly fine for simple use-cases and has the benefit that the
filterCache will help things stay nice and speedy. Apache ManifoldCF goes a
bit further and ties back to your authentication and authorization
mechanism:

http://manifoldcf.apache.org/release/trunk/en_US/concepts.html#ManifoldCF+security+model
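
Assuming a multivalued "roles" field and a convention that public documents
are tagged "all", the appended filter for a worker would look like:

fq=roles:(worker OR all)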


k/r,
Scott

On Thu, Nov 5, 2015 at 2:26 PM, Susheel Kumar <susheel2...@gmail.com> wrote:

> Hi,
>
> I have seen couple of use cases / need where we want to restrict result of
> search based on role of a user.  For e.g.
>
> - if user role is admin, any document from the search result will be
> returned
> - if user role is manager, only documents intended for managers will be
> returned
> - if user role is worker, only documents intended for workers will be
> returned
>
> Typical practise is to tag the documents with the roles (using a
> multi-valued field) during indexing and then during search append filter
> query to restrict result based on roles.
>
> Wondering if there is any other better way out there and if this common
> requirement should be added as a Solr feature/plugin.
>
> The current security plugins are more towards making Solr apis/resources
> secure not towards securing/controlling data during search.
>
> https://cwiki.apache.org/confluence/display/solr/Authentication+and+Authorization+Plugins
>
>
> Please share your thoughts.
>
> Thanks,
> Susheel
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Securing field level access permission by filtering the query itself

2015-11-05 Thread Scott Stults
Good to hear! Depending on how far you want to take it, you can then scan
the initial request coming in from the client (and the final response) for
raw Solr field names -- those shouldn't appear in either. I've used
mod_security as a general-purpose application firewall and would recommend it.

k/r,
Scott

On Wed, Nov 4, 2015 at 1:40 PM, Douglas McGilvray <d...@weemondo.com> wrote:

>
> Thanks Alessandro, I had overlooked the highlighting component.
>
> I will also add a reminder to exclude these fields from spellcheck fields,
> (or maintain different spellcheck fields for different roles).
>
> @Scott - Once I started planning my code the penny finally dropped
> regarding your point about aliasing the fields - it removes the need for
> calculating which fields to request in the app itself.
>
> Regards,
> D
>
>
> > On 4 Nov 2015, at 14:53, Alessandro Benedetti <abenede...@apache.org>
> wrote:
> >
> > Of course it depends of all the query parameter you use and you process
> in
> > the response.
> > The list you wrote should be ok if you use only those components.
> >
> > For example if you use highlight, it's not ok and you need to take care
> of
> > the highlighted fields as well.
> >
> > Cheers
> >
> > On 30 October 2015 at 14:51, Douglas McGilvray <d...@weemondo.com> wrote:
> >
> >>
> >> Scott thanks for the reply. I like the idea of mapping all the
> fieldnames
> >> internally, adding security through obscurity. My question therefore
> would
> >> be what is the definitive list of query parameters that one must filter
> to
> >> ensure a particular field is not exposed in the query response? Am I
> >> missing in the following?
> >>
> >> fl
> >> facect.field
> >> facet.pivot
> >> json.facet
> >> terms.fl
> >>
> >>
> >> kr
> >> Douglas
> >>
> >>
> >>> On 30 Oct 2015, at 07:37, Scott Stults <
> >> sstu...@opensourceconnections.com> wrote:
> >>>
> >>> Douglas,
> >>>
> >>> Managing a per-user-group whitelist of fields outside of Solr seems the
> >>> best approach. When the query comes in you can then filter out any
> fields
> >>> not contained in the whitelist before you send the request to Solr. The
> >>> easy part will be to do that on URL parameters like fl. Depending on
> how
> >>> your app generates the actual query string, you may want to also scan
> >> that
> >>> for fielded query clauses (eg "badfield:value") and localParams (eg
> >>> "{!dismax qf=badfield}value").
> >>>
> >>> Secondly, you can map internal Solr fields to aliases using this syntax
> >> in
> >>> the fl parameter: "display_name:real_solr_name". So when the request
> >> comes
> >>> in from your app, first you'll map from the requested field alias names
> >> to
> >>> internal Solr names (while enforcing the whitelist), and then in the fl
> >>> parameter supply the aliases you want sent in the response.
> >>>
> >>>
> >>> k/r,
> >>> Scott
> >>>
> >>> On Wed, Oct 28, 2015 at 6:58 PM, Douglas McGilvray <d...@weemondo.com>
> >> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> First I’d like to say the nested facets and the json facet api in
> >>>> particular have made my world much better, I thank everyone involved,
> >> you
> >>>> are all awesome.
> >>>>
> >>>> In my implementation has much of the solr query building working on
> the
> >>>> browser, solr is behind a php server which acts as “proxy” and
> doorman,
> >>>> filtering at the document level according to user role and supplying
> >> some
> >>>> sensible maximums …
> >>>>
> >>>> However we now wish to filter just one or two potentially sensitive
> >> fields
> >>>> in one document type according to user role (as determined in the php
> >>>> proxy). Duplicating documents (or cores) seems like overkill for just
> >> two
> >>>> fields in one document type .. I wondered if it would be feasible (in
> >> the
> >>>> interests of preventing malicious activity) to filter the query itself
> >>>> whether it be parameters (fl, facet.fields, terms, etc) … or even deny
> >> any
> >>>> request in which fieldname occurs …
> >>>>
> >>>> Is there some way someone might obscure a fieldname in a request?
> >>>>
> >>>> Kind Regards & thanks in advance,
> >>>> Douglas
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Scott Stults | Founder & Solutions Architect | OpenSource Connections,
> >> LLC
> >>> | 434.409.2780
> >>> http://www.opensourceconnections.com
> >>
> >>
> >
> >
> > --
> > --
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Securing field level access permission by filtering the query itself

2015-10-30 Thread Scott Stults
Douglas,

Managing a per-user-group whitelist of fields outside of Solr seems the
best approach. When the query comes in you can then filter out any fields
not contained in the whitelist before you send the request to Solr. The
easy part will be to do that on URL parameters like fl. Depending on how
your app generates the actual query string, you may want to also scan that
for fielded query clauses (eg "badfield:value") and localParams (eg
"{!dismax qf=badfield}value").

Secondly, you can map internal Solr fields to aliases using this syntax in
the fl parameter: "display_name:real_solr_name". So when the request comes
in from your app, first you'll map from the requested field alias names to
internal Solr names (while enforcing the whitelist), and then in the fl
parameter supply the aliases you want sent in the response.
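
For example (made-up names), if the internal fields are prod_name_s and
price_f but you only want to expose "name" and "price":

fl=name:prod_name_s,price:price_f

The response documents then come back with "name" and "price" keys instead of
the internal field names.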


k/r,
Scott

On Wed, Oct 28, 2015 at 6:58 PM, Douglas McGilvray <d...@weemondo.com> wrote:

> Hi all,
>
> First I’d like to say the nested facets and the json facet api in
> particular have made my world much better, I thank everyone involved, you
> are all awesome.
>
> In my implementation has much of the solr query building working on the
> browser, solr is behind a php server which acts as “proxy” and doorman,
> filtering at the document level according to user role and supplying some
> sensible maximums …
>
> However we now wish to filter just one or two potentially sensitive fields
> in one document type according to user role (as determined in the php
> proxy). Duplicating documents (or cores) seems like overkill for just two
> fields in one document type .. I wondered if it would be feasible (in the
> interests of preventing malicious activity) to filter the query itself
> whether it be parameters (fl, facet.fields, terms, etc) … or even deny any
> request in which fieldname occurs …
>
> Is there some way someone might obscure a fieldname in a request?
>
> Kind Regards & thanks in advance,
> Douglas




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Question on index time de-duplication

2015-10-30 Thread Scott Stults
At the top of the De-Duplication wiki page is a note about collapsing
results. Once you have the signature (identical for each of the duplicates)
you'll want to collapse your results, keeping the one with max date.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
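
Assuming the signature lands in a field called "signature" and you also index
a numeric release year, the filter would look something like:

fq={!collapse field=signature max=release_year}

That keeps only the document with the highest release year in each signature
group.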


k/r,
Scott

On Thu, Oct 29, 2015 at 11:59 PM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Yes, you can try to use the SignatureUpdateProcessorFactory to do a hashing
> of the content to a signature field, and group the signature field during
> your search.
>
> You can find more information here:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> I have been using this method to group the index with duplicated content,
> and it is working fine.
>
> Regards,
> Edwin
>
>
> On 30 October 2015 at 07:20, Shamik Bandopadhyay <sham...@gmail.com>
> wrote:
>
> > Hi,
> >
> >   I'm looking to customizing index time de-duplication. Here's my use
> case
> > and what I'm trying to achieve.
> >
> > I've identical documents coming from different release year of a given
> > product. I need to index them in Solr as they are required in individual
> > year context. But there's a generic search which spans across all the
> years
> > and hence bring back duplicate/identical content. My goal is to only
> return
> > the latest document and filter out the rest. For e.g. if product A has
> > identical documents for 2015, 2014 and 2013, search should only return
> 2015
> > (latest document) and filter out the rest.
> >
> > What I'm thinking (if possible) during index time :
> >
> > Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
> > 2014 content, keeping 2015 (the latest release) untouched. During query
> > time, I'll add a filter which will exclude contents tagged with "dedup".
> >
> > Just wondering if this is achievable by perhaps extending
> > UpdateRequestProcessorFactory or
> > customizing SignatureUpdateProcessorFactory ?
> >
> > Any pointers will be appreciated.
> >
> > Regards,
> > Shamik
> >
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Solr collection alias - how rank is affected

2015-10-27 Thread Scott Stults
Collection statistics aren't shared between collections, so there's going
to be a difference. However, if the distribution is fairly random you won't
notice.

On Tue, Oct 27, 2015 at 3:21 PM, SolrUser1543 <osta...@gmail.com> wrote:

> How is document ranking is affected when using a collection alias for
> searching on two collections with same schema ? is it affected at all  ?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-collection-alias-how-rank-is-affected-tp4236776.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Does docValues impact termfreq ?

2015-10-26 Thread Scott Stults
;>>>>>>>>>
> >>>>>>>>> wrote:
> >>>>>
> >>>>>> If you mean using the term frequency function query, then
> >>>>>>>>>>>
> >>>>>>>>>> I'm
> >>>
> >>>> not
> >>>>>
> >>>>>> sure
> >>>>>>>
> >>>>>>>> there's a huge amount you can do to improve performance.
> >>>>>>>>>>>
> >>>>>>>>>>> The term frequency is a number that is used often, so it is
> >>>>>>>>>>>
> >>>>>>>>>> stored
> >>>>>
> >>>>>> in
> >>>>>>>
> >>>>>>>> the index pre-calculated. Perhaps, if your data is not
> >>>>>>>>>>>
> >>>>>>>>>> changing,
> >>>>>
> >>>>>> optimising your index would reduce it to one segment, and
> >>>>>>>>>>>
> >>>>>>>>>> thus
> >>>
> >>>> might
> >>>>>>>
> >>>>>>>> ever so slightly speed the aggregation of term frequencies,
> >>>>>>>>>>>
> >>>>>>>>>> but I
> >>>>>
> >>>>>> doubt
> >>>>>>>
> >>>>>>>> it'd make enough difference to make it worth doing.
> >>>>>>>>>>>
> >>>>>>>>>>> Upayavira
> >>>>>>>>>>>
> >>>>>>>>>>> On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks, Jack. I did some more research and found similar
> >>>>>>>>>>>>
> >>>>>>>>>>> results.
> >>>>>
> >>>>>> In our application, we are making multiple (think: 50)
> >>>>>>>>>>>>
> >>>>>>>>>>> concurrent
> >>>>>
> >>>>>> requests
> >>>>>>>>>>>> to calculate term frequency on a set of documents in
> >>>>>>>>>>>>
> >>>>>>>>>>> "real-time". The
> >>>>>>>
> >>>>>>>> faster that results return, the better.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Most of these requests are unique, so cache only helps
> >>>>>>>>>>>>
> >>>>>>>>>>> slightly.
> >>>>>
> >>>>>> This analysis is happening on a single solr instance.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Other than moving to solr cloud and splitting out the
> >>>>>>>>>>>>
> >>>>>>>>>>> processing
> >>>>>
> >>>>>> onto
> >>>>>>>
> >>>>>>>> multiple servers, do you have any suggestions for what
> >>>>>>>>>>>>
> >>>>>>>>>>> might
> >>>
> >>>> speed up
> >>>>>>>
> >>>>>>>> termfreq at query time?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Aki
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky
> >>>>>>>>>>>> <jack.krupan...@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Term frequency applies only to the indexed terms of a
> >>>>>>>>>>>>>
> >>>>>>>>>>>> tokenized
> >>>>>
> >>>>>> field.
> >>>>>>>>>>
> >>>>>>>>>>> DocValues is really just a copy of the original source
> >>>>>>>>>>>>>
> >>>>>>>>>>>> text
> >>>
> >>>> and is
> >>>>>>>
> >>>>>>>> not
> >>>>>>>>>>
> >>>>>>>>>>> tokenized into terms.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Maybe you could explain how exactly you are using term
> >>>>>>>>>>>>>
> >>>>>>>>>>>> frequency in
> >>>>>>>
> >>>>>>>> function queries. More importantly, what is so "heavy"
> >>>>>>>>>>>>>
> >>>>>>>>>>>> about
> >>>>>
> >>>>>> your
> >>>>>>>
> >>>>>>>> usage?
> >>>>>>>>>>>
> >>>>>>>>>>>> Generally, moderate use of a feature is much more
> >>>>>>>>>>>>>
> >>>>>>>>>>>> advisable to
> >>>>>
> >>>>>> heavy
> >>>>>>>>>
> >>>>>>>>>> usage,
> >>>>>>>>>>>
> >>>>>>>>>>>> unless you don't care about performance.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -- Jack Krupansky
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <
> >>>>>>>>>>>>>
> >>>>>>>>>>>> a...@marketmuse.com>
> >>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In our solr application, we use a Function Query
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> (termfreq)
> >>>>>
> >>>>>> very
> >>>>>>>
> >>>>>>>> heavily.
> >>>>>>>>>>>
> >>>>>>>>>>>> Index time and disk space are not important, but
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> we're
> >>>
> >>>> looking to
> >>>>>>>
> >>>>>>>> improve
> >>>>>>>>>>>
> >>>>>>>>>>>> performance on termfreq at query time.
> >>>>>>>>>>>>>> I've been reading up on docValues. Would this be a
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> way to
> >>>
> >>>> improve
> >>>>>>>
> >>>>>>>> performance?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I had read that Lucene uses Field Cache for Function
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> Queries, so
> >>>>>>>
> >>>>>>>> performance may not be affected.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> And, any general suggestions for improving query
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> performance
> >>>>>
> >>>>>> on
> >>>>>>>
> >>>>>>>> Function
> >>>>>>>>>>>
> >>>>>>>>>>>> Queries?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Aki
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> > <
> https://t.yesware.com/tl/506312808dab13214164f92fbcf5714d3ce38c6b/92f5492fd055692ff7f03b2888be3b50/7a8fd1f72b93af5d79583420b3483a7d?ytl=http%3A%2F%2Fsematext.com%2F
> >
> >
> >
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Highlight with NGram and German S Sharp "ß"

2015-10-20 Thread Scott Stults
Yep, I misunderstood the problem.

The multiple tokens at the same offset might be messing things up. One
thing you can do is copyField to a field that doesn't have n-grams and do
something like f.textng.hl.alternateField= in your solrconfig. That'll use
the other field during highlighting. Yeah, that'll increase your index size
on disk.
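
A sketch of that setup, with made-up field names ("textng" is the n-gram
field, "textplain" an identical field minus the n-gram filter):

<copyField source="textng" dest="textplain"/>

and then the per-field highlighting parameters:

f.textng.hl.alternateField=textplain
f.textng.hl.maxAlternateFieldLength=200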



On Fri, Oct 16, 2015 at 10:07 AM, Jérôme Bernardes <
jerome.bernar...@mappy.com> wrote:

> Thanks for your reply Scott.
>
> I tried
>
> bs.language=de&bs.country=de
>
> Unfortunately the problem still occurs.
> I have just discovered that the problem does not only affect "ß" but also
> "æ" (which is mapped to "ae"
> at query and index time)
> q=hae   -->   hæna
> So it seems to me that the problem is related to any single character that
> is map to several characters using  class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>
> Jérôme
>
>
> Le 13/10/2015 07:46, Scott Stults a écrit :
>
>> My guess is that the boundary scanner isn't configured right for your
>> highlighter. Try setting the bs.language and bs.country parameters either
>> in your request or in the requestHandler.
>>
>>
>> k/r,
>> Scott
>>
>> On Mon, Oct 5, 2015 at 4:57 AM, Jérôme Bernardes <
>> jerome.bernar...@mappy.com
>>
>>> wrote:
>>> Dear Solr Users,
>>> I am facing a problem with highligting on ngram fields.
>>> Highlighting is working well, except for words with german character
>>> "ß".
>>> Eg : with q=rosen&
>>> "highlighting": {
>>>  "gcl3r:12723710:6643": {
>>>  "textng": [
>>>  "Rosensteinpark (Métro), Stuttgart (Allemagne)"
>>>  ]
>>>  },
>>>  "gcl3r:2267495:780930": {
>>>  "textng": [
>>>  "Rosenstraße, 94554 Moos (Allemagne)"
>>>  ]
>>>  }
>>>  }
>>> Without "ß", words are highlighted partially (Rosensteinpark), but
>>> with "ß" the whole word is highlighted (Rosenstraße).
>>>
>>> -
>>> This characters ß is mapped to "ss" at query and index time (using
>>> >> mapping="mapping-ISOLatin1Accent.txt"/>
>>>
>>> )
>>> .
>>> Here the schema.xml for the highlighted field.
>>> 
>>>
>>>  >> mapping="mapping-ISOLatin1Accent.txt"/>
>>>  
>>>  >> pattern="[\s,;:
>>> \-\']"/>
>>>  >>  splitOnNumerics="0"
>>>  generateWordParts="1"
>>>  generateNumberParts="1"
>>>  catenateWords="0"
>>>  catenateNumbers="0"
>>>  catenateAll="0"
>>>  splitOnCaseChange="1"
>>>  preserveOriginal="1"
>>>  types="wdfftypes.txt"
>>>  />
>>>  
>>>  >> ignoreCase="true" expand="true"/>
>>>  >> minGramSize="1"/>
>>>  
>>>
>>>
>>>  >> mapping="mapping-ISOLatin1Accent.txt"/>
>>>  
>>>  >> pattern="[\s,;:
>>> \-\']"/>
>>>  >>  splitOnNumerics="0"
>>>  generateWordParts="1"
>>>  generateNumberParts="0"
>>>  catenateWords="0"
>>>  catenateNumbers="0"
>>>  catenateAll="0"
>>>  splitOnCaseChange="0"
>>>  preserveOriginal="1"
>>>  types="wdfftypes.txt"
>>>  />
>>>  
>>>  
>>>  >> pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>>>
>>> 
>>>
>>> Is it a problem in our configuration or a known bug ?
>>> Regards
>>> Jérôme
>>>
>>>
>>>
>>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-19 Thread Scott Stults
Edwin,

Try setting hl.bs.language and hl.bs.country in your request or
requestHandler:

https://cwiki.apache.org/confluence/display/solr/FastVector+Highlighter#FastVectorHighlighter-UsingBoundaryScannerswiththeFastVectorHighlighter
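
Untested, but for Chinese I'd start with something like:

  hl=true&hl.fl=content&hl.useFastVectorHighlighter=true
  &hl.boundaryScanner=breakIterator&hl.bs.type=WORD&hl.bs.language=zh&hl.bs.country=CN

(The content field needs termVectors="true", termPositions="true", and
termOffsets="true" for the FastVector highlighter to work.)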


-Scott

On Tue, Oct 13, 2015 at 5:04 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Hi,
>
> I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
> Solr. It works fine with the segmentation when I'm using
> the Analysis function on the Solr Admin UI.
>
> However, when I tried to do the highlighting in Solr, it is not
> highlighting in the correct place. For example, when I search for 自然环境与企业本身,
> it highlights 认为自然环境与企业本身的.
>
> Even when I search for an English word like responsibility, it highlights
> *responsibilit*y.
>
> Basically, the highlighting goes off by 1 character/space consistently.
>
> This problem only happens in content field, and not in any other fields.
> Does anyone knows what could be causing the issue?
>
> I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>
>
> Regards,
> Edwin
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Autostart Zookeeper and Solr using scripting

2015-10-19 Thread Scott Stults
Hi Adrian,

I'd probably start with the expect command and "echo ruok | nc <host> <port>"
for a simple script. You might also want to try the Netflix Exhibitor REST
interface:

https://github.com/Netflix/exhibitor/wiki/REST-Cluster
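
A bare-bones version of what I mean (hosts and ports made up -- adjust for
your ensemble):

  #!/bin/bash
  # wait until every ZooKeeper node answers "imok", then start Solr
  for hp in zk1:2181 zk2:2181 zk3:2181; do
    until [ "$(echo ruok | nc ${hp%:*} ${hp#*:})" = "imok" ]; do
      sleep 2
    done
  done
  bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181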


k/r,
Scott

On Thu, Oct 15, 2015 at 2:01 AM, Adrian Liew <adrian.l...@avanade.com>
wrote:

> Hi,
>
> I am trying to implement some scripting to detect if all Zookeepers have
> started in a cluster, then restart the solr servers. Has anyone achieved
> this yet through scripting?
>
> I also saw there is the ZookeeperClient that is available in .NET via a
> nuget package. Not sure if this could be also implemented to check if a
> zookeeper is running.
>
> Any thoughts on anyone using a script to perform this?
>
> Regards,
> Adrian
>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Highlight with NGram and German S Sharp "ß"

2015-10-12 Thread Scott Stults
My guess is that the boundary scanner isn't configured right for your
highlighter. Try setting the hl.bs.language and hl.bs.country parameters
either in your request or in the requestHandler.
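
If you're on the FastVector highlighter, that would look something like
this (the values are my guess for your case):

  hl.useFastVectorHighlighter=true&hl.boundaryScanner=breakIterator
  &hl.bs.type=WORD&hl.bs.language=de&hl.bs.country=DE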


k/r,
Scott

On Mon, Oct 5, 2015 at 4:57 AM, Jérôme Bernardes <jerome.bernar...@mappy.com
> wrote:

> Dear Solr Users,
> I am facing a problem with highlighting on ngram fields.
> Highlighting is working well, except for words with the German character
> "ß".
> Eg: with q=rosen&
> "highlighting": {
>     "gcl3r:12723710:6643": {
>         "textng": ["<em>Rosen</em>steinpark (Métro), Stuttgart (Allemagne)"]
>     },
>     "gcl3r:2267495:780930": {
>         "textng": ["<em>Rosenstraße</em>, 94554 Moos (Allemagne)"]
>     }
> }
> Without "ß", words are highlighted partially (<em>Rosen</em>steinpark) but
> with "ß" the whole word is highlighted (<em>Rosenstraße</em>).
>
> The character ß is mapped to "ss" at query and index time (using
> <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>).
>
> Here is the schema.xml for the highlighted field (element names below are
> reconstructed from their attributes -- the archive stripped the original
> tags -- and elided values are left as "..."):
>
> <fieldType name="..." class="solr.TextField">
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;:\-\']"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnNumerics="0"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0"
>             splitOnCaseChange="1"
>             preserveOriginal="1"
>             types="wdfftypes.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="..."
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="1"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;:\-\']"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnNumerics="0"
>             generateWordParts="1"
>             generateNumberParts="0"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0"
>             splitOnCaseChange="0"
>             preserveOriginal="1"
>             types="wdfftypes.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>   </analyzer>
> </fieldType>
>
> Is it a problem in our configuration or a known bug?
> Regards
> Jérôme
>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: are there any SolrCloud supervisors?

2015-10-12 Thread Scott Stults
Something like Exhibitor for Zookeeper? Very cool! Don't worry too much
about cleaning up the repo. When it comes time to integrate it with Solr or
make it an Apache top-level project you can start with a fresh commit
history :)


-Scott

On Fri, Oct 2, 2015 at 3:09 PM, r b <chopf...@gmail.com> wrote:

> I've been working on something that just monitors ZooKeeper to add and
> remove nodes from collections. the use case being I put SolrCloud in
> an autoscaling group on EC2 and as instances go up and down, I need
> them added to the collection. It's something I've built for work and
> could clean up to share on GitHub if there is much interest.
>
> I asked in the IRC about a SolrCloud supervisor utility but wanted to
> extend that question to this list. are there any more "full featured"
> supervisors out there?
>
>
> -renning
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Selective field query

2015-10-12 Thread Scott Stults
Colin,

The other thing you'll want to keep in mind (and you'll find this out with
debugQuery) is that the query parser is going to take your
ServiceName:(Search Service) and turn it into two queries --
ServiceName:(Search) ServiceName:(Service). That's because the query parser
breaks on whitespace. My bet is you have a lot of entries with a name of "X
Service" and the second part of your query is hitting them. Phrase Field
might be your friend here:

https://wiki.apache.org/solr/ExtendedDisMax#pf_.28Phrase_Fields.29
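
For example, something like this (untested, assuming the edismax parser):

  q=Search Service&defType=edismax&qf=ServiceName&pf=ServiceName

qf gets you matches on the individual terms, and pf boosts the docs where
"Search Service" appears in ServiceName as a phrase, so the exact service
floats to the top.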


-Scott

On Mon, Oct 12, 2015 at 4:15 AM, Colin Hunter <greenfi...@gmail.com> wrote:

> Thanks Erick, I'm sure this will be valuable in implementing ngram filter
> factory
>
> On Fri, Oct 9, 2015 at 4:38 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Colin:
> >
> > Adding debug=all to your query is your friend here, the
> > parsed_query.toString will show you exactly what
> > is searched against.
> >
> > Best,
> > Erick
> >
> > On Fri, Oct 9, 2015 at 2:09 AM, Colin Hunter <greenfi...@gmail.com>
> wrote:
> > > Ah ha...   the copy field...  makes sense.
> > > Thank You.
> > >
> > > On Fri, Oct 9, 2015 at 10:04 AM, Upayavira <u...@odoko.co.uk> wrote:
> > >
> > >>
> > >>
> > >> On Fri, Oct 9, 2015, at 09:54 AM, Colin Hunter wrote:
> > >> > Hi
> > >> >
> > >> > I am working on a complex search utility with an index created via
> > data
> > >> > import from an extensive MySQL database.
> > >> > There are many ways in which the index is searched. One of the
> utility
> > >> > input fields searches only on a Service Name. However, if I target
> the
> > >> > query as q=ServiceName:"Searched service", this only returns an
> exact
> > >> > string match. If q=Searched Service, the query still returns results
> > from
> > >> > all indexed data.
> > >> >
> > >> > Is there a way to construct a query to only return results from one
> > field
> > >> > of a doc ?
> > >> > I have tried setting index=false, stored=true on unwanted fields,
> but
> > >> > these
> > >> > appear to have still been returned in results.
> > >>
> > >> q=ServiceName:(Searched Service)
> > >>
> > >> That'll look in just one field.
> > >>
> > >> Remember changing indexed to false doesn't impact the stuff already in
> > >> your index. And the reason you are likely getting all that stuff is
> > >> because you have a copyField that copies it over into the 'text'
> field.
> > >> If you'll never want to search on some fields, switch them to
> > >> index=false, make sure you aren't doing a copyField on them, and then
> > >> reindex.
> > >>
> > >> Upayavira
> > >>
> > >
> > >
> > >
> > > --
> > > www.gfc.uk.net
> >
>
>
>
> --
> www.gfc.uk.net
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Why is Process Total Time greater than Elapsed Time?

2015-09-03 Thread Scott Stults
Thanks Hoss, sorry I wasn't clear. By Process Total Time I mean this
structure in the debug response:

debug
  timing
    process
      time

Elapsed time is what I get from SolrJ's API:
 SolrClient.query().getElapsedTime().

So I really expect elapsed time to be the greatest duration of all values.
Do you know why that's not the case?


Thank you,
Scott

On Thu, Sep 3, 2015 at 4:41 PM, Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> depends on where you are reading "Process Total Time" from.  that
> terminology isn't something i've ever seen used in the context of solr
> (fairly certain nothing in solr refers to anything that way)
>
> QTime is the amount of time spent processing a request before it starts
> being written out over the wire to the client, so it is almost guaranteed
> to be *less* than the total elapsed (wall clock) time witnessed by your
> solrJ client ... but i have no idea what "Process Total Time" is if you
> are seeing it greater then wall clock.
>
> : From what I can tell, each component processes the request sequentially.
> So
> : how can I see an Elapsed Time of 750ms (SolrJ client) and a Process Total
> : Time of 1300ms? Does the Process Total Time add up the amount of time
> each
> : leaf reader takes, or some other concurrent things?
>
>
> -Hoss
> http://www.lucidworks.com/
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Why is Process Total Time greater than Elapsed Time?

2015-09-03 Thread Scott Stults
From what I can tell, each component processes the request sequentially. So
how can I see an Elapsed Time of 750ms (SolrJ client) and a Process Total
Time of 1300ms? Does the Process Total Time add up the amount of time each
leaf reader takes, or some other concurrent things?

Thank you,
Scott


Re: Solr packages in Apache BigTop.

2015-03-09 Thread Scott Stults
Jay,

This is music to my ears. I've used the bigtop packages and would love to
see the Solr portion of them keep pace with releases.

Let me know where to start!


Thank you,
Scott

On Sat, Mar 7, 2015 at 5:03 PM, jay vyas jayunit100.apa...@gmail.com
wrote:

 Hi Solr.

 I work on the apache bigtop project, and am interested in integrating it
 deeper with Solr, for example, for testing spark / solr integration cases.

 Is anyone in the Solr community interested in collaborating on testing
 releases with us and maintaining Solr packaging in bigtop (with our help of
 course) ?

 The advantage here is that we can synergize efforts:  When new SOLR
 releases come out, we can test them in bigtop to guarantee that there are
 rpm/deb packages which work well with the hadoop ecosystem.

 For those that don't know, bigtop is the upstream apache bigdata packaging
 project, we build hadoop, spark, solr, hbase and so on in rpm/deb format,
 and supply puppet provisioners along with vagrant recipes for testing.

 --
 jay vyas




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Multi words query

2015-02-13 Thread Scott Stults
 A couple more things would help debug this. First, could you grab the
specific Solr log entry when this query is sent? Also, have you changed the
default schema at all? If you're querying string fields you have to
exactly match what's indexed there, versus text which gets tokenized.
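
To illustrate with a couple of hypothetical fields:

  <field name="title_s" type="string" indexed="true" stored="true"/>
  <field name="title_t" type="text_general" indexed="true" stored="true"/>

title_s:"multi words query" only matches the exact indexed value, while
title_t:(multi words query) matches on the individual tokens.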


k/r,
Scott

On Thu, Feb 12, 2015 at 4:22 AM, melb melaggo...@gmail.com wrote:

 I am using the ruby gem rsolr and simply querying the collection with
 this query:

 response = solr.get 'select', :params => {
   :q => query,
   :fl => 'id,title,description,body',
   :rows => 10
 }

 response['response']['docs'].each { |doc| puts doc['id'] }

 I created a text field to copy all the fields to, and the query handler
 requests this field.

 rgds,



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Multi-words-query-tp4185625p4185922.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: bulk indexing with optimistick lock

2015-02-13 Thread Scott Stults
This isn't a Solr-specific answer, but the easiest approach might be to
just collect the document IDs you're about to add, query for them, and then
filter out the ones Solr already has (this'll give you a nice list for
later reporting). You'll need to keep your batch sizes below
maxBooleanClauses in solrconfig.xml.
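
A rough SolrJ sketch of what I mean (untested; assumes a SolrClient named
"client", a batch of SolrInputDocuments, "id" as the uniqueKey, and
commons-lang on the classpath for StringUtils):

  // ids of the docs we're about to send
  Set<String> ids = new HashSet<String>();
  for (SolrInputDocument doc : batch) {
    ids.add((String) doc.getFieldValue("id"));
  }

  // ask Solr which of those already exist -- keep the batch size under
  // maxBooleanClauses so this query stays legal
  SolrQuery q = new SolrQuery("id:(" + StringUtils.join(ids, " OR ") + ")");
  q.setFields("id");
  q.setRows(ids.size());
  for (SolrDocument found : client.query(q).getResults()) {
    ids.remove((String) found.getFieldValue("id")); // report these later
  }

  // anything still in "ids" can be added without a version conflict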

Overall, this might be simpler to maintain and less prone to bugs.

k/r,
Scott

On Wed, Feb 11, 2015 at 4:59 AM, Sankalp Gupta sankalp.gu...@snapdeal.com
wrote:

 Hi All,
 On the server side we collect multiple documents in a list, then ask Solr
 to add them (using the solrj client), and call commit once it's finished.
 Now we also want to control concurrency, and for that we wanted to use
 Solr's optimistic lock/versioning feature. That is good, but *in the case
 of a bulk add, Solr doesn't add the docs as expected.* It fails as soon as
 it finds a doc with an optimistic lock failure and returns a response
 naming only the first failed doc (all docs before it are added, and none
 after it). *We need Solr to add every doc that has no versioning problem
 and return a list of all the failed docs.*
 Please can anyone suggest a way to do this?

 Regards
 Sankalp Gupta




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: SpellingQueryConverter and query parsing

2015-01-29 Thread Scott Stults
Thank you, James, I'll do that.

ResponseBuilder carries around with it the QParser, Query, and query
string, so getting suggestions from parsed query terms shouldn't be a big
deal. What looks to be hard is rewriting the original query with the
suggestions. That's probably why the regex is used instead of the parser.
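
Inside a component that would look roughly like this (hand-wavy and
untested -- extractTerms has moved around between Lucene versions):

  Query q = rb.getQuery();               // already parsed by the real parser
  Set<Term> terms = new HashSet<Term>();
  q.extractTerms(terms);
  // ...get suggestions per term. Splicing them back into the original
  // query string is the part with no clean general answer.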

-Scott

On Tue, Jan 27, 2015 at 1:37 PM, Dyer, James james.d...@ingramcontent.com
wrote:

 Having worked with the spellchecking code for the last few years, I've
 often wondered the same thing, but I never looked seriously into it.  I'm
 sure there's probably some serious hurdles, hence the Query Converter.  The
 easy thing to do here is to use spellcheck.q, and then pass in
 space-delimited keywords.  This bypasses the query converter entirely for
 custom situations like yours.

 But please, if you find a way to plug the actual query parser into
 spellcheck, consider opening a jira and contributing the code, even if what
 you end up with isn't in a final polished state for general use.

 James Dyer
 Ingram Content Group


 -Original Message-
 From: Scott Stults [mailto:sstu...@opensourceconnections.com]
 Sent: Tuesday, January 27, 2015 11:26 AM
 To: solr-user@lucene.apache.org
 Subject: SpellingQueryConverter and query parsing

 Hello!

 SpellingQueryConverter parses the incoming query in sort of a quick and
 dirty way with a regular expression. Is there a reason the query string
 isn't parsed with the _actual_ parser, if one was configured for that type
 of request? Even better, could the parsed query object be added to the
 response in some way so that the query wouldn't need to be parsed twice?
 The individual terms could then be visited and substituted in-place without
 needing to worry about preserving the meaning of operators in the query.

 The motive in my question is, I may need to implement a QueryConverter
 because I'm using a custom parser, and using that parser in the
 QueryConverter itself seems like the right thing to do. That wasn't done
 though in SpellingQueryConverter, so I wan't to find out why before I go
 blundering into a known minefield.


 Thanks!
 -Scott




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


SpellingQueryConverter and query parsing

2015-01-27 Thread Scott Stults
Hello!

SpellingQueryConverter parses the incoming query in sort of a quick and
dirty way with a regular expression. Is there a reason the query string
isn't parsed with the _actual_ parser, if one was configured for that type
of request? Even better, could the parsed query object be added to the
response in some way so that the query wouldn't need to be parsed twice?
The individual terms could then be visited and substituted in-place without
needing to worry about preserving the meaning of operators in the query.

The motive in my question is, I may need to implement a QueryConverter
because I'm using a custom parser, and using that parser in the
QueryConverter itself seems like the right thing to do. That wasn't done
though in SpellingQueryConverter, so I want to find out why before I go
blundering into a known minefield.


Thanks!
-Scott


Re: zkCli zkhost parameter

2014-04-28 Thread Scott Stults
I did, but it looks like I mixed in the chroot too after every entry rather
than once at the very end (thanks to David Smiley for catching that). I'll
try again and update if it's still a problem.
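
For the record, the form that should work puts the chroot once at the very
end (hosts here are examples):

  ./zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181/solr -cmd list

rather than zk1:2181/solr,zk2:2181/solr,zk3:2181/solr like I had it.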

Thanks!
-Scott




On Sat, Apr 26, 2014 at 1:08 PM, Mark Miller markrmil...@gmail.com wrote:

 Have you tried a comma-separated list or are you going by documentation?
 It should work.
 --
 Mark Miller
 about.me/markrmiller

 On April 26, 2014 at 1:03:25 PM, Scott Stults (
 sstu...@opensourceconnections.com) wrote:

 It looks like this only takes a single host as its value, whereas the
 zkHost environment variable for Solr takes a comma-separated list.
 Shouldn't the client also take a comma-separated list?

 k/r,
 Scott




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


zkCli zkhost parameter

2014-04-26 Thread Scott Stults
It looks like this only takes a single host as its value, whereas the
zkHost environment variable for Solr takes a comma-separated list.
Shouldn't the client also take a comma-separated list?

k/r,
Scott


JVM tuning?

2013-11-12 Thread Scott Stults
We've been using a slightly older version of this script to start Solr in
server environments:

https://github.com/apache/cassandra/blob/trunk/conf/cassandra-env.sh

The thing I especially like about it is its ability to dynamically cap
memory usage, and the garbage collection log section is a great reference
when we need to check gc times.
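
The auto-sizing bit boils down to something like this (a simplified
paraphrase, not the script verbatim):

  # take half the box's RAM, capped at 4G; cassandra-env.sh has fancier rules
  system_memory_in_mb=$(free -m | awk '/Mem:/ {print $2}')
  half_mem_in_mb=$((system_memory_in_mb / 2))
  max_heap_in_mb=$(( half_mem_in_mb > 4096 ? 4096 : half_mem_in_mb ))
  JVM_OPTS="$JVM_OPTS -Xms${max_heap_in_mb}M -Xmx${max_heap_in_mb}M"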

My question is, does anyone else use a script like this to configure the
JVM for Solr? Would it be useful to have this as a reference in
solr/example/etc?


Thanks!
-Scott


Re: Thoughts on production deployment?

2013-02-02 Thread Scott Stults
There's an RPM project on GitHub that comes close:

https://github.com/boogieshafer/jetty-solr-rpm



On Fri, Feb 1, 2013 at 6:19 AM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 When I was referring to the different version of Jetty, I meant Jetty
 Plus, which the wiki mentions. Is this no longer true?

 My Chef recipe makes assumptions about the OS and EBS volumes being
 available, which can easily be fixed.

 Michael
 Thanks for jumping in guys. I agree the SolrJetty page needs just a little
 updating -- I commented at the bottom of SOLR-3159 about that.

 Michael and Paul, are your chef and ant recipes generic enough to share? My
 next install is going to be on RHEL 6, so I can take a crack at an install
 script that'll work there. It wouldn't be hard to translate between shell
 and chef.

 Michael: The problem with adding a dependency on Jetty in your chef recipe
 is that it's going to grab whatever version of Jetty was blessed by the
 distro maintainers on your target platform.




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Thoughts on production deployment?

2013-02-01 Thread Scott Stults
Thanks for jumping in guys. I agree the SolrJetty page needs just a little
updating -- I commented at the bottom of SOLR-3159 about that.

Michael and Paul, are your chef and ant recipes generic enough to share? My
next install is going to be on RHEL 6, so I can take a crack at an install
script that'll work there. It wouldn't be hard to translate between shell
and chef.

Michael: The problem with adding a dependency on Jetty in your chef recipe
is that it's going to grab whatever version of Jetty was blessed by the
distro maintainers on your target platform.


Thoughts on production deployment?

2013-01-31 Thread Scott Stults
Part of this is a rant, part is a plea to others who've run successful 
production deployments.

Solr is a second-class citizen when it comes to production deployment. Every 
recipe I've seen (RPM, DEB, chef, or puppet) makes assumptions that in one way 
or another run afoul of best-practices when it comes to production use. And if 
you're not using one of these recipe formats to deploy Solr you're building a 
SnowflakeServer (Martin Fowler's term).

Granted, Solr _can_ be deployed into any vanilla JEE container, so the 
deployment spec responsibility may be erroneously assigned to whichever you 
choose. BUT, if you want to get the maximum out of Solr you'll want to put it 
on its own box, running in its own tuned container, and that container should 
be the one that Solr's been tested on repeatedly by an army of build bots. 
Right now that blessed container is Jetty version 8.1.2.v20120308.

So the first problem with the recipes is that they make a generic dependency of 
Jetty or Tomcat. The assumption there is that either can be treated as a 
generic OS facility to be shared with other apps. That's not true because Solr 
is the driving force behind which version is deployed. The container can't be 
up- or downgraded without affecting Solr, and any other app running in there 
needs to be aware that Solr is taking first priority.

The next problem is that most recipes don't make a distinction between 
collections. Solr configuration goes in one folder, Solr data goes in 
another, and the logs and container stuff gets scattered likewise. In reality, 
every collection can be configured differently and there is no generic Solr 
data. 

Lastly, the package maintainers of all the major OS distributions have ignored 
Solr since around version 1.4. That means if you want a newer version you're 
going to download a tarball and make another snowflake. This might be 
attributable to thinking of Solr as just another web app that doesn't need 
special packaging. Regardless, the consequence is that the only people who are 
deploying Solr according to best-practices are those intimately familiar with 
Solr.

So what's the best way to fix this situation? Solr already ships with 
everything it needs except Java and a start-up script. Maybe the first step is 
to include a generic install.sh script that has a couple distro-specific 
support scripts. That would be fairly agnostic toward package management 
systems and it would be useful to sysadmins right away. It would also help 
package maintainers update their build specs.

What do _you_ think? 


-Scott

Re: Will SolrCloud always slice by ID hash?

2013-01-07 Thread Scott Stults
Thanks guys. Yeah, separate rolling collections seem like the better way to
go.
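
For anyone who finds this later, the plan is a collection per day plus an
alias over the live ones -- roughly like this (parameters from memory, so
check the JIRA):

  /admin/collections?action=CREATE&name=logs_20130107&...
  /admin/collections?action=CREATEALIAS&name=logs&collections=logs_20130105,logs_20130106,logs_20130107

Each night, create tomorrow's collection, repoint the alias, and drop the
oldest one.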


-Scott

On Sat, Dec 29, 2012 at 1:30 AM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 https://issues.apache.org/jira/browse/SOLR-4237


Will SolrCloud always slice by ID hash?

2012-12-18 Thread Scott Stults
I'm going to be building a Solr cluster and I want to have a rolling set of
slices so that I can keep a fixed number of days in my collection. If I
send an update to a particular slice leader, will it always hash the unique
key and (probably) forward the doc to another leader?


Thank you,
Scott


Re: Do Hignlighting + proximity using surround query parser

2012-01-24 Thread Scott Stults
I got this working the way you describe it (in the getHighlightQuery()
method). The span queries were tripping it up, so I extracted the query
terms and created a DisMax query from them. There'll be a loss of accuracy
in the highlighting, but in my case that's better than no highlighting.
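
The gist of what I did, from memory (treat it as pseudocode -- names and
signatures here won't be exact):

  Set<Term> terms = new HashSet<Term>();
  query.rewrite(reader).extractTerms(terms); // pull raw terms out of the spans
  DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.0f);
  for (Term t : terms) {
    dmq.add(new TermQuery(t));
  }
  return dmq; // returned from getHighlightQuery() instead of the span query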

Should I just go ahead and submit a patch to SOLR-2703?


On Tue, Jan 10, 2012 at 9:35 AM, Ahmet Arslan iori...@yahoo.com wrote:

  I am not able to do highlighting with surround query parser
  on the returned
  results.
  I have tried the highlighting component but it does not
  return highlighted
  results.

 Highlighter does not recognize Surround Query. It must be re-written to
 enable highlighting in o.a.s.search.QParser#getHighlightQuery() method.

 Not sure this functionality should be added in SOLR-2703 or a separate
 jira issue.




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com