update document stuck on: java.net.SocketInputStream.socketRead0

2017-10-26 Thread Nawab Zada Asad Iqbal
Hi,

After upgrading to Solr 7, I am noticing that my '/update' requests sometimes
get stuck on this:

 - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
 - java.net.SocketInputStream.read(byte[], int, int, int) @bci=87, line=152 (Compiled frame)
 - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=122 (Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer() @bci=71, line=166 (Compiled frame)
 - org.apache.http.impl.io.SocketInputBuffer.fillBuffer() @bci=1, line=90 (Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer) @bci=137, line=281 (Compiled frame)
 - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=16, line=92 (Compiled frame)
 - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=2, line=62 (Compiled frame)
 - org.apache.http.impl.io.AbstractMessageParser.parse() @bci=38, line=254 (Compiled frame)
 - org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader() @bci=8, line=289 (Compiled frame)
 - org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader() @bci=1, line=252 (Compiled frame)
 - org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader() @bci=6, line=191 (Compiled frame)
 - org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=62, line=300 (Compiled frame)
 - org.apache.http.protocol.HttpRequestExecutor.execute(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=60, line=127 (Compiled frame)
 - org.apache.http.impl.client.DefaultRequestDirector.tryExecute(org.apache.http.impl.client.RoutedRequest, org.apache.http.protocol.HttpContext) @bci=198, line=715 (Compiled frame)
 - org.apache.http.impl.client.DefaultRequestDirector.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=574, line=520 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=344, line=906 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest, org.apache.http.protocol.HttpContext) @bci=21, line=805 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest) @bci=6, line=784 (Compiled frame)


It seems that I am hitting this issue:
https://stackoverflow.com/questions/28785085/how-to-prevent-hangs-on-socketinputstream-socketread0-in-java
Although I will fix the timeout settings in my client, I am curious: what has
changed in Solr 7 (I am upgrading from Solr 4) that would cause this?
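For reference, a minimal SolrJ sketch (not the actual client in question; the URL and
timeout values are illustrative) of setting explicit connect and socket read timeouts so
an /update call cannot block indefinitely in socketRead0:

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class UpdateClientWithTimeouts {
    public static void main(String[] args) {
        // Hypothetical core URL; adjust to the real endpoint.
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore")
                .withConnectionTimeout(10000) // ms allowed to establish the TCP connection
                .withSocketTimeout(60000)     // ms allowed to wait for a response (SO_TIMEOUT)
                .build();
        // ... use client.add(...) / client.commit() as usual, then client.close();
    }
}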


Thanks
Nawab


Re: Need help detecting Relatedness in documents

2017-10-26 Thread Atita Arora
Thanks for the suggestion, Anshum. I appreciate your response!

I tried using MLT with the field that stores the similarity index of the topics
this could be related to.
But this wasn't accepted as the solution, because it could not address the next
stage of the problem, where I need the effective 'number of posts' in which the
topics that MLT deduced as related were found together.
I believe MLT leverages these numbers to order the returned set internally,
so the major challenge is to also get those numbers, since they are plotted
on a graph.

I wonder if there's an alternative way to get them.
I'd appreciate any further input on this.

Thanks,
Atita

On Thu, Oct 26, 2017 at 11:36 PM, Anshum Gupta  wrote:

> I would suggest you look at the mlt query parser. That allows you to find
> documents similar to a particular document, and also allows for specifying
> the field to use for similarity purposes.
>
> https://lucene.apache.org/solr/guide/7_0/other-parsers.html#more-like-this-query-parser
>
> -Anshum
>
>
>
> On Oct 26, 2017, at 1:16 AM, Atita Arora  wrote:
>
> Hi,
>
> We're working on a product where the idea is to present the users with
> related documents in a particular time series.
>
> For an overview, think of this as an application which picks up top
> trending blogpost "topics", which are picked and ingested from various
> social sites.
> Further, when you look into a topic from the trending list, it shows the
> related topics that occur on the same blogposts.
> So, to be marked as related, topics should have occurred on the same
> blogpost; in addition, the more such co-occurrences there are, the higher
> the relatedness factor.
>
> The complexity is that the related topics change with the user-defined date
> spread, which means that if x & y were the top related topics in the
> blogposts made in the last 30 days,
> there is an equal possibility that x could be more related to z if the user
> had wanted to see related topics for the last 60 days.
> So the number of days is user defined, and it affects the related topics.
>
> For now every blogpost goes into the index as a separate document, and the
> topic extraction happens alongside indexing, which extracts the topics from
> the blogposts and stores them in a different collection.
> Because of this we also have a lot of duplicates in the index; for example,
> a topicname search for "football" returns around 80K documents, all of them
> with topicname="football".
>
> I wonder if someone can help me with:
> 1. How to structure the documents so that the queries can be more performant.
> 2. How we can detect the RELATED topics.
>
> Any help on this would be highly appreciated.
>
> Thanks in advance.
>
> Atita
>
>
>


Re: Edismax - bq taking precedence over pf

2017-10-26 Thread Chris Hostetter

: ok. Shouldn't pf be applied on top of bq=? That way, among the boosted
: object_types, if one has "Manufacturing" then it should be listed first?

No.

bq is an *additive* boost ... documents must match your "main query" to be 
included, but if document X scores very high against the bq query and 
very low against the main query, the cumulative score can still be higher 
than a document Y which scores mid-range against both.

You can think of it, under the covers, as essentially just taking your "q" (with 
qf and pf fields) and your "bq" and building a query that looks like...

(+q bq)

...which is why what people should use 99% of the time is "boost" instead 
of bq ... that creates a *multiplicative* boost factor based on the 
function specified (which can use the "query()" function to wrap an 
arbitrary query) instead.
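An illustrative sketch of this difference, reusing the field names from ruby's example
(whether they match the real schema is an assumption); the multiplicative line uses
Solr's if()/exists()/query() functions:

  bq=object_type_:(typeA)^10                                 additive: the boost query's score is added to the main score
  boost=if(exists(query({!v='object_type_:typeA'})),10,1)    multiplicative: the main score is multiplied by 10 for typeA docs

With the multiplicative form the type preference scales the relevance score instead of
being added to it, so a strong type match can no longer swamp a weak match on the
query terms.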


-Hoss
http://www.lucidworks.com/


Re: Need help detecting Relatedness in documents

2017-10-26 Thread Anshum Gupta
I would suggest you look at the mlt query parser. That allows you to find 
documents similar to a particular document, and also allows for specifying the 
field to use for similarity purposes.

https://lucene.apache.org/solr/guide/7_0/other-parsers.html#more-like-this-query-parser
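A hedged example of this query parser in use (the field name topicname comes from the
original question; the document id 1234 and the mintf/mindf values are illustrative):

  q={!mlt qf=topicname mintf=1 mindf=1}1234

This returns documents whose topicname content is most similar to that of the document
whose uniqueKey is 1234, ordered by the similarity score.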
 


-Anshum



> On Oct 26, 2017, at 1:16 AM, Atita Arora  wrote:
> 
> Hi,
>
> We're working on a product where the idea is to present the users with
> related documents in a particular time series.
>
> For an overview, think of this as an application which picks up top
> trending blogpost "topics", which are picked and ingested from various
> social sites.
> Further, when you look into a topic from the trending list, it shows the
> related topics that occur on the same blogposts.
> So, to be marked as related, topics should have occurred on the same
> blogpost; in addition, the more such co-occurrences there are, the higher
> the relatedness factor.
>
> The complexity is that the related topics change with the user-defined date
> spread, which means that if x & y were the top related topics in the
> blogposts made in the last 30 days,
> there is an equal possibility that x could be more related to z if the user
> had wanted to see related topics for the last 60 days.
> So the number of days is user defined, and it affects the related topics.
>
> For now every blogpost goes into the index as a separate document, and the
> topic extraction happens alongside indexing, which extracts the topics from
> the blogposts and stores them in a different collection.
> Because of this we also have a lot of duplicates in the index; for example,
> a topicname search for "football" returns around 80K documents, all of them
> with topicname="football".
>
> I wonder if someone can help me with:
> 1. How to structure the documents so that the queries can be more performant.
> 2. How we can detect the RELATED topics.
>
> Any help on this would be highly appreciated.
>
> Thanks in advance.
>
> Atita





Re: Edismax - bq taking precedence over pf

2017-10-26 Thread Josh Lincoln
I was asking about the field definitions from the schema.

It would also be helpful to see the debug info from the query. Just add
debug=true to see how the query and params were executed by solr and how
the calculation was done for each result.
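For instance (a hypothetical request shape, not the actual URL), appending the parameter
to the existing query:

  /solr/<collection>/select?q=Manufacturing&defType=edismax&pf=object_name^700&bq=object_type_:(typeA)^10&debug=true

The "explain" block in the debug output then shows, per document, how much of the score
came from the pf phrase boost versus the bq boost queries.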

On Thu, Oct 26, 2017 at 1:33 PM ruby  wrote:

> ok. Shouldn't pf be applied on top of bq=? That way, among the boosted
> object_types, if one has "Manufacturing" then it should be listed first?
>
> following are my objects:
>
>
> 
> 1
> Configuration
> typeA
> Manufacturing
>  <--catch all field where contents of all fields get
> copied to
> 
>
> 
> 2
> Manufacturing
> typeA
> xyz
>  <--catch all field where contents of all fields get
> copied to
> 
>
> I'm hoping to get id=2 first and then id=1, but I'm not seeing that. Is my
> understanding of qf= not correct?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Edismax - bq taking precedence over pf

2017-10-26 Thread ruby
ok. Shouldn't pf be applied on top of bq=? That way, among the boosted
object_types, if one has "Manufacturing" then it should be listed first?

following are my objects:



1
Configuration
typeA
Manufacturing
 <--catch all field where contents of all fields get
copied to



2
Manufacturing
typeA
xyz
 <--catch all field where contents of all fields get
copied to


I'm hoping to get id=2 first and then id=1, but I'm not seeing that. Is my
understanding of qf= not correct?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Failed to create collection SOLR 6.3 HDP 2.6.2

2017-10-26 Thread Dan Caulfield
I'm creating a collection on a new cluster. There are six new Solr nodes using 
an HDP 2.6.2 cluster for storage. Has anyone seen similar errors?


/usr/iopsolr/current/iop-solr/server/scripts/cloud-scripts/zkcli.sh -cmd 
upconfig -zkhost 
d2mitphmn1001.edc.nam.gm.com:2181,d2mitphmn1003.edc.nam.gm.com:2181,d2mitphmn1004.edc.nam.gm.com:2181/solr
 -confname maxis_clickstream -confdir 
/home/solr/solr_configs/maxis_clickstream/conf


/usr/iopsolr/current/iop-solr/bin/solr create -c maxis_clickstream -d 
/home/solr/solr_configs/maxis_clickstream/conf -n maxis_clickstream -s 6 -rf 1
Connecting to ZooKeeper at 
d2mitphmn1001.edc.nam.gm.com:2181,d2mitphmn1003.edc.nam.gm.com:2181,d2mitphmn1004.edc.nam.gm.com:2181,d2mitphmn1005.edc.nam.gm.com:2181,d2mitphmn1006.edc.nam.gm.com:2181/solr
 ...
Re-using existing configuration directory maxis_clickstream

Creating new collection 'maxis_clickstream' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=maxis_clickstream&numShards=6&replicationFactor=1&maxShardsPerNode=1&collection.configName=maxis_clickstream


ERROR: Failed to create collection 'maxis_clickstream' due to: 
{10.126.191.24:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
 from server at http://10.126.191.24:8983/solr: Error CREATEing SolrCore 
'maxis_clickstream_shard6_replica1': Unable to create core 
[maxis_clickstream_shard6_replica1] Caused by: no segments* file found in 
LockValidatingDirectoryWrapper(NRTCachingDirectory(BlockDirectory(HdfsDirectory@hdfs://edwbitstmil/apps/solr/data/maxis_clickstream/core_node4/data/index
 lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory@6d8569e4); 
maxCacheMB=192.0 maxMergeSizeMB=16.0)): files: [write.lock], 
10.126.191.28:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
 from server at http://10.126.191.28:8983/solr: Error CREATEing SolrCore 
'maxis_clickstream_shard4_replica1': Unable to create core 
[maxis_clickstream_shard4_replica1] Caused by: no segments* file found in 
LockValidatingDirectoryWrapper(NRTCachingDirectory(BlockDirectory(HdfsDirectory@hdfs://edwbitstmil/apps/solr/data/maxis_clickstream/core_node1/data/index
 lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory@77f88c13); 
maxCacheMB=192.0 maxMergeSizeMB=16.0)): files: [write.lock], 
10.126.191.27:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
 from server at http://10.126.191.27:8983/solr: Error CREATEing SolrCore 
'maxis_clickstream_shard5_replica1': Unable to create core 
[maxis_clickstream_shard5_replica1] Caused by: no segments* file found in 
LockValidatingDirectoryWrapper(NRTCachingDirectory(BlockDirectory(HdfsDirectory@hdfs://edwbitstmil/apps/solr/data/maxis_clickstream/core_node6/data/index
 lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory@4f658374); 
maxCacheMB=192.0 maxMergeSizeMB=16.0)): files: [write.lock], 
10.126.191.26:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
 from server at http://10.126.191.26:8983/solr: Error CREATEing SolrCore 
'maxis_clickstream_shard2_replica1': Unable to create core 
[maxis_clickstream_shard2_replica1] Caused by: no segments* file found in 
LockValidatingDirectoryWrapper(NRTCachingDirectory(BlockDirectory(HdfsDirectory@hdfs://edwbitstmil/apps/solr/data/maxis_clickstream/core_node2/data/index
 lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory@24deb971); 
maxCacheMB=192.0 maxMergeSizeMB=16.0)): files: [write.lock], 
10.126.191.25:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
 from server at http://10.126.191.25:8983/solr: Error CREATEing SolrCore 
'maxis_clickstream_shard3_replica1': Unable to create core 
[maxis_clickstream_shard3_replica1] Caused by: no segments* file found in 
LockValidatingDirectoryWrapper(NRTCachingDirectory(BlockDirectory(HdfsDirectory@hdfs://edwbitstmil/apps/solr/data/maxis_clickstream/core_node3/data/index
 lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory@3be7e4ac); 
maxCacheMB=192.0 maxMergeSizeMB=16.0)): files: [write.lock], 
10.126.191.23:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
 from server at http://10.126.191.23:8983/solr: Error CREATEing SolrCore 
'maxis_clickstream_shard1_replica1': Unable to create core 
[maxis_clickstream_shard1_replica1] Caused by: no segments* file found in 
LockValidatingDirectoryWrapper(NRTCachingDirectory(BlockDirectory(HdfsDirectory@hdfs://edwbitstmil/apps/solr/data/maxis_clickstream/core_node5/data/index
 lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory@24a98dd3); 
maxCacheMB=192.0 maxMergeSizeMB=16.0)): files: [write.lock]}








Re: TimeoutException, IOException, Read timed out

2017-10-26 Thread Fengtan
Thanks Erick and Emir -- we are going to start with <1> and possibly <2>.

On Thu, Oct 26, 2017 at 7:06 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Fengtan,
> I would just add that when merging collections, you might want to use
> document routing
> (https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting)
> - since you are keeping separate collections, I guess you have a
> “collection ID” to use as routing key. This will enable you to have one
> collection but query only shard(s) with data from one “collection”.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 25 Oct 2017, at 19:25, Erick Erickson 
> wrote:
> >
> > <1> It's not that explicit commits are expensive, it's that they happen
> > too fast. An explicit commit and an internal autocommit have exactly
> > the same cost. Your "overlapping ondeck searchers" is definitely an
> > indication that your commits are happening from somewhere too quickly
> > and are piling up.
> >
> > <2> Likely a good thing, each collection increases overhead. And
> > 1,000,000 documents is quite small in Solr's terms unless the
> > individual documents are enormous. I'd do this for a number of
> > reasons.
> >
> > <3> Certainly an option, but I'd put that last. Fix the commit problem
> first ;)
> >
> > <4> If you do this, make the autowarm count quite small. That said,
> > this will be very little use if you have frequent commits. Let's say
> > you commit every second. The autowarming will warm caches, which will
> > then be thrown out a second later. And will increase the time it takes
> > to open a new searcher.
> >
> > <5> Yeah, this would probably just be a band-aid.
> >
> > If I were prioritizing these, I'd do
> > <1> first. If you control the client, just don't call commit. If you
> > do not control the client, then what you've outlined is fine. Tip: set
> > your soft commit settings to be as long as you can stand. If you must
> > have very short intervals, consider disabling your caches completely.
> > Here's a long article on commits
> > https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > <2> Actually, this and <1> are pretty close in priority.
> >
> > Then re-evaluate. Fixing the commit issue may buy you quite a bit of
> > time. Having 1,000 collections is pushing the boundaries presently.
> > Each collection will establish watchers on the bits it cares about in
> > ZooKeeper, and reducing the watchers by a factor approaching 1,000 is
> > A Good Thing.
> >
> > Frankly, between these two things I'd pretty much expect your problems
> > to disappear. It wouldn't be the first time I've been totally wrong, but
> > it's where I'd start ;)
> >
> > Best,
> > Erick
> >
> > On Wed, Oct 25, 2017 at 8:54 AM, Fengtan  wrote:
> >> Hi,
> >>
> >> We run a SolrCloud 6.4.2 cluster with ZooKeeper 3.4.6 on 3 VM's.
> >> Each VM runs RHEL 7 with 16 GB RAM and 8 CPU and OpenJDK 1.8.0_131 ;
> each
> >> VM has one Solr and one ZK instance.
> >> The cluster hosts 1,000 collections ; each collection has 1 shard and
> >> between 500 and 50,000 documents.
> >> Documents are indexed incrementally every day ; the Solr client mostly
> does
> >> searching.
> >> Solr runs with -Xms7g -Xmx7g.
> >>
> >> Everything has been working fine for about one month but a few days ago
> we
> >> started to see Solr timeouts: https://pastebin.com/raw/E2prSrQm
> >>
> >> Also we have always seen these:
> >>  PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> >>
> >>
> >> We are not sure what is causing the timeouts, although we have
> identified a
> >> few things that could be improved:
> >>
> >> 1) Ignore explicit commits using IgnoreCommitOptimizeUpdateProcessorFactory
> >> -- we are aware that explicit commits are expensive
> >>
> >> 2) Drop the 1,000 collections and use a single one instead (all our
> >> collections use the same schema/solrconfig.xml) since stability problems
> >> are expected when the number of collections reaches the low hundreds. The
> >> downside is that the new collection would contain 1,000,000 documents
> which
> >> may bring new challenges.
> >>
> >> 3) Tune the GC and possibly switch from CMS to G1 as it seems to bring
> >> better performance according to this, this and this.

Re: Edismax - bq taking precedence over pf

2017-10-26 Thread Josh Lincoln
What's the analysis configuration for the object_name field and fieldType?
Perhaps the query is matching your catch-all field, but not the object_name
field, and therefore the pf boost never happens.
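For context, a hedged sketch of the kind of schema definition in question (the field and
type names follow ruby's example; whether object_name really uses a tokenized text type
is exactly the open point):

  <field name="object_name" type="text_general" indexed="true" stored="true"/>

If object_name were instead a string type (or otherwise not tokenized and lowercased), a
query for Manufacturing would only match the catch-all copy field, and the pf boost on
object_name would never fire.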




On Thu, Oct 26, 2017 at 8:55 AM ruby  wrote:

> I'm noticing in my following query bq= is taking precedence over pf.
>
> q=Manufacturing
> qf=Catch_all_Copy_field
> pf=object_id^40+object_name^700
> bq=object_rating:(best)^10
> bq=object_rating:(candidate)^8
> bq=object_rating:(placeholder)^5
> bq=object_type_:(typeA)^10
> bq=object_type_:(typeB)^10
> bq=object_type_:(typeC)^10
>
> My intention is to show all objects of typeA having "Manufacturing" in name
> first
>
> But I'm seeing all typeA, typeB, typeC objects being listed first,
> even if their name is not "Manufacturing".
>
> Is my query correct or my understanding of pf and bq parameters correct?
>
> Thanks
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Edismax - bq taking precedence over pf

2017-10-26 Thread ruby
I'm noticing in my following query bq= is taking precedence over pf.

q=Manufacturing
qf=Catch_all_Copy_field
pf=object_id^40+object_name^700
bq=object_rating:(best)^10
bq=object_rating:(candidate)^8
bq=object_rating:(placeholder)^5
bq=object_type_:(typeA)^10
bq=object_type_:(typeB)^10
bq=object_type_:(typeC)^10

My intention is to show all objects of typeA having "Manufacturing" in name
first

But I'm seeing all typeA, typeB, typeC objects being listed first,
even if their name is not "Manufacturing".

Is my query correct or my understanding of pf and bq parameters correct?

Thanks



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


CVE-2016-6809: Java code execution for serialized objects embedded in MATLAB files parsed by Apache Solr using Apache Tika

2017-10-26 Thread Shalin Shekhar Mangar
CVE-2016-6809: Java code execution for serialized objects embedded in
MATLAB files parsed by Apache Solr using Tika

Severity: Important

Vendor:
The Apache Software Foundation

Versions Affected:
Solr 5.0.0 to 5.5.4
Solr 6.0.0 to 6.6.1
Solr 7.0.0 to 7.0.1

Description:

Apache Solr uses Apache Tika for parsing binary file types such as
doc, xls, pdf etc. Apache Tika wraps the jmatio parser
(https://github.com/gradusnikov/jmatio) to handle MATLAB files. The
parser uses native deserialization on serialized Java objects embedded
in MATLAB files. A malicious user could inject arbitrary code into a
MATLAB file that would be executed when the object is deserialized.

This vulnerability was originally described at
http://mail-archives.apache.org/mod_mbox/tika-user/201611.mbox/%3C2125912914.1308916.1478787314903%40mail.yahoo.com%3E

Mitigation:
Users are advised to upgrade to either Solr 5.5.5 or Solr 6.6.2 or Solr 7.1.0
releases which have fixed this vulnerability.

Solr 5.5.5 upgrades the jmatio parser to v1.2 and disables the Java
deserialisation support to protect against this vulnerability.

Solr 6.6.2 and Solr 7.1.0 have upgraded the bundled Tika to v1.16.

Once upgrade is complete, no other steps are required.

References:
https://issues.apache.org/jira/browse/SOLR-11486
https://issues.apache.org/jira/browse/SOLR-10335
https://wiki.apache.org/solr/SolrSecurity

-- 
Regards,
Shalin Shekhar Mangar.


Re: TimeoutException, IOException, Read timed out

2017-10-26 Thread Emir Arnautović
Hi Fengtan,
I would just add that when merging collections, you might want to use document
routing
(https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting)
- since you are keeping separate collections, I guess you have a “collection
ID” to use as routing key. This will enable you to have one collection but
query only shard(s) with data from one “collection”.
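A hedged sketch of this compositeId routing (the "collA" prefix and document ids are
hypothetical):

  Indexing:  give each document an id prefixed with its old collection name, e.g. id=collA!doc42
  Querying:  restrict the query to the shard(s) holding that prefix, e.g. /select?q=*:*&_route_=collA!

With the default compositeId router the prefix before "!" determines the hash range, so
all documents from one former collection land on the same shard(s) and can still be
queried in isolation.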

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 25 Oct 2017, at 19:25, Erick Erickson  wrote:
> 
> <1> It's not that explicit commits are expensive, it's that they happen
> too fast. An explicit commit and an internal autocommit have exactly
> the same cost. Your "overlapping ondeck searchers" is definitely an
> indication that your commits are happening from somewhere too quickly
> and are piling up.
> 
> <2> Likely a good thing, each collection increases overhead. And
> 1,000,000 documents is quite small in Solr's terms unless the
> individual documents are enormous. I'd do this for a number of
> reasons.
> 
> <3> Certainly an option, but I'd put that last. Fix the commit problem first 
> ;)
> 
> <4> If you do this, make the autowarm count quite small. That said,
> this will be very little use if you have frequent commits. Let's say
> you commit every second. The autowarming will warm caches, which will
> then be thrown out a second later. And will increase the time it takes
> to open a new searcher.
> 
> <5> Yeah, this would probably just be a band-aid.
> 
> If I were prioritizing these, I'd do
> <1> first. If you control the client, just don't call commit. If you
> do not control the client, then what you've outlined is fine. Tip: set
> your soft commit settings to be as long as you can stand. If you must
> have very short intervals, consider disabling your caches completely.
> Here's a long article on commits
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> 
> <2> Actually, this and <1> are pretty close in priority.
> 
> Then re-evaluate. Fixing the commit issue may buy you quite a bit of
> time. Having 1,000 collections is pushing the boundaries presently.
> Each collection will establish watchers on the bits it cares about in
> ZooKeeper, and reducing the watchers by a factor approaching 1,000 is
> A Good Thing.
> 
> Frankly, between these two things I'd pretty much expect your problems
> to disappear. It wouldn't be the first time I've been totally wrong, but
> it's where I'd start ;)
> 
> Best,
> Erick
> 
> On Wed, Oct 25, 2017 at 8:54 AM, Fengtan  wrote:
>> Hi,
>> 
>> We run a SolrCloud 6.4.2 cluster with ZooKeeper 3.4.6 on 3 VM's.
>> Each VM runs RHEL 7 with 16 GB RAM and 8 CPU and OpenJDK 1.8.0_131 ; each
>> VM has one Solr and one ZK instance.
>> The cluster hosts 1,000 collections ; each collection has 1 shard and
>> between 500 and 50,000 documents.
>> Documents are indexed incrementally every day ; the Solr client mostly does
>> searching.
>> Solr runs with -Xms7g -Xmx7g.
>> 
>> Everything has been working fine for about one month but a few days ago we
>> started to see Solr timeouts: https://pastebin.com/raw/E2prSrQm
>> 
>> Also we have always seen these:
>>  PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>> 
>> 
>> We are not sure what is causing the timeouts, although we have identified a
>> few things that could be improved:
>> 
>> 1) Ignore explicit commits using IgnoreCommitOptimizeUpdateProcessorFactory
>> -- we are aware that explicit commits are expensive (a config sketch for this
>> appears after this message)
>> 
>> 2) Drop the 1,000 collections and use a single one instead (all our
>> collections use the same schema/solrconfig.xml) since stability problems
>> are expected when the number of collections reaches the low hundreds. The
>> downside is that the new collection would contain 1,000,000 documents which
>> may bring new challenges.
>> 
>> 3) Tune the GC and possibly switch from CMS to G1 as it seems to bring
>> better performance according to this, this and this.
>> The downside is that Lucene explicitly discourages the usage of G1,
>> so we are not sure what to expect. We use the default GC settings:
>>  -XX:NewRatio=3
>>  -XX:SurvivorRatio=4
>>  -XX:TargetSurvivorRatio=90
>>  
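Regarding item 1) in the list above, a minimal solrconfig.xml sketch (the chain name,
statusCode and commit intervals are illustrative, not tuned recommendations): the
processor silently swallows client commits, while autoCommit/autoSoftCommit (inside
<updateHandler>) control when changes actually become durable and visible:

  <updateRequestProcessorChain name="ignore-client-commits" default="true">
    <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
      <int name="statusCode">200</int>  <!-- respond 200 OK but ignore the client's commit -->
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

  <autoCommit>
    <maxTime>60000</maxTime>           <!-- hard commit every 60s for durability -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>          <!-- open a new searcher at most every 5 minutes -->
  </autoSoftCommit>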

Solr requires both hl.fl and df to be the same for correct highlighting.

2017-10-26 Thread Amrit Sarkar
Solr version: 6.5.x

Why do we need to pass the same field for hl.fl and df to get correct highlighting?

Let us suppose I am highlighting on field fieldA, which has a stemming filter
in its analysis chain.

Sample doc: {"id":"1", "fieldA":"Vacation"}

If I then send this highlighting request:
> "params":{
>   "q":"Vacation",
>   "hl":"on",
>   "indent":"on",
>   "hl.fl":"fieldA",
>   "wt":"json"}


Highlighting doesn't work, because "Vacation" analyzed via _text_ (text_general)
remains "Vacation", while in the index it is stored as "vacat".

I debugged through the code. At HighlightComponent::169:

highlightQuery = rb.getQparser().getHighlightQuery();


the highlightQuery that is passed along is the parsed user query, in this
case _text_:Vacation.

Fast-forwarding to WeightedSpanTermExtractor::extractWeightedTerms::366::

for (final Term queryTerm : nonWeightedTerms) {
  if (fieldNameComparator(queryTerm.field())) {
    WeightedSpanTerm weightedSpanTerm = new WeightedSpanTerm(boost, queryTerm.text());
    terms.put(queryTerm.text(), weightedSpanTerm);
  }
}

extracted term is "Vacation".

Jumping to core highlighting code:

Highlighter::getBestTextFragments::213

TokenGroup tokenGroup=new TokenGroup(tokenStream);


Each tokenStream has the analyzed token "vacat", which obviously doesn't
match the extracted term.

Why do the df/qf values matter for what we pass in "hl.fl"? Shouldn't the
query that is to be highlighted be analyzed by the field passed in "hl.fl"?
But then multiple fields can be passed in "hl.fl". I'm just wondering how it
is supposed to be done. Any explanation will be fine.
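A hedged illustration of one way to line these up (the core name is hypothetical; fieldA
is from the example above): query the highlighted field directly, so the query term goes
through fieldA's analysis, becomes "vacat", and matches the indexed tokens the
highlighter sees:

  /solr/mycore/select?q=fieldA:Vacation&hl=on&hl.fl=fieldA&wt=json

In the failing request above, q=Vacation is parsed against _text_ (text_general), so the
extracted term stays "Vacation" and never lines up with fieldA's stemmed token stream.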

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2


Need help detecting Relatedness in documents

2017-10-26 Thread Atita Arora
Hi,

We're working on a product where the idea is to present the users with
related documents in a particular time series.

For an overview, think of this as an application which picks up top
trending blogpost "topics", which are picked and ingested from various
social sites.
Further, when you look into a topic from the trending list, it shows the
related topics that occur on the same blogposts.
So, to be marked as related, topics should have occurred on the same
blogpost; in addition, the more such co-occurrences there are, the higher
the relatedness factor.

The complexity is that the related topics change with the user-defined date
spread, which means that if x & y were the top related topics in the
blogposts made in the last 30 days,
there is an equal possibility that x could be more related to z if the user
had wanted to see related topics for the last 60 days.
So the number of days is user defined, and it affects the related topics.

For now every blogpost goes into the index as a separate document, and the
topic extraction happens alongside indexing, which extracts the topics from
the blogposts and stores them in a different collection.
Because of this we also have a lot of duplicates in the index; for example,
a topicname search for "football" returns around 80K documents, all of them
with topicname="football".

I wonder if someone can help me with:
1. How to structure the documents so that the queries can be more performant.
2. How we can detect the RELATED topics.

Any help on this would be highly appreciated.

Thanks in advance.

Atita