Re: Solr still gives old data while faceting from the deleted documents

2018-04-12 Thread girish.vignesh
mincount will fix this issue for sure. I have tried that but the requirement
is to show facets with 0 count as disabled. 

I think I am left with only two options: either go with expungeDeletes via the
update URL or run optimize from a scheduler.
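
For reference, a rough SolrJ sketch of both options (the SolrClient, the
collection name "mycollection" and the scheduling are placeholders, not
something I have tested against our setup):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.UpdateRequest;

    public class CleanupDeletes {

        // Option 1: a commit with expungeDeletes=true, the same thing the
        // expungeDeletes parameter on the update URL does.
        static void expungeDeletes(SolrClient client) throws Exception {
            UpdateRequest req = new UpdateRequest();
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            req.setParam("expungeDeletes", "true");  // merge away segments holding deleted docs
            req.process(client, "mycollection");
        }

        // Option 2: run from a scheduler, e.g. once a day. Heavier than
        // expungeDeletes because it rewrites the whole index.
        static void nightlyOptimize(SolrClient client) throws Exception {
            client.optimize("mycollection");
        }
    }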

Regards,
Vignesh





Re: Solr 7.2

2018-04-12 Thread Shawn Heisey

On 4/12/2018 9:48 PM, Antony A wrote:

Thank you. I was trying to create the collection using the API.
Unfortunately the API changes a bit between 6x to 7x.

I posted the API that I used to create the collection and subsequently when
trying to create cores for the same collection.

https://pastebin.com/hrydZktX


You're going to need to look for errors in the solr.log file.  There 
should be something that explains what went wrong.  The logfile often has 
more information than the response.


The top guess I have is that Java couldn't validate the certificate for 
https, but without error logs, I can't say for sure.


Thanks,
Shawn



Re: Solr 7.2

2018-04-12 Thread Antony A
Hi Edwin,

Thank you. I was trying to create the collection using the API.
Unfortunately the API changes a bit between 6.x and 7.x.

I posted the API that I used to create the collection and subsequently when
trying to create cores for the same collection.

https://pastebin.com/hrydZktX

Hopefully this helps.

Thanks

On Thu, Apr 12, 2018 at 2:06 PM, Antony A  wrote:

> Hi,
>
> I am trying to add a replica to the ssl-enabled solr cloud with external
> zookeeper ensemble.
>
> 2018-04-12 18:26:29.140 INFO  (qtp672320506-51) [   ]
> o.a.s.h.a.CollectionsHandler Invoked Collection Action :addreplica with
> params 
> node=_solr=ADDREPLICA=collection_name=shard1
> and sendToOCPQueue=true
>
> 2018-04-12 18:26:29.151 ERROR (OverseerThreadFactory-10-thre
> ad-5-processing-n:_solr) [   ] 
> o.a.s.c.OverseerCollectionMessageHandler
> Collection: tnlookup operation: addreplica 
> failed:org.apache.solr.common.SolrException:
> At lea
>
>


Re: Solr 7.2

2018-04-12 Thread Zheng Lin Edwin Yeo
Hi,

I can't really tell what issue you are facing.

Regards,
Edwin

On 13 April 2018 at 04:06, Antony A  wrote:

>  Hi,
>
> I am trying to add a replica to the ssl-enabled solr cloud with external
> zookeeper ensemble.
>
> 2018-04-12 18:26:29.140 INFO  (qtp672320506-51) [   ]
> o.a.s.h.a.CollectionsHandler Invoked Collection Action :addreplica with
> params node=_solr=ADDREPLICA=
> collection_name=shard1
> and sendToOCPQueue=true
>
> 2018-04-12 18:26:29.151 ERROR (OverseerThreadFactory-10-
> thread-5-processing-n:_solr) [   ] o.a.s.c.
> OverseerCollectionMessageHandler Collection: tnlookup operation:
> addreplica
> failed:org.apache.solr.common.SolrException: At lea
>


How to index and search (integer or float) vector.

2018-04-12 Thread Jason
Hi,

I have documents that consist of an integer vector with a fixed length, but I
have no idea how to index an integer vector and search for similar vectors.
Which fieldType should I use to solve this problem? And can I get an example
of how to search?




Solr 7.2

2018-04-12 Thread Antony A
 Hi,

I am trying to add a replica to the SSL-enabled Solr cloud with an external
ZooKeeper ensemble.

2018-04-12 18:26:29.140 INFO  (qtp672320506-51) [   ]
o.a.s.h.a.CollectionsHandler Invoked Collection Action :addreplica with
params 
node=_solr=ADDREPLICA=collection_name=shard1
and sendToOCPQueue=true

2018-04-12 18:26:29.151 ERROR (OverseerThreadFactory-10-
thread-5-processing-n:_solr) [   ] o.a.s.c.
OverseerCollectionMessageHandler Collection: tnlookup operation: addreplica
failed:org.apache.solr.common.SolrException: At lea


Re: DIH with huge data

2018-04-12 Thread Sujay Bawaskar
That sounds like a good option. So the Spark job will connect to MySQL and create
Solr documents, which are pushed into Solr using SolrJ, probably in batches.
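
Roughly what I have in mind on the SolrJ side — just a sketch, assuming an
already-built SolrClient, a collection named "mycollection" and a batch size of
1000 (all placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        static void index(SolrClient client, Iterable<String[]> rows) throws Exception {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (String[] row : rows) {              // rows produced by the Spark job / JDBC read
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", row[0]);          // field names are placeholders
                doc.addField("name", row[1]);
                batch.add(doc);
                if (batch.size() == 1000) {          // push in batches, not one by one
                    client.add("mycollection", batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add("mycollection", batch);   // flush the last partial batch
            }
            client.commit("mycollection");           // or rely on autoCommit
        }
    }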

On Thu, Apr 12, 2018 at 10:48 PM, Rahul Singh 
wrote:

> If you want speed, Spark is the fastest easiest way. You can connect to
> relational tables directly and import or export to CSV / JSON and import
> from a distributed filesystem like S3 or HDFS.
>
> Combining a dfs with spark and a highly available SolR - you are
> maximizing all threads.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar ,
> wrote:
> > Thanks Rahul. Data source is JdbcDataSource with MySQL database. Data
> size
> > is around 100GB.
> > I am not much familiar with spark but are you suggesting that we should
> > create document by merging distinct RDBMS tables in using RDD?
> >
> > On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh <
> rahul.xavier.si...@gmail.com
> > wrote:
> >
> > > How much data and what is the database source? Spark is probably the
> > > fastest way.
> > >
> > > --
> > > Rahul Singh
> > > rahul.si...@anant.us
> > >
> > > Anant Corporation
> > >
> > > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar <
> sujaybawas...@gmail.com>,
> > > wrote:
> > > > Hi,
> > > >
> > > > We are using DIH with SortedMapBackedCache but as data size
> increases we
> > > > need to provide more heap memory to solr JVM.
> > > > Can we use multiple CSV file instead of database queries and later
> data
> > > in
> > > > CSV files can be joined using zipper? So bottom line is to create CSV
> > > files
> > > > for each of entity in data-config.xml and join these CSV files using
> > > > zipper.
> > > > We also tried EHCache based DIH cache but since EHCache uses MMap IO
> its
> > > > not good to use with MMapDirectoryFactory and causes to exhaust
> physical
> > > > memory on machine.
> > > > Please suggest how can we handle use case of importing huge amount of
> > > data
> > > > into solr.
> > > >
> > > > --
> > > > Thanks,
> > > > Sujay P Bawaskar
> > > > M:+91-77091 53669
> > >
> >
> >
> >
> > --
> > Thanks,
> > Sujay P Bawaskar
> > M:+91-77091 53669
>



-- 
Thanks,
Sujay P Bawaskar
M:+91-77091 53669


Re: DIH with huge data

2018-04-12 Thread Rahul Singh

CSV -> Spark -> SolR

https://github.com/lucidworks/spark-solr/blob/master/docs/examples/csv.adoc
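
A rough Java sketch of the pattern in that csv.adoc example — assuming the
spark-solr connector is on the classpath and that its "zkhost" / "collection"
options are as in its documentation; the paths and hosts are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CsvToSolr {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("csv-to-solr").getOrCreate();

            // Read CSV from a distributed filesystem (HDFS/S3) or a local path
            Dataset<Row> df = spark.read()
                .format("csv")
                .option("header", "true")
                .load("hdfs:///data/export/*.csv");

            // Write to Solr through the spark-solr data source
            df.write()
                .format("solr")
                .option("zkhost", "zk1:2181,zk2:2181,zk3:2181/solr")
                .option("collection", "mycollection")
                .mode(SaveMode.Overwrite)
                .save();

            spark.stop();
        }
    }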

If speed is not an issue there are other methods. Spring Batch / Spring Data 
might have all the tools you need to get speed without Spark.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar , wrote:
> Thanks Rahul. Data source is JdbcDataSource with MySQL database. Data size
> is around 100GB.
> I am not much familiar with spark but are you suggesting that we should
> create document by merging distinct RDBMS tables in using RDD?
>
> On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh  wrote:
>
> > How much data and what is the database source? Spark is probably the
> > fastest way.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar ,
> > wrote:
> > > Hi,
> > >
> > > We are using DIH with SortedMapBackedCache but as data size increases we
> > > need to provide more heap memory to solr JVM.
> > > Can we use multiple CSV file instead of database queries and later data
> > in
> > > CSV files can be joined using zipper? So bottom line is to create CSV
> > files
> > > for each of entity in data-config.xml and join these CSV files using
> > > zipper.
> > > We also tried EHCache based DIH cache but since EHCache uses MMap IO its
> > > not good to use with MMapDirectoryFactory and causes to exhaust physical
> > > memory on machine.
> > > Please suggest how can we handle use case of importing huge amount of
> > data
> > > into solr.
> > >
> > > --
> > > Thanks,
> > > Sujay P Bawaskar
> > > M:+91-77091 53669
> >
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669


Re: DIH with huge data

2018-04-12 Thread Rahul Singh
If you want speed, Spark is the fastest and easiest way. You can connect to 
relational tables directly and import, or export to CSV / JSON and import from a 
distributed filesystem like S3 or HDFS.

Combining a DFS with Spark and a highly available Solr, you are maximizing all 
threads.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar , wrote:
> Thanks Rahul. Data source is JdbcDataSource with MySQL database. Data size
> is around 100GB.
> I am not much familiar with spark but are you suggesting that we should
> create document by merging distinct RDBMS tables in using RDD?
>
> On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh  wrote:
>
> > How much data and what is the database source? Spark is probably the
> > fastest way.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar ,
> > wrote:
> > > Hi,
> > >
> > > We are using DIH with SortedMapBackedCache but as data size increases we
> > > need to provide more heap memory to solr JVM.
> > > Can we use multiple CSV file instead of database queries and later data
> > in
> > > CSV files can be joined using zipper? So bottom line is to create CSV
> > files
> > > for each of entity in data-config.xml and join these CSV files using
> > > zipper.
> > > We also tried EHCache based DIH cache but since EHCache uses MMap IO its
> > > not good to use with MMapDirectoryFactory and causes to exhaust physical
> > > memory on machine.
> > > Please suggest how can we handle use case of importing huge amount of
> > data
> > > into solr.
> > >
> > > --
> > > Thanks,
> > > Sujay P Bawaskar
> > > M:+91-77091 53669
> >
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669


Re: PreAnalyzed URP and SchemaRequest API

2018-04-12 Thread David Smiley
Ah ok.
I've wondered how much value there is in pre-analysis.  The serialization
of the analyzed form in JSON is bulky.  If you can share any results, I'd
be interested to hear how it went.  It's an optimization so you should be
able to know how much better it is.  Of course it isn't for everybody --
only when the analysis chain is sufficiently complex.

On Mon, Apr 9, 2018 at 9:45 AM Markus Jelsma 
wrote:

> Hello David,
>
> The remote client has everything on the class path but just calling
> setTokenStream is not going to work. Remotely, all i get from SchemaRequest
> API is a AnalyzerDefinition. I haven't found any Solr code that allows me
> to transform that directly into an analyzer. If i had that, it would make
> things easy.
>
> As far as i see it, i need to reconstruct a real Analyzer using
> AnalyzerDefinition's information. It won't be a problem, but it is
> cumbersome.
>
> Thanks anyway,
> Markus
>
> -Original message-
> > From:David Smiley 
> > Sent: Thursday 5th April 2018 19:38
> > To: solr-user@lucene.apache.org
> > Subject: Re: PreAnalyzed URP and SchemaRequest API
> >
> > Is this really a problem when you could easily enough create a TextField
> > and call setTokenStream?
> >
> > Does your remote client have Solr-core and all its dependencies on the
> > classpath?   That's one way to do it... and presumably the direction you
> > are going because you're asking how to work with PreAnalyzedParser which
> is
> > in solr-core.  *Alternatively*, only bring in Lucene core and construct
> > things yourself in the right format.  You could copy PreAnalyzedParser
> into
> > your codebase so that you don't have to reinvent any wheels, even though
> > that's awkward.  Perhaps that ought to be in Solrj?  But no we don't want
> > SolrJ depending on Lucene-core, though it'd make a fine "optional"
> > dependency.
> >
> > On Wed, Apr 4, 2018 at 4:53 AM Markus Jelsma  >
> > wrote:
> >
> > > Hello,
> > >
> > > We intend to move to PreAnalyzed URP for analysis offloading. Browsing
> the
> > > Javadocs i came across the SchemaRequest API looking for a way to get a
> > > Field object remotely, which i seem to need for
> > > JsonPreAnalyzedParser.toFormattedString(Field f). But all i can get
> from
> > > SchemaRequest API is FieldTypeRepresentation, which offers me
> > > getIndexAnalyzer() but won't allow me to construct a Field object.
> > >
> > > So, to analyze remotely i do need an index-time analyzer. I can get it,
> > > but not turn it into a Field object, which the PreAnalyzedParser for
> some
> > > reason wants.
> > >
> > > Any hints here? I must be looking the wrong way.
> > >
> > > Many thanks!
> > > Markus
> > >
> > --
> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> > http://www.solrenterprisesearchserver.com
> >
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: DIH with huge data

2018-04-12 Thread Sujay Bawaskar
Thanks Rahul. Data source is JdbcDataSource with MySQL database. Data size
is around 100GB.
I am not very familiar with Spark, but are you suggesting that we should
create documents by merging distinct RDBMS tables using RDDs?

On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh 
wrote:

> How much data and what is the database source? Spark is probably the
> fastest way.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar ,
> wrote:
> > Hi,
> >
> > We are using DIH with SortedMapBackedCache but as data size increases we
> > need to provide more heap memory to solr JVM.
> > Can we use multiple CSV file instead of database queries and later data
> in
> > CSV files can be joined using zipper? So bottom line is to create CSV
> files
> > for each of entity in data-config.xml and join these CSV files using
> > zipper.
> > We also tried EHCache based DIH cache but since EHCache uses MMap IO its
> > not good to use with MMapDirectoryFactory and causes to exhaust physical
> > memory on machine.
> > Please suggest how can we handle use case of importing huge amount of
> data
> > into solr.
> >
> > --
> > Thanks,
> > Sujay P Bawaskar
> > M:+91-77091 53669
>



-- 
Thanks,
Sujay P Bawaskar
M:+91-77091 53669


Re: DIH with huge data

2018-04-12 Thread Rahul Singh
How much data and what is the database source? Spark is probably the fastest 
way.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar , wrote:
> Hi,
>
> We are using DIH with SortedMapBackedCache but as data size increases we
> need to provide more heap memory to solr JVM.
> Can we use multiple CSV file instead of database queries and later data in
> CSV files can be joined using zipper? So bottom line is to create CSV files
> for each of entity in data-config.xml and join these CSV files using
> zipper.
> We also tried EHCache based DIH cache but since EHCache uses MMap IO its
> not good to use with MMapDirectoryFactory and causes to exhaust physical
> memory on machine.
> Please suggest how can we handle use case of importing huge amount of data
> into solr.
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669


RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-12 Thread Allison, Timothy B.
There's also, of course, tika-server. 

No matter the method, it is always best to isolate Tika in its own JVM, VM or machine.

-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Monday, April 9, 2018 4:15 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document 
instead of Solr's MostlyPassthroughHtmlMapper ?

As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service 
https://github.com/mattflax/dropwizard-tika-server written by a colleague of 
mine at Flax. Hope this is useful.

Cheers

Charlie




Re: How many SynonymGraphFilterFactory can I have?

2018-04-12 Thread Shawn Heisey

On 4/12/2018 6:46 AM, Vincenzo D'Amore wrote:

Thanks Shawn, synonyms right now are just organized in categories with
different meanings. Thanks a lot for the response.
I think this behaviour should be clearly stated in the documentation. Can I
access to solr guide and add few notes on this?


Anyone can access the Solr reference guide.  It's included in Solr's 
source code, and the online/PDF guides are built from that code.  Only a 
committer can actually change it, but if you submit your desired changes 
with an issue in Jira, all good changes are likely to be included.  It 
is possible to automatically link a github pull request with an issue in 
Jira, just by mentioning the SOLR- issue identifier in the PR 
description.


Thanks,
Shawn



Re: ant eclipse on branch_6_4

2018-04-12 Thread Steve Rowe
Hi,

You probably have a stale Ivy lock file in ~/.ivy2/cache/, very likely orphaned 
as a result of manually interrupting the Lucene/Solr build, e.g. via Ctrl-C.

You can find it via e.g.: find ~/.ivy2/cache -name ‘*.lck’

Once you have found the stale lock file, manually deleting it should allow the 
build to start working again.

FYI, LUCENE-6144[1] upgraded Lucene/Solr’s Ivy dependency to 2.4.0, and as part 
of that upgrade, switched the build’s lock strategy[2] from “artifact-lock” to 
“artifact-lock-nio”.  As a result, in branch_5_5, branch_6_6, branch_7* and 
master, ‘*.lck’ files are expected to be present all the time in the Ivy cache, 
and their presence will not cause the build to fail.  More info on IVY-1489[3].

[1] https://issues.apache.org/jira/browse/LUCENE-6144
[2] 
http://ant.apache.org/ivy/history/latest-milestone/settings/lock-strategies.html
[3] https://issues.apache.org/jira/browse/IVY-1489

--
Steve
www.lucidworks.com

> On Apr 12, 2018, at 10:34 AM, rgummadi  wrote:
> 
> I cloned lucene-solr git and working on git branch branch_6_4. I am trying to
> make this eclipse compatible.
> So I "ant eclipse" from the root folder. I am getting the below error. Can
> some one suggest a resolution.
> 
> [ivy:retrieve]  ::
> [ivy:retrieve]  ::  UNRESOLVED DEPENDENCIES ::
> [ivy:retrieve]  ::
> [ivy:retrieve]  :: junit#junit;4.10: not found
> [ivy:retrieve]  ::
> [ivy:retrieve]
> [ivy:retrieve]  ERRORS
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
> [ivy:retrieve]
> [ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
> 
> 
> 
> 
> 
> 



Re: Decision on Number of shards and collection

2018-04-12 Thread Shawn Heisey

On 4/12/2018 4:57 AM, neotorand wrote:

I read from the link you shared that
"Shard cannot contain more than 2 billion documents since Lucene is using
integer for internal IDs."

In which java class of SOLR implimentaion repository this can be found.


The 2 billion limit  is a *hard* limit from Lucene.  It's not in Solr.  
It's pretty much the only hard limit that Lucene actually has - there's 
a workaround for everything else.  Solr can overcome this limit for a 
single index by sharding the index into multiple physical indexes across 
multiple servers, which is more automated in SolrCloud than in 
standalone mode.


The 2 billion limit per individual index can't be raised. Lucene uses an 
"int" datatype to hold the internal ID everywhere it's used.  Java 
numeric types are signed, which means that the maximum number a 32-bit 
data type can hold is 2147483647.  This is the value returned by the 
Java constant Integer.MAX_VALUE.  A little bit is subtracted from that 
value to obtain the maximum it will attempt to use, to be absolutely 
sure it can't go over.


https://issues.apache.org/jira/browse/LUCENE-5843

Raising the limit is theoretically possible, but not without *MAJOR* 
surgery to an extremely large amount of Lucene's code. The risk of bugs 
when attempting that change is *VERY* high -- it could literally take 
months to find them all and fix them.


The two most popular search engines using Lucene are Solr and 
elasticsearch. Both of these packages can overcome the 2 billion limit 
with sharding.


Summary: The 2 billion document limit can be frustrating, but since an 
index that large on a single machine is most likely not going to perform 
well and should be split across several machines, there's almost no 
value to raising the limit and risking a large number of software bugs.


Thanks,
Shawn



Re: Solr still gives old data while faceting from the deleted documents

2018-04-12 Thread Shawn Heisey

On 4/12/2018 5:53 AM, girish.vignesh wrote:

Solr gives old data while faceting from old deleted or updated documents.

For example we are doing faceting on name. name changes frequently for our
application. When we index the document after changing the name we get both
old name and new name in the search results. After digging more on this I
got to know that Solr indexes are composed of segments (write once) and each
segment contains set of documents. Whenever hard commit happens these
segments will be closed and even if a document is deleted after that it will
still have those documents (which will be marked as deleted). These
documents will not be cleared immediately. It will not be displayed in the
search result though, but somehow faceting is still able to access those
data.


If all documents with that term are deleted, then this will be fixed by 
adding a facet.mincount=1 parameter to your facet URL.  If you are using 
the JSON facet API, then there is a mincount parameter that you can 
place into your JSON request. I've never actually used the JSON facet 
API, but there is documentation:


https://lucene.apache.org/solr/guide/7_2/json-facet-api.html#TermsFacet

The mincount parameter might make it unnecessary to optimize.  But if 
you are updating a LOT of your documents on a regular basis, you might 
find that it gives you better performance, so optimizing once a day 
during a time when traffic is low might be useful.
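
For example, from SolrJ the equivalent of adding facet.mincount=1 would look
roughly like this (the client, collection and field names are placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetWithMincount {
        static QueryResponse facetOnName(SolrClient client) throws Exception {
            SolrQuery q = new SolrQuery("*:*");
            q.setFacet(true);
            q.addFacetField("name");
            q.setFacetMinCount(1);   // skip facet values whose count is 0 (e.g. only deleted docs)
            return client.query("mycollection", q);
        }
    }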


Thanks,
Shawn



ant eclipse on branch_6_4

2018-04-12 Thread rgummadi
I cloned the lucene-solr git repository and am working on branch branch_6_4. I am
trying to make it Eclipse compatible, so I ran "ant eclipse" from the root folder.
I am getting the error below. Can someone suggest a resolution?

[ivy:retrieve]  ::
[ivy:retrieve]  ::  UNRESOLVED DEPENDENCIES ::
[ivy:retrieve]  ::
[ivy:retrieve]  :: junit#junit;4.10: not found
[ivy:retrieve]  ::
[ivy:retrieve]
[ivy:retrieve]  ERRORS
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]  impossible to acquire lock for junit#junit;4.10
[ivy:retrieve]
[ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS








Re: How many SynonymGraphFilterFactory can I have?

2018-04-12 Thread Vincenzo D'Amore
Thanks Shawn, synonyms right now are just organized into categories with
different meanings. Thanks a lot for the response.
I think this behaviour should be clearly stated in the documentation. Can I
access the Solr guide and add a few notes on this?

On Thu, Apr 12, 2018 at 11:40 AM, Shawn Heisey  wrote:

> On 4/12/2018 3:11 AM, Vincenzo D'Amore wrote:
>
>> Hi all, anyone could at least point me some good resource that explain how
>> to configure filters in fieldType building?
>>
>> Just understand if exist a document that explain the changes introduced
>> with SynonymGraphFilter or in general what kind of filters are compatible
>> and can stay together in the same chain.
>>
>
> https://lucene.apache.org/solr/guide/7_3/understanding-analy
> zers-tokenizers-and-filters.html
>
> I don't think you can have multiple Graph filters, even if they're the
> same filter, in a chain, whether it's index or query.
>
> I can understand the desire to have multiple synonym filters from the
> standpoint of organizing things into categories, but instead of having
> multiple synonym filters, you should have one filter pointing at a file
> containing all of the synonyms that you want to use.  You could have any
> automation scripts combine synonym files into a single file that gets
> uploaded to your running configuration directory.
>
> Thanks,
> Shawn
>
>


-- 
Vincenzo D'Amore


Solr still gives old data while faceting from the deleted documents

2018-04-12 Thread girish.vignesh
Solr gives old data while faceting from old deleted or updated documents.

For example we are doing faceting on name. name changes frequently for our
application. When we index the document after changing the name we get both
old name and new name in the search results. After digging more on this I
got to know that Solr indexes are composed of segments (write once) and each
segment contains set of documents. Whenever hard commit happens these
segments will be closed and even if a document is deleted after that it will
still have those documents (which will be marked as deleted). These
documents will not be cleared immediately. It will not be displayed in the
search result though, but somehow faceting is still able to access those
data.

Optimizing fixed this issue, but we cannot perform it each time a customer
changes data in production. I tried the options below and they did not work for
me.

1) *expungeDeletes*.

Added this line below in solrconfig.xml


  3
  false





  1


 // This is not working.
I don't think I can use expungeDeletes like this in solrconfig.xml.

When I send the commit parameters in the update URL, it works.

2) Using *TieredMergePolicyFactory* might not help me, as the threshold might
not always be reached and users will see old data during this time.

3) One more way of doing it is calling the *optimize*() method exposed in SolrJ
once daily, but I am not sure what impact this will have on performance.

4) I tried manipulating filterCache, documentCache and queryResultCache in
solrconfig.xml.

This did not solve my issue either. I do not think any cache is causing
this issue.

The number of documents we index per server will be at most 2M-3M.

Please suggest if there is any solution to this apart from
expungeDeletes/optimize.

Let me know if more data is needed.





Solr still gives old data while faceting from the deleted documents

2018-04-12 Thread girish.vignesh
Solr gives old data while faceting from old deleted or updated documents.

For example we are doing faceting on name. name changes frequently for our
application. When we index the document after changing the name we get both
old name and new name in the search results. After digging more on this I
got to know that Solr indexes are composed of segments (write once) and each
segment contains set of documents. Whenever hard commit happens these
segments will be closed and even if a document is deleted after that it will
still have those documents (which will be marked as deleted). These
documents will not be cleared immediately. It will not be displayed in the
search result though, but somehow faceting is still able to access those
data.

Optimizing fixed this issue, but we cannot perform it each time a customer
changes data in production. I tried the options below and they did not work for
me.

1) *expungeDeletes*.

Added this line below in solrconfig.xml


  3
  false





  1


  // This is not working.

I do not think I can add an expungeDeletes configuration like this. When I make
the expungeDeletes call using a curl command, it merges the segments.

2) Using *TieredMergePolicyFactory* might not help me, as the threshold might
not always be reached and users will see old data during this time.

3) One more way of doing it is calling the *optimize*() method exposed in SolrJ
once daily, but I am not sure what impact this will have on performance.

4) I tried manipulating filterCache, documentCache and queryResultCache. I do
not think the issue I am facing is caused by these caches.

The number of documents we index per server will be at most 2M-3M.

Please suggest if there is any solution to this.

Let me know if more data is needed.





DIH with huge data

2018-04-12 Thread Sujay Bawaskar
Hi,

We are using DIH with SortedMapBackedCache, but as the data size increases we
need to provide more heap memory to the Solr JVM.
Can we use multiple CSV files instead of database queries, and later join the
data in the CSV files using zipper? So the bottom line is to create a CSV file
for each entity in data-config.xml and join these CSV files using zipper.
We also tried the EHCache-based DIH cache, but since EHCache uses MMap IO it is
not a good fit with MMapDirectoryFactory and causes physical memory on the
machine to be exhausted.
Please suggest how we can handle the use case of importing a huge amount of
data into Solr.

-- 
Thanks,
Sujay P Bawaskar
M:+91-77091 53669


Re: in-place updates

2018-04-12 Thread Hendrik Haddorp

ah, right, sorry

On 11.04.2018 17:38, Emir Arnautović wrote:

Hi Hendrik,
Documentation clearly states conditions when in-place updates are possible: 
https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-In-PlaceUpdates
 

The first one mentions “numeric docValues”.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




On 11 Apr 2018, at 07:34, Hendrik Haddorp  wrote:

Hi,

in 
http://lucene.472066.n3.nabble.com/In-Place-Updates-not-working-as-expected-tp4375621p4380035.html
 some restrictions on the supported fields are given. I could however not find 
if in-place updates are supported for all field types or if they only work for 
say numeric fields.

thanks,
Hendrik






Re: Decision on Number of shards and collection

2018-04-12 Thread neotorand
Emir
I read from the link you shared that 
"Shard cannot contain more than 2 billion documents since Lucene is using
integer for internal IDs."

In which Java class of the Solr implementation repository can this be found?

Regards
Neo





Re: How many SynonymGraphFilterFactory can I have?

2018-04-12 Thread Shawn Heisey

On 4/12/2018 3:11 AM, Vincenzo D'Amore wrote:

Hi all, anyone could at least point me some good resource that explain how
to configure filters in fieldType building?

Just understand if exist a document that explain the changes introduced
with SynonymGraphFilter or in general what kind of filters are compatible
and can stay together in the same chain.


https://lucene.apache.org/solr/guide/7_3/understanding-analyzers-tokenizers-and-filters.html

I don't think you can have multiple Graph filters, even if they're the 
same filter, in a chain, whether it's index or query.


I can understand the desire to have multiple synonym filters from the 
standpoint of organizing things into categories, but instead of having 
multiple synonym filters, you should have one filter pointing at a file 
containing all of the synonyms that you want to use.  You could have any 
automation scripts combine synonym files into a single file that gets 
uploaded to your running configuration directory.
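
As a trivial sketch, such an automation step could just concatenate the
category files into the single synonyms file that the filter points at (the
file names here are made up):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class MergeSynonyms {
        public static void main(String[] args) throws Exception {
            List<String> merged = new ArrayList<>();
            for (String name : new String[]{"synonyms-brands.txt", "synonyms-colors.txt"}) {
                merged.addAll(Files.readAllLines(Paths.get(name), StandardCharsets.UTF_8));
            }
            // The single file referenced by the one synonym filter in the schema
            Path out = Paths.get("conf/synonyms.txt");
            Files.write(out, merged, StandardCharsets.UTF_8);
        }
    }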


Thanks,
Shawn



Re: Filter query question

2018-04-12 Thread Shawn Heisey

On 4/12/2018 1:46 AM, LOPEZ-CORTES Mariano-ext wrote:

In our search application we have one facet filter (Status)

Each status value corresponds to multiple values in the Solr database

Example : Status : Initialized --> status in solr =  11I, 12I, 13I, 14I, ...

On status value click, search is re-fired with fq filter:

fq: status:(11I OR 12I OR 13I )

This was very very inefficient. Filter query response time was longer than same 
search without filter!


How many different status values are you including in that query?  And 
how many unique values are there in the status field for the entire index?


Echoing what Emir said:  If the numbers are large, it's going to be 
slow.  But once that filter is run successfully, it should be cached 
until the next time you commit changes to your index and open a new 
searcher, or there are enough different filters executed that this entry 
is pushed out of the cache.


Having sufficient memory for your operating system to effectively cache 
your index data can speed up ALL queries.  How much memory do you have 
in the server?  How much memory have you allocated to Solr on that 
system?  Is there software other than Solr installed on the server?  How 
much disk space do all the Solr indexes on that server consume?  How 
many documents are represented by that disk space?


Back in late February, I discussed this with you when you were wanting 
query times below one second.



We have changed status value in Solr database for corresponding to visual 
filter values. In consequence, there is no OR in the fq filter.
The performance is better now.


This seems to say that you've made a change that improved your 
situation.  But reading what you wrote, I cannot tell what that change was.


If you're running at least version 4.10, you could use the terms query 
parser instead of a search clause with OR separators.  In addition to 
better performance, this query parser doesn't have the maxBooleanClauses 
limit.


https://lucene.apache.org/solr/guide/7_3/other-parsers.html
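
A sketch of what that looks like from SolrJ, using the status values from your
example (the client and collection are placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TermsFilterExample {
        static QueryResponse search(SolrClient client) throws Exception {
            SolrQuery q = new SolrQuery("*:*");
            // {!terms} builds one filter over the listed values instead of a
            // large OR clause, and is not subject to maxBooleanClauses
            q.addFilterQuery("{!terms f=status}11I,12I,13I,14I");
            return client.query("mycollection", q);
        }
    }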

Thanks,
Shawn



Re: How many SynonymGraphFilterFactory can I have?

2018-04-12 Thread Vincenzo D'Amore
Hi all, could anyone at least point me to a good resource that explains how
to configure filters when building a fieldType?

I just want to understand whether there is a document that explains the changes
introduced with SynonymGraphFilter, or in general what kinds of filters are
compatible and can stay together in the same chain.

Best regards,
Vincenzo

On Mon, Apr 9, 2018 at 7:24 PM, Vincenzo D'Amore  wrote:

> Hi all,
>
> in an Solr 4.8 schema I have a fieldType with few SynonymFilter filters at
> index and few at query time.
>
> Moving this old schema to Solr 7.3.0 I see that if I use
> SynonymGraphFilter during indexing, I have to follow it with
> FlattenGraphFilter.
>
> I also know that I cannot have multiple SynonymGraphFilter, because
> produce a graph but cannot consume an incoming graph.
>
> So, should I add an FlattenGraphFilter after each SynonymGraphFilter at
> index time to have more than one?
>
> And, again, how can I have many SynonymGraphFilter at query time? :)
>
> Thanks in advance for your time.
>
> Best regards,
> Vincenzo
>
> --
> Vincenzo D'Amore
>
>


-- 
Vincenzo D'Amore


Re: Filter query question

2018-04-12 Thread Emir Arnautović
Hi,
What is the number of these status indicators? It is expected to have slower 
query if you have more clauses since Solr/Lucene has to load postings for each 
term and then OR them. The real question is why it is constantly slow since you 
are using fq and it should be cached. Did you disable filter cache? Or you 
maybe commit frequently and do not have warmup query that would load that 
filter in cache?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 12 Apr 2018, at 09:46, LOPEZ-CORTES Mariano-ext 
>  wrote:
> 
> Hi
> 
> In our search application we have one facet filter (Status)
> 
> Each status value corresponds to multiple values in the Solr database
> 
> Example : Status : Initialized --> status in solr =  11I, 12I, 13I, 14I, ...
> 
> On status value click, search is re-fired with fq filter:
> 
> fq: status:(11I OR 12I OR 13I )
> 
> This was very very inefficient. Filter query response time was longer than 
> same search without filter!
> 
> We have changed status value in Solr database for corresponding to visual 
> filter values. In consequence, there is no OR in the fq filter.
> The performance is better now.
> 
> What is the reason?
> 
> Thanks!
> 



Filter query question

2018-04-12 Thread LOPEZ-CORTES Mariano-ext
Hi

In our search application we have one facet filter (Status)

Each status value corresponds to multiple values in the Solr database

Example : Status : Initialized --> status in solr =  11I, 12I, 13I, 14I, ...

On status value click, search is re-fired with fq filter:

fq: status:(11I OR 12I OR 13I )

This was very, very inefficient. The filter query response time was longer than the 
same search without the filter!

We have changed the status values in the Solr database to correspond to the visual 
filter values. As a consequence, there is no OR in the fq filter.
The performance is better now.

What is the reason?

Thanks!



Re: Decision on Number of shards and collection

2018-04-12 Thread neotorand
Thanks everyone for your beautiful explanations and valuable time.

Thanks Emir for the nice link
(http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html).
Thanks Shawn for
https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

When should we have more collections?

We have a business reason to keep the data in separate collections.
We don't need to query all the data at once.

When should we have more shards?
Define an acceptable latency.
Keep adding documents to a shard while the latency stays acceptable; that will
define the shard size (SS).
Get the size of all the data to be indexed (TS).
numShards = TS / SS (for example, with an acceptable shard size of 50GB and
200GB of total data, numShards = 200 / 50 = 4).

One quick question.
@Shawn
If I have data in more than one collection, can I still query it all at once?
I think yes, as I read on the Solr site.
What are the pros and cons of a single collection vs. multiple collections?

I have gone through the Lucidworks article on estimating memory and storage for Solr
(https://lucidworks.com/2011/09/14/estimating-memory-and-storage-for-lucenesolr/).

@SOLR4189 I will go through the book and get back to you. Thanks.

Time is too short to explore this long-lived open source technology.

Regards
Neo


