Re: Sort with subquery

2017-11-27 Thread Erick Erickson
No. You're missing the point that [subquery] is called when assembling
the return packet, which consists of only the top N docs from your
query against the static collection, _not_ as part of the search which
it would have to be to do what you want.

To sort the complete result set, you would have to ask for every document
that matches, "does the value here rank it in the top N (where
rows=N)?" That would be enormously expensive if each and every doc that
matched needed to make a sub-query.

bq: since we have combined them together in the result

You have not really done this. This is not like a SQL query; all
you're combining is the stored fields of the top "rows=" documents,
_not_ the complete result set.

You might be able to do something with streaming. You might be able to
do something with updateable docValues. But you can't do what you're
asking the way you're trying to.
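
For example (the collection names below are placeholders; the field names come
from this thread), a streaming expression could join the two collections via the
/export handler and order by the count, assuming the joined and sorted fields
have docValues:

sort(
  innerJoin(
    search(staticColl, q="*:*", fl="id,type", sort="id asc", qt="/export"),
    search(statusColl, q="*:*", fl="object_id,cnt", sort="object_id asc", qt="/export"),
    on="id=object_id"),
  by="cnt desc")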

Best,
Erick

On Mon, Nov 27, 2017 at 8:20 PM, Jinyi Lu  wrote:
> Thank you for the reply!
>
> In terms of sort, I am wondering whether it is possible to sort the docs from my 
> static collection based on the corresponding counts in the dynamic 
> collection, since we have combined them together in the result.
> Something like:
> sort=max(status.cnt) asc
>
> Or is it possible to add a multiValued pseudo field "cnts" using a subquery 
> for every static doc in the result, and then sort the static docs by the 
> pseudo field?
> Something like:
> fl=*,cnts:[subquery]&cnts.q={!term f=object_id 
> v=$row.id}&cnts.fl=cnt&sort=max(cnts) asc
>
> Thanks,
> Jinyi
>
> On 11/27/17, 6:04 PM, "Erick Erickson"  wrote:
>
> I'm not quite sure what "sort the results" means here. The [subquery]
> bit just adds a field to the output of the top N. So what you'd be
> doing here is just getting the top 10 (if rows=10) from your static
> collection, then adding the counts to them from the "dynamic"
> collection. So the sort here you're asking for would not be ordered by
> actual counts in the dynamic collection; the 11th document may have a
> count much greater than anything in the results list.
>
> If anything you need to turn it around and query your dynamic
> collection and add [subquery] to the top N from your static
> collection.
>
> Best,
> Erick
>
> Best,
> Erick
>
> On Mon, Nov 27, 2017 at 1:39 PM, Jinyi Lu  wrote:
> > Hi all,
> >
> > I have a question about how to sort results based on the fields in the 
> subquery. It’s exactly the same as this question posted on Stack Overflow 
> https://stackoverflow.com/questions/47127478/solr-how-to-sort-based-on-subquery
>  but no answer yet.
> >
> > Basically, I have two collections:
> >
> >   1.  Static data like the information about the objects.
> > {
> >   "id": "a",
> >   "type": "type1"
> > }
> >
> >   2.  Status about the objects in the previous collection which will be 
> frequently updated.
> > {
> >   "object_id": "a",
> >   "cnt": 1
> > }
> >
> > By using queries like q=id:*&fl=*,status:[subquery]&status.q={!term 
> f=object_id v=$row.id}, I am able to combine two collections 
> together and the response is something like:
> > [{
> >   "id": "a",
> >   "type": "type1"
> >   "status":{"numFound":1, "start":0, "docs":[
> > {
> >   "object_id": "a",
> >   "cnt": 1
> > }]
> >   }
> > },
> > …]
> >
> > But is there a way to sort the results based on the fields in the 
> subquery, like "cnt" in this case? Any ideas are appreciated!
> >
> > Thanks!
> > Jinyi
>
>


Re: Sort with subquery

2017-11-27 Thread Jinyi Lu
Thank you for the reply!

In terms of sort, I am wondering whether it is possible to sort the docs from my static 
collection based on the corresponding counts in the dynamic collection, since 
we have combined them together in the result.
Something like:
sort=max(status.cnt) asc

Or is it possible to add a multiValued pseudo field "cnts" using a subquery 
for every static doc in the result, and then sort the static docs by the pseudo 
field?
Something like:
fl=*,cnts:[subquery]&cnts.q={!term f=object_id 
v=$row.id}&cnts.fl=cnt&sort=max(cnts) asc

Thanks,
Jinyi

On 11/27/17, 6:04 PM, "Erick Erickson"  wrote:

I'm not quite sure what "sort the results" means here. The [subquery]
bit just adds a field to the output of the top N. So what you'd be
doing here is just getting the top 10 (if rows=10) from your static
collection, then adding the counts to them from the "dynamic"
collection. So the sort here you're asking for would not be ordered by
actual counts in the dynamic collection; the 11th document may have a
count much greater than anything in the results list.

If anything you need to turn it around and query your dynamic
collection and add [subquery] to the top N from your static
collection.

Best,
Erick

Best,
Erick

On Mon, Nov 27, 2017 at 1:39 PM, Jinyi Lu  wrote:
> Hi all,
>
> I have a question about how to sort results based on the fields in the 
subquery. It’s exactly the same as this question posted on Stack Overflow 
https://stackoverflow.com/questions/47127478/solr-how-to-sort-based-on-subquery 
 but no answer yet.
>
> Basically, I have two collections:
>
>   1.  Static data like the information about the objects.
> {
>   "id": "a",
>   "type": "type1"
> }
>
>   2.  Status about the objects in the previous collection which will be 
frequently updated.
> {
>   "object_id": "a",
>   "cnt": 1
> }
>
> By using queries like q=id:*&fl=*,status:[subquery]&status.q={!term 
f=object_id v=$row.id}, I am able to combine two collections 
together and the response is something like:
> [{
>   "id": "a",
>   "type": "type1"
>   "status":{"numFound":1, "start":0, "docs":[
> {
>   "object_id": "a",
>   "cnt": 1
> }]
>   }
> },
> …]
>
> But is there a way to sort the results based on the fields in the 
subquery, like "cnt" in this case? Any ideas are appreciated!
>
> Thanks!
> Jinyi




RE: Solr Spellcheck

2017-11-27 Thread GVK Prasad

Hi Alessandro,

My search component and request handler are included below. This is the config
that shipped with version 6.3.0.

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

  <str name="queryAnalyzerFieldType">text_general</str>

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">term</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>
</searchComponent>

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
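
For reference, with these defaults spellcheck.alternativeTermCount=5 also requests
suggestions for query terms that do exist in the index, which can look like Solr is
"correcting" words that are not misspelled. A quick way to check (request sketch;
host and core name are placeholders) is to compare

http://localhost:8983/solr/mycore/spell?q=educatione&spellcheck=true

with the same request plus spellcheck.alternativeTermCount=0, which suppresses
suggestions for terms that are already present in the index.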

My schema file is 40kb so I am including below only the fields added by me:

[field definitions missing from the archived message]

Thanks,
Prasad.

From: alessandro.benedetti
Sent: Monday, November 27, 2017 8:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Spellcheck

Do you mean you are over-spellchecking?
Correcting even "not misspelled words"?

Can you give us the request handler configuration, spellcheck configuration
and the schema ?

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html





IndexMergeTool to adhere to TieredMergePolicyFactory settings

2017-11-27 Thread Zheng Lin Edwin Yeo
Hi,

I'm currently using Solr 6.5.1.

I found that in IndexMergeTool.java there is this line
which sets the maxNumSegments to 1.

writer.forceMerge(1);


For this, does it mean that there will always be only 1 segment after the
merging? From what I see, that seems to be the case.

Is there any way we can allow the merging to produce multiple segments,
with each segment of a certain size? Like if we want each segment to be
20GB, which is what is set in the TieredMergePolicyFactory?


<int name="maxMergeAtOnce">10</int>
<int name="segmentsPerTier">10</int>
<double name="maxMergedSegmentMB">20480</double>
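
For reference, IndexWriter.forceMerge() takes the maximum number of segments to
leave behind, so a locally modified copy of the tool could call, for example:

writer.forceMerge(10);  // leave at most 10 segments instead of 1

Note this controls the number of segments, not their size.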


Regards,
Edwin


Re: Sort with subquery

2017-11-27 Thread Erick Erickson
I'm not quite sure what "sort the results" means here. The [subquery]
bit just adds a field to the output of the top N. So what you'd be
doing here is just getting the top 10 (if rows=10) from your static
collection, then adding the counts to them from the "dynamic"
collection. So the sort here you're asking for would not be ordered by
actual counts in the dynamic collection; the 11th document may have a
count much greater than anything in the results list.

If anything you need to turn it around and query your dynamic
collection and add [subquery] to the top N from your static
collection.
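
For instance, a sketch of that turned-around request, run against the dynamic
collection, would be:

q=*:*&sort=cnt desc&rows=10&fl=*,object:[subquery]&object.q={!term f=id v=$row.object_id}

with the object.q subquery directed at the static collection in the same way the
original status.q was directed at the status collection, so the top N is ordered
by cnt first and each row then pulls in its static doc.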

Best,
Erick

Best,
Erick

On Mon, Nov 27, 2017 at 1:39 PM, Jinyi Lu  wrote:
> Hi all,
>
> I have a question about how to sort results based on the fields in the 
> subquery. It’s exactly the same as this question posted on Stack Overflow 
> https://stackoverflow.com/questions/47127478/solr-how-to-sort-based-on-subquery
>  but no answer yet.
>
> Basically, I have two collections:
>
>   1.  Static data like the information about the objects.
> {
>   "id": "a",
>   "type": "type1"
> }
>
>   2.  Status about the objects in the previous collection which will be 
> frequently updated.
> {
>   "object_id": "a",
>   "cnt": 1
> }
>
> By using queries like q=id:*&fl=*,status:[subquery]&status.q={!term 
> f=object_id v=$row.id}, I am able to combine two collections together and the 
> response is something like:
> [{
>   "id": "a",
>   "type": "type1"
>   "status":{"numFound":1, "start":0, "docs":[
> {
>   "object_id": "a",
>   "cnt": 1
> }]
>   }
> },
> …]
>
> But is there a way to sort the results based on the fields in the subquery, 
> like "cnt" in this case? Any ideas are appreciated!
>
> Thanks!
> Jinyi


Sort with subquery

2017-11-27 Thread Jinyi Lu
Hi all,

I have a question about how to sort results based on the fields in the 
subquery. It’s exactly the same as this question posted on Stack Overflow 
https://stackoverflow.com/questions/47127478/solr-how-to-sort-based-on-subquery 
but no answer yet.

Basically, I have two collections:

  1.  Static data like the information about the objects.
{
  "id": "a",
  "type": "type1"
}

  2.  Status about the objects in the previous collection which will be 
frequently updated.
{
  "object_id": "a",
  "cnt": 1
}

By using queries like q=id:*&fl=*,status:[subquery]&status.q={!term 
f=object_id v=$row.id}, I am able to combine two collections together and the 
response is something like:
[{
  "id": "a",
  "type": "type1"
  "status":{"numFound":1, "start":0, "docs":[
{
  "object_id": "a",
  "cnt": 1
}]
  }
},
…]

But is there a way to sort the results based on the fields in the subquery, 
like "cnt" in this case? Any ideas are appreciated!

Thanks!
Jinyi


Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-27 Thread Joe Obernberger
Just to add onto this.  Right now the cluster has recovered, and life is 
good.  My concerns with a cluster restart are lock files and network 
timeouts on startup.  The 1st can be addressed by stopping indexing, 
waiting until things flush out, and then halting all the nodes.  No lock 
files.


The 2nd is the one I'm scared about.  We use puppet to start/stop all 
the 45 nodes in the cluster, and on startup there is a massive amount of 
HDFS activity that I'm afraid will put some of the replicas into 
recovery.  If that happens, then we're probably in for the recovery, 
fail, retry loop.  Anyone else run into this?


Thanks.

-Joe


On 11/27/2017 11:28 AM, Joe Obernberger wrote:
Thank you Erick.  Right now, we have our autoCommit time set to 
1800000 (30 minutes), and our autoSoftCommit set to 12.  The 
thought was that with HDFS we want less frequent, but larger 
operations, since HDFS has such a large block size.  Is that incorrect 
thinking?


As to why we are using HDFS.  For our use case, we already have a 
large cluster that runs HBase, and we want to index data within it.  
Adding another layer of storage that we would need to manage would add 
complexity.  With HDFS, we just add another box that has disk, and 
boom - more storage for all players involved.


-Joe


On 11/22/2017 8:17 PM, Erick Erickson wrote:

Hmm. This is quite possible. Any time things take "too long" it can be
  a problem. For instance, if the leader sends docs to a replica and
the request times out, the leader throws the follower into "Leader
Initiated Recovery". The smoking gun here is that there are no errors
on the follower, just the notification that the leader put it into
recovery.

There are other variations on the theme, it all boils down to when
communications fall apart replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
 wrote:
Hi Shawn - thank you for your reply.  The index is 29.9TBytes as reported
by:
hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9TBytes is due to HDFS having 3x replication.  There are about 1.1
billion documents indexed and we index about 2.5 million documents per day.

Assuming an even distribution, each node is handling about 680GBytes of
index.  So our cache size is 1.4%. Perhaps 'relatively small block cache'
was an understatement! This is why we split the largest collection into two,
where one is data going back 30 days, and the other is all the data.  Most
of our searches are not longer than 30 days back.  The 30 day index is
2.6TBytes total.  I don't know how the HDFS block cache splits between
collections, but the 30 day index performs acceptable for our specific
application.

If we wanted to cache 50% of the index, each of our 45 nodes would need a
block cache of about 350GBytes.  I'm accepting offers of DIMMs!

What I believe caused our 'recovery, fail, retry loop' was one of our
servers died.  This caused HDFS to start to replicate blocks across the
cluster and produced a lot of network activity.  When this happened, I
believe there was high network contention for specific nodes in the cluster
and their network interfaces became pegged and requests for HDFS blocks
timed out.  When that happened, SolrCloud went into recovery which caused
more network traffic.  Fun stuff.

-Joe


On 11/22/2017 11:44 AM, Shawn Heisey wrote:

On 11/22/2017 6:44 AM, Joe Obernberger wrote:

Right now, we have a relatively small block cache due to the
requirements that the servers run other software.  We tried to find
the best balance between block cache size, and RAM for programs, 
while

still giving enough for local FS cache.  This came out to be 84 128M
blocks - or about 10G for the cache per node (45 nodes total).

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk
space.  If the data is distributed somewhat evenly, then the answer to
my question would be that each of those 45 Solr servers would be
handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not in the
cache and Solr must actually wait for disk and/or network I/O, the
resulting performance usually isn't very good.  In most cases you don't
need to have enough memory to fully cache the index data ... but less
than half a percent is not going to be enough.

Thanks,
Shawn









Inverted Index positions vs Term Vector positions

2017-11-27 Thread alessandro.benedetti
Hi all,
it may sound like a silly question, but is there any reason that the term
positions in the inverted index use 1-based numbering while the Term
Vector positions use 0-based numbering [1]?

This may affect different areas in Solr and cause problems which are quite
tricky to spot.

Regards

[1] http://blog.jpountz.net/post/41301889664/putting-term-vectors-on-a-diet



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-27 Thread Joe Obernberger
Thank you Erick.  Right now, we have our autoCommit time set to 1800000 
(30 minutes), and our autoSoftCommit set to 12.  The thought was 
that with HDFS we want less frequent, but larger operations, since HDFS 
has such a large block size.  Is that incorrect thinking?
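
(For reference, in solrconfig.xml terms that is roughly the following; the soft
commit interval below is only a placeholder, and openSearcher=false is the usual
companion to a long hard-commit interval:)

<autoCommit>
  <maxTime>1800000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>120000</maxTime>
</autoSoftCommit>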


As to why we are using HDFS.  For our use case, we already have a large 
cluster that runs HBase, and we want to index data within it.  Adding 
another layer of storage that we would need to manage would add 
complexity.  With HDFS, we just add another box that has disk, and boom 
- more storage for all players involved.


-Joe


On 11/22/2017 8:17 PM, Erick Erickson wrote:

Hmm. This is quite possible. Any time things take "too long" it can be
  a problem. For instance, if the leader sends docs to a replica and
the request times out, the leader throws the follower into "Leader
Initiated Recovery". The smoking gun here is that there are no errors
on the follower, just the notification that the leader put it into
recovery.

There are other variations on the theme, it all boils down to when
communications fall apart replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
 wrote:

Hi Shawn - thank you for your reply.  The index is 29.9TBytes as reported
by:
hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9TBytes is due to HDFS having 3x replication.  There are about 1.1
billion documents indexed and we index about 2.5 million documents per day.
Assuming an even distribution, each node is handling about 680GBytes of
index.  So our cache size is 1.4%. Perhaps 'relatively small block cache'
was an understatement! This is why we split the largest collection into two,
where one is data going back 30 days, and the other is all the data.  Most
of our searches are not longer than 30 days back.  The 30 day index is
2.6TBytes total.  I don't know how the HDFS block cache splits between
collections, but the 30 day index performs acceptable for our specific
application.

If we wanted to cache 50% of the index, each of our 45 nodes would need a
block cache of about 350GBytes.  I'm accepting offers of DIMMs!

What I believe caused our 'recovery, fail, retry loop' was one of our
servers died.  This caused HDFS to start to replicate blocks across the
cluster and produced a lot of network activity.  When this happened, I
believe there was high network contention for specific nodes in the cluster
and their network interfaces became pegged and requests for HDFS blocks
timed out.  When that happened, SolrCloud went into recovery which caused
more network traffic.  Fun stuff.

-Joe


On 11/22/2017 11:44 AM, Shawn Heisey wrote:

On 11/22/2017 6:44 AM, Joe Obernberger wrote:

Right now, we have a relatively small block cache due to the
requirements that the servers run other software.  We tried to find
the best balance between block cache size, and RAM for programs, while
still giving enough for local FS cache.  This came out to be 84 128M
blocks - or about 10G for the cache per node (45 nodes total).
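
(For reference, that sizing is normally expressed through the HdfsDirectoryFactory
block cache system properties; the slab count below matches the "84 128M blocks"
described above, and the other values are only illustrative:)

-Dsolr.hdfs.blockcache.enabled=true
-Dsolr.hdfs.blockcache.direct.memory.allocation=true
-Dsolr.hdfs.blockcache.slab.count=84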

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk
space.  If the data is distributed somewhat evenly, then the answer to
my question would be that each of those 45 Solr servers would be
handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not in the
cache and Solr must actually wait for disk and/or network I/O, the
resulting performance usually isn't very good.  In most cases you don't
need to have enough memory to fully cache the index data ... but less
than half a percent is not going to be enough.

Thanks,
Shawn







Re: Solr7 org.apache.lucene.index.IndexUpgrader

2017-11-27 Thread Shawn Heisey
On 11/27/2017 2:58 AM, Leo Prince wrote:
> Actually I have two major cores. For one, the primary document store is MySQL
> and I can populate and re-index data from MySQL. However, the other core,
> with 40mil docs, keeps Solr as the primary store (with stored=true). I get
> that it's not a good practice, but due to some reasons I don't have any
> duplicate data store for this core. It's low-priority dynamic data piling
> up directly into Solr. In this context, is there a better method other than
> keeping the current (older) Solr as the data source..? Any other
> workarounds..? I am looking for the approach with the shortest execution time.

The "how do I reindex" question comes up frequently enough that I wrote
up a wiki page about it.  Feel free to accuse me of writing a page that
doesn't actually answer the question. ;)

https://wiki.apache.org/solr/HowToReindex

Thanks,
Shawn



Re: Solr Spellcheck

2017-11-27 Thread alessandro.benedetti
Do you mean you are over-spellchecking?
Correcting even "not misspelled words"?

Can you give us the request handler configuration, spellcheck configuration
and the schema ?

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Spellchecker Results

2017-11-27 Thread Sadiki Latty
This is perfect, thanks Emir.

-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] 
Sent: November-27-17 4:21 AM
To: solr-user@lucene.apache.org
Subject: Re: Spellchecker Results

Hi Sid,
I don’t think such a feature has been added to Solr, but there is Sematext’s 
component that does what you need: 
https://github.com/sematext/solr-researcher/tree/master/dym 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch 
Consulting Support Training - http://sematext.com/



> On 23 Nov 2017, at 16:34, Sadiki Latty  wrote:
> 
> Hi all,
> 
> Is it possible to return the results of a spellcheck in addition to the 
> spellcheck WITHOUT sending another query request?
> Example:
> Client sends "educatione", and the results return education results as well as 
> noting that the term "educatione" was spellchecked.
> 
> 
> Thanks
> 
> Sid Latty



Re: Solr7 org.apache.lucene.index.IndexUpgrader

2017-11-27 Thread Rick Leir
Leo
Your low priority data could be accumulated in a Couchbase DB or just in JSONL. 
Then it would be easy to re-index.
Cheers -- Rick
-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Embedded SOLR - Best practice?

2017-11-27 Thread alessandro.benedetti
When you say "caching 100.000 docs" what do you mean?
Being able to quickly find information in a corpus which increases in size
(100.000 docs) every day?

I second Erick, I think this is a fairly normal Solr use case.
If you really care about fast searches, you will get a fairly acceptable
default configuration.
Then, you can tune Solr caching if you need to.
Just remember that nowadays by default Solr is optimized for Near Real Time
Search and it makes heavy use of the Memory Mapping feature of modern OSs.
This means that Solr is not going to do I/O against the disk all the time;
index portions will be memory mapped (if the memory assigned to the OS is
enough on the machine).

Furthermore, you may use the heap memory assigned to the Solr JVM to cache
additional elements [1].

In conclusion: I have never used the embedded Solr Server (apart from
integration tests).

If you really want to play a bit with a scenario where you don't need
persistence on disk, you may play with the RAMDirectory [2], but also in this
case, I generally discourage this approach except for very specific use cases
and small indexes.

[1]
https://lucene.apache.org/solr/guide/6_6/query-settings-in-solrconfig.html#QuerySettingsinSolrConfig-Caches
[2]
https://lucene.apache.org/solr/guide/6_6/datadir-and-directoryfactory-in-solrconfig.html#DataDirandDirectoryFactoryinSolrConfig-SpecifyingtheDirectoryFactoryForYourIndex
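
For example, [2] is where the directory implementation gets switched; an in-memory
index (nothing persisted across restarts) would be configured along these lines in
solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/>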



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: TimeZone issue

2017-11-27 Thread alessandro.benedetti
Hi,
it is on my TO-DO list with low priority; there is a Jira issue already [1],
feel free to contribute to it!

[1] https://issues.apache.org/jira/browse/SOLR-8952






-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Merging of index in Solr

2017-11-27 Thread Zheng Lin Edwin Yeo
Hi,

I found that in IndexMergeTool.java there is this line
which sets the maxNumSegments to 1

writer.forceMerge(1);


For this, does it mean that there will always be only 1 segment after the
merging?

Is there any way we can allow the merging to produce multiple segments,
with each segment of a certain size? Like if we want each segment to be
20GB?

Regards,
Edwin


On 23 November 2017 at 20:35, Zheng Lin Edwin Yeo 
wrote:

> Hi Shawn,
>
> Thanks for the info. We will most likely be doing sharding when we migrate
> to Solr 7.1.0, and re-index the data.
>
> But as Solr 7.1.0 is still not ready to index EML files yet due to this
> JIRA, https://issues.apache.org/jira/browse/SOLR-11622, we have to make
> use with our current Solr 6.5.1 first, which was already created without
> sharding from the start.
>
> Regards,
> Edwin
>
> On 23 November 2017 at 12:50, Shawn Heisey  wrote:
>
>> On 11/22/2017 6:19 PM, Zheng Lin Edwin Yeo wrote:
>>
>>> I'm doing the merging on the SSD drive, the speed should be ok?
>>>
>>
>> The speed of virtually all modern disks will have almost no influence on
>> the speed of the merge.  The bottleneck isn't disk transfer speed, it's the
>> operation of the merge code in Lucene.
>>
>> As I said earlier in this thread, a merge is **NOT** just a copy. Lucene
>> must completely rebuild the data structures of the index to incorporate all
>> of the segments of the source indexes into a single segment in the target
>> index, while simultaneously *excluding* information from documents that
>> have been deleted.
>>
>> The best speed I have ever personally seen for a merge is 30 megabytes
>> per second.  This is far below the sustained transfer rate of a typical
>> modern SATA disk.  SSD is capable of far faster data transfer ...but it
>> will NOT make merges go any faster.
>>
>> We need to merge because the data are indexed in two different
>>> collections,
>>> and we need them to be under the same collection, so that we can do
>>> things
>>> like faceting more accurately.
>>> Will sharding alone achieve this? Or do we have to merge first before we
>>> do
>>> the sharding?
>>>
>>
>> If you want the final index to be sharded, it's typically best to index
>> from scratch into a new empty collection that has the number of shards you
>> want.  The merging tool you're using isn't aware of concepts like shards.
>> It combines everything into a single index.
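>>
>> For example (collection name, config name and shard count here are only
>> placeholders), the new collection can be created up front with the Collections
>> API and then re-indexed into:
>>
>> /admin/collections?action=CREATE&name=merged&numShards=4&replicationFactor=2&collection.configName=myconfig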
>>
>> It's not entirely clear what you're asking with the question about
>> sharding alone.  Making a guess:  I have never heard of facet accuracy
>> being affected by whether or not the index is sharded.  If that *is*
>> possible, then I would expect an index that is NOT sharded to have better
>> accuracy.
>>
>> Thanks,
>> Shawn
>>
>>
>


Re: Solr7 org.apache.lucene.index.IndexUpgrader

2017-11-27 Thread Leo Prince
Hi Daniel,

Thanks for the help.

Actually I have two major cores. For one, the primary document store is MySQL
and I can populate and re-index data from MySQL. However, the other core,
with 40mil docs, keeps Solr as the primary store (with stored=true). I get
that it's not a good practice, but due to some reasons I don't have any
duplicate data store for this core. It's low-priority dynamic data piling
up directly into Solr. In this context, is there a better method other than
keeping the current (older) Solr as the data source..? Any other
workarounds..? I am looking for the approach with the shortest execution time.
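
One sketch for that, using cursorMark so the old index can be walked without deep
paging (host, core and uniqueKey field here are placeholders):

http://oldhost:8983/solr/mycore/select?q=*:*&fl=*&sort=id+asc&rows=1000&cursorMark=*&wt=json

then post each page to the 7.1.0 collection's /update handler and repeat with the
nextCursorMark returned in every response.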

Thanks in advance,
Leo Prince

On Mon, Nov 27, 2017 at 2:58 PM, Daniel Collins 
wrote:

> Leo, the general rule of thumb here is that the Solr index should *not* be
> your main document store.  It is the index to your document store, but if
> it needs to be re-indexed, you should use your document store as the place
> to index from.
>
> Your index will not have the full source data (unless ALL your fields have
> stored=true, which suggests you are using it as a document store, see my
> first point), so you should look to where you index your data from, that is
> what should drive your new index.
>
> On 27 November 2017 at 07:16, Leo Prince 
> wrote:
>
> > Hi Shawn,
> >
> > Thanks for the help.
> >
> > I hate to burst your bubble here ... but 4 million docs is pretty small
> for
> > > a Solr index.  I have one index that's a hundred times larger, and
> there
> > > are people with *billions* of documents in SolrCloud.
> > >
> >
> > Sorry I missed a "0" there. It's actually 40 million; still, according to
> > you, it's a small-sized index. 
> >
> >
> > You would need to keep the schema the same for the upgrade, except that
> you
> > > would need to disable docValues on some of your fields to get rid of
> the
> > > error you encountered.  You won't be able to take advantage of some of
> > the
> > > new capability in the new version unless you re-engineer your
> > config/schema
> > > and reindex.
> > >
> >
> > Thanks.. Got your point.
> >
> >
> >
> > >
> > > Upgrading an index, especially through three major versions, is
> generally
> > > not recommended.  I always reindex when upgrading Solr, especially to a
> > new
> > > major version, because Solr evolves quickly.
> > >
> >
> > What method do you actually follow to re-index upon a Solr major upgrade..?
> >
> > I think I should re-index since I am upgrading 3 major versions at once.
> > What is the best method to re-index..? At present, I am planning to re-index
> > by SELECT from the previous version (4.10.2) and then UPDATE into the latest
> > version (7.1.0). Any other better thoughts on how to re-index..?
> >
> > Thanks in advance,
> > Leo Prince.
> >
>


Re: Strip out punctuation at the end of token

2017-11-27 Thread Emir Arnautović
Hi Sergio,
Is this the only case that needs “special” handling? If you are only after 
matching phone numbers then you need to think about both false negatives and 
false positives. E.g. if you go with only WDFF you will end up with an ‘008’ 
token. That means that you will also return this doc for any query like 
X-008, which is not expected behaviour. I guess that you will need to do a 
bit of regex to clean up the number and, as Erick explained, you need to focus on 
tokens that will end up in the index and make sure the right tokens are produced 
for different queries.
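
For instance, a token filter along these lines (just a sketch; the exact pattern
depends on which trailing punctuation should be stripped), placed before the
WordDelimiterFilterFactory, would drop the trailing dot so even a preserved
original token becomes 61149-008:

<filter class="solr.PatternReplaceFilterFactory" pattern="\.+$" replacement=""/>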

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 24 Nov 2017, at 19:35, Erick Erickson  wrote:
> 
> You need to play with the (many) parameters for WordDelimiterFilterFactory.
> 
> For instance, you have preserveOriginal set to 1. That's what's
> generating the token with the dot.
> 
> You have catenateAll and catenateNumbers set to zero. That means that
> someone searching for 61149008 won't get a hit.
> 
> The fact that the dot is in the tokens generated doesn't really matter
> as long as the query tokens produced will match.
> 
> I think you're getting a bit off track by focusing on the hyphen and
> dot, you're only seeing them in the index at all since you have
> preserveOriginal set to 1. Let's say that you set preserveOriginal to
> 0 and catenateNumbers to 1. Then you'd get:
> 61149
> 008
> 61149008
> 
> in your index. No dots, no hyphens.
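> 
> In filter terms that would be something like (a sketch, not a full analysis
> chain):
> 
> <filter class="solr.WordDelimiterFilterFactory"
>         generateWordParts="1" generateNumberParts="1"
>         catenateWords="0" catenateNumbers="1" catenateAll="0"
>         preserveOriginal="0"/>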
> 
> Now your _query_ analysis also has catenateNumbers as 1 and
> preserveOriginal as 0. The user searches for
> 61149-008
> 
> and the emitted tokens are in the index and you're OK. The user
> searches for 61149008 and gets a hit there too. The dot is irrelevant.
> 
> now, all that said if that isn't comfortable you could certainly add
> PatternReplaceFilterFactory, but really WDFF is designed for this kind
> of thing, I think you'll be just fine if you play with the options
> enough to understand the nuances, which can be tricky I'll admit..
> 
> 
> Best,
> Erick
> 
> On Fri, Nov 24, 2017 at 7:13 AM, Sergio García Maroto
>  wrote:
>> Yes. You are right. I understand now.
>> Let me explain my issue a bit better with the exact problem i have.
>> 
>> I have this text "Information number  61149-008."
>> Using the tokenizers and filters described previously i get this list of
>> tokens.
>> information
>> number
>> 61149-008.
>> 61149
>> 008
>> 
>> Basically last token   "61149-008."  gets tokenized as
>> 61149-008.
>> 61149
>> 008
>> User is searching for "61149-008" without dot, so this is not a match.
>> I don't want to change the tokenization on the query to avoid altering the
>> matches for other cases.
>> 
>> I would like to delete the dot at the end. Basically generate this extra
>> token
>> information
>> number
>> 61149-008.
>> 61149
>> 008
>> 61149-008
>> 
>> Not sure if what I am saying make sense or there is other way to do this
>> right.
>> 
>> Thanks a lot
>> Sergio
>> 
>> 
>> On 24 November 2017 at 15:31, Shawn Heisey  wrote:
>> 
>>> On 11/24/2017 2:32 AM, marotosg wrote:
>>> 
 Hi Shaw.
 Thanks for your reply. Actually my issue is with the last token. It looks
 like for the last token of a string. It keeps the dot.
 
 In your case Testing. This is a test. Test.
 
 Keeps the "Test."
 
 Is there any reason I can't see for that behauviour?
 
>>> 
>>> I am really not sure what you're saying here.
>>> 
>>> Every token is duplicated, one has the dot and one doesn't.  This is what
>>> you wanted based on what I read in your initial email.
>>> 
>>> Making a guess as to what you're asking about this time: If you're
>>> noticing that there isn't a "Test" as the last token on the line for WDF,
>>> then I have to tell you that it actually is there, the display was simply
>>> too wide for the browser window. Scrolling horizontally would be required
>>> to see the whole thing.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 



Re: Solr7 org.apache.lucene.index.IndexUpgrader

2017-11-27 Thread Daniel Collins
Leo, the general rule of thumb here is that the Solr index should *not* be
your main document store.  It is the index to your document store, but if
it needs to be re-indexed, you should use your document store as the place
to index from.

Your index will not have the full source data (unless ALL your fields have
stored=true, which suggests you are using it as a document store, see my
first point), so you should look to where you index your data from, that is
what should drive your new index.

On 27 November 2017 at 07:16, Leo Prince 
wrote:

> Hi Shawn,
>
> Thanks for the help.
>
> I hate to burst your bubble here ... but 4 million docs is pretty small for
> > a Solr index.  I have one index that's a hundred times larger, and there
> > are people with *billions* of documents in SolrCloud.
> >
>
> Sorry I missed a "0" there. It's actually 40 million; still, according to
> you, it's a small-sized index. 
>
>
> You would need to keep the schema the same for the upgrade, except that you
> > would need to disable docValues on some of your fields to get rid of the
> > error you encountered.  You won't be able to take advantage of some of
> the
> > new capability in the new version unless you re-engineer your
> config/schema
> > and reindex.
> >
>
> Thanks.. Got your point.
>
>
>
> >
> > Upgrading an index, especially through three major versions, is generally
> > not recommended.  I always reindex when upgrading Solr, especially to a
> new
> > major version, because Solr evolves quickly.
> >
>
> What method do you actually follow to re-index upon a Solr major upgrade..?
>
> I think I should re-index since I am upgrading 3 major versions at once.
> What is the best method to re-index..? At present, I am planning to re-index by
> SELECT from the previous version (4.10.2) and then UPDATE into the latest
> version (7.1.0). Any other better thoughts on how to re-index..?
>
> Thanks in advance,
> Leo Prince.
>


Re: Spellchecker Results

2017-11-27 Thread Emir Arnautović
Hi Sid,
I don’t think such a feature has been added to Solr, but there is Sematext’s 
component that does what you need: 
https://github.com/sematext/solr-researcher/tree/master/dym 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 23 Nov 2017, at 16:34, Sadiki Latty  wrote:
> 
> Hi all,
> 
> Is it possible to return the results of a spellcheck in addition to the 
> spellcheck WITHOUT sending another query request?
> Example:
> Client sends "educatione", and the results return education results as well as 
> noting that the term "educatione" was spellchecked.
> 
> 
> Thanks
> 
> Sid Latty