RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-19 Thread Alessandro Benedetti
Hi David,
good to know that sorting solved your problem.
I understand perfectly that given the urgency of your situation, having the
solution ready takes priority over continuing with the investigations.

I would still recommend opening a Jira issue in Apache Solr with all the
information gathered so far.
Your situation caught our attention: changing the order of the input
documents definitely shouldn't affect the index size (at least not by such
a large factor).
The fact that the optimize didn't change anything is even more suspicious.
It may indicate that, in some edge cases, the ordering of the input
documents affects one of the index data structures.
As a last thing, when you have time, I would suggest that you:

1) index the data in the order that gives you a small index - Optimize -
take note of the size by index file extension

2) index the data in the order that gives you a big index - Optimize -
take note of the size by index file extension

And attach that to the Jira issue.
Whenever someone picks it up, that would definitely help.
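
For what it's worth, the per-extension totals could be collected with
something like the sketch below. It is only an illustration: the function
name index_size_by_ext is made up, and the directory you pass is assumed to
be the core's data/index directory. Files without a dot in their name (for
example the segments file) are reported under their full name.

```shell
#!/bin/sh
# Sum the sizes (in bytes) of all files in an index directory, grouped by
# file extension, and print one "extension total" line per extension.
index_size_by_ext() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip sub-directories and empty globs
        name=${f##*/}                    # strip the directory part
        ext=${name##*.}                  # keep what follows the last dot
        size=$(wc -c < "$f")             # file size in bytes
        printf '%s %s\n' "$ext" "$size"
    done |
    awk '{ sum[$1] += $2 } END { for (e in sum) printf "%s %d\n", e, sum[e] }' |
    sort
}
```

Run it once per ordering, after the optimize has finished, e.g.
index_size_by_ext /path/to/core/data/index (the path is a placeholder), and
attach both outputs.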

Cheers




--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-18 Thread Howe, David

Hi Erick & Alessandro,

I have solved my problem by re-ordering the data in the SQL query.  I don't 
know why it works but it does.  I can consistently reproduce the problem by 
changing nothing except the order of the data in the database table.  Our Solr 
build is scripted and we always build a new Solr server from scratch, so I'm 
pretty confident that the defaults haven't changed between test runs: when we 
create the Solr index, Solr doesn't know what order the data in the database 
table is in.

I did try removing the geo location field to see if that made a difference, and 
it didn't.

Due to project commitments, I don't have any time to investigate this further 
at the moment.  When/if things quiet down I may see if I can reproduce the 
problem with a smaller number of records loaded from a flat file to make it 
easier to share a project that shows the problem occurring.
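
A flat-file reproduction could produce the two orderings with plain sort.
This is only a sketch under assumptions: the CSV layout
(id,dpid,locality,postcode, no header, no embedded commas) and the function
names are hypothetical, not taken from our actual build.

```shell
#!/bin/sh
# Assumed (hypothetical) CSV columns: 1=id, 2=dpid, 3=locality, 4=postcode.

# Ordering that gave the large index: numeric sort on dpid.
order_by_dpid() {
    LC_ALL=C sort -t, -k2,2n "$1"
}

# Ordering that gave the small index: locality first, then numeric postcode.
order_by_locality_postcode() {
    LC_ALL=C sort -t, -k3,3 -k4,4n "$1"
}
```

Each output file can then be loaded into a fresh core so that the row order
is the only thing that differs between the two runs.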

Thanks again for all of your assistance and suggestions.

Regards,

David

David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au

Australia Post is committed to providing our customers with excellent service. 
If we can assist you in any way please telephone 13 13 18 or visit our website.

The information contained in this email communication may be proprietary, 
confidential or legally professionally privileged. It is intended exclusively 
for the individual or entity to which it is addressed. You should only read, 
disclose, re-transmit, copy, distribute, act in reliance on or commercialise 
the information if you are authorised to do so. Australia Post does not 
represent, warrant or guarantee that the integrity of this email communication 
has been maintained nor that the communication is free of errors, virus or 
interference.

If you are not the addressee or intended recipient please notify us by replying 
direct to the sender and then destroy any electronic or paper copy of this 
message. Any views expressed in this email communication are taken to be those 
of the individual sender, except where the sender specifically attributes those 
views to Australia Post and is authorised to do so.

Please consider the environment before printing this email.


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Erick Erickson
I didn't mean to imply that _you'd_ changed things, the _defaults_ may
have changed. So the "string" fieldType may be defined with
docValues="true" in your new schema and "false" in your old schema
without you intentionally changing anything at _all_.

That's why the LukeRequestHandler will help, because it tells you
what's _there_ regardless of how it got there...
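
In case it helps, here is a rough sketch of asking the LukeRequestHandler
about a single field. The helper name luke_field_url is made up; the host
and the "address" core are the ones used in the curl commands elsewhere in
this thread, so adjust them for your setup.

```shell
#!/bin/sh
# Build the LukeRequestHandler URL for one field of one core.
# numTerms=0 skips the top-terms listing so the response stays small.
luke_field_url() {
    base=$1
    core=$2
    field=$3
    printf '%s/%s/admin/luke?fl=%s&numTerms=0&wt=json\n' "$base" "$core" "$field"
}

# Example (needs a running Solr):
#   curl -s "$(luke_field_url http://localhost:8983/solr address dpid)"
```

The response reports the flags for the field as defined in the schema and as
actually found in the index, which is the comparison I'm after.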

Best,
Erick



RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Howe, David

Hi Erick,

I'm 99% sure that I haven't changed the field types between the two snapshots 
as all of my test runs are completely scripted and build a new Solr server from 
scratch (both the virtual machine and the Solr software).  I can diff the 
scripts between two runs to make sure I haven't accidentally changed anything, 
and I have done this.

The only difference is that I added docValues=false to all of the fields that 
are indexed=false and stored=true in the run that is smaller.  I had tested 
this previously with the data in the order that makes the index larger and it 
only made a minor difference (see one of my previous posts).  Unfortunately, I 
hadn't added the change to log the file sizes when I did that run, but it 
definitely didn't fix the problem.

I need to try and get my project back on track now, so I will concentrate on 
the "fix" that I have and perhaps re-run some other scenarios when I have more 
time.

Thanks again for your help.

Regards,

David



Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Erick Erickson
Well, I'm not entirely sure either ;)

Here's what I'm seeing. BTW, I'm making a couple of assumptions here. In
the one listing, your biggest segment starts with _7l and in the other
it's _zd. The aggregate size is 2,815M for _7l and 705M for _zd. So
multiplying the individual files in _zd by 4 (poor-man's normalization),
I get these major differences:

ext   _7l (M)   _zd (M)
dim        84       200   These are points fields.
fdt     1,431     1,000   These are stored data; the discrepancy here
                          goes the "other" way.
pos       335       480   Position information.
dvd       165       400   docValues
tim        34        80   Terms dictionary
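
The poor-man's normalization is just a ratio of the two aggregate segment
sizes. A small sketch, with made-up helper names, that does the arithmetic
with awk:

```shell
#!/bin/sh
# Ratio between two aggregate segment sizes (both in the same unit).
scale_factor() {
    awk -v big="$1" -v small="$2" 'BEGIN { printf "%.2f\n", big / small }'
}

# Scale one file size from the smaller segment up to the larger segment's
# aggregate size, rounded to the nearest whole unit.
normalize_size() {
    awk -v size="$1" -v big="$2" -v small="$3" \
        'BEGIN { printf "%.0f\n", size * big / small }'
}
```

scale_factor 2815 705 prints 3.99, which is why multiplying by 4 is close
enough for eyeballing the differences.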

I don't think the fdt or pos fields matter all that much, they're
"close enough". That said, I'd guess you have some position
information turned on in the more recent Solr that wasn't in the old
one.

Points and dvd fields are much more interesting, as well as terms dictionary.

I doubt you've consciously changed the field types, but some of the
_defaults_ have changed in the fieldType definitions. Perhaps that
accounts for some of the difference? That's why I was curious about
what the LukeRequestHandler (or Luke itself) shows for each field
type. Those tools show you what's actually _in_ the index's metadata,
not just what is in the schema file.

As for the sorting, I'll have to defer to the people who understand
how spatial data is stored.

Best,
Erick


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Howe, David

Hi Erick,

Thinking some more about the differences between the two sort orders has 
suggested another possibility.  We also have a geo spatial field defined in the 
index:

  echo "$(date) Creating geoLocation field"
  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field":{
      "name":"geoLocation",
      "type":"location",
      "stored":true,
      "indexed":true
    }
  }' http://localhost:8983/solr/address/schema

One of the differences between the two sort orders is that when the data is 
sorted by locality and post code, it means that addresses that are close to 
each other will be sorted together as both locality and postcode have 
geographic meaning.  So when they are indexed, they will be indexed in groups 
of addresses that are quite near to each other.

When the data is sorted by DPID, the order is near random as the dpid has no 
meaning at all, so the geo location sequence should be random as well.

I don't have time to test this at the moment, as I need to get my project back 
on track after chasing this performance issue but it might ring a bell with 
somebody.

Regards,

David



RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Howe, David

Hi Erick,

Below is the file listing for when the index is loaded with the table ordered 
in a way that produces the smaller index.

I have checked the console, and we have no deleted docs and we have the same 
number of docs in the index as there are rows in the staging table that we load 
from.  I would be surprised if this wasn't the case as we use the primary key 
from the staging table as the id in Solr, so it is pretty much guaranteed to be 
unique.  The primary key in the staging table is a NUMBER(10, 0) column which 
contains the row number in Oracle, so it starts from 1 and goes up to 
14,061,990.  We load the index in row number order.

When we get the larger sized index, the table is sequenced by a field named 
DPID which is a NUMBER(10, 0) in Oracle.  The corresponding Solr definition for 
that field is:

  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field":{
      "name":"dpid",
      "type":"pint",
      "stored":true,
      "indexed":true
    }
  }' http://localhost:8983/solr/address/schema

When we get the smaller sized index, the table is sequenced by locality 
(VARCHAR2(80)) and then postcode (VARCHAR2(4)).  The corresponding Solr 
definition for these fields is:

  echo "$(date) Creating locality field"
  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field":{
      "name":"locality",
      "type":"locality",
      "stored":true,
      "indexed":true
    }
  }' http://localhost:8983/solr/address/schema

  echo "$(date) Creating postcode field"
  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field":{
      "name":"postcode",
      "type":"pint",
      "stored":true,
      "indexed":true
    }
  }' http://localhost:8983/solr/address/schema
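
To see what these definitions resolve to once the field type defaults are
applied, the Schema API can read a field back. A rough sketch follows; the
helper name schema_field_url is made up, and showDefaults=true asks Solr to
include the properties inherited from the field type.

```shell
#!/bin/sh
# Build the Schema API URL that returns one field definition, including the
# default properties inherited from its field type.
schema_field_url() {
    base=$1
    core=$2
    field=$3
    printf '%s/%s/schema/fields/%s?showDefaults=true\n' "$base" "$core" "$field"
}

# Example (needs a running Solr):
#   curl -s "$(schema_field_url http://localhost:8983/solr address dpid)"
```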

Not sure if this helps or not.

Regards,

David

total 5300812
-rw-r--r-- 1 solr solr        97 Feb 16 04:12 _14o.dii
-rw-r--r-- 1 solr solr  45400325 Feb 16 04:12 _14o.dim
-rw-r--r-- 1 solr solr 221114041 Feb 16 04:10 _14o.fdt
-rw-r--r-- 1 solr solr    286434 Feb 16 04:10 _14o.fdx
-rw-r--r-- 1 solr solr      6370 Feb 16 04:12 _14o.fnm
-rw-r--r-- 1 solr solr  17379224 Feb 16 04:12 _14o.nvd
-rw-r--r-- 1 solr solr       463 Feb 16 04:12 _14o.nvm
-rw-r--r-- 1 solr solr       620 Feb 16 04:12 _14o.si
-rw-r--r-- 1 solr solr 147867580 Feb 16 04:11 _14o_Lucene50_0.doc
-rw-r--r-- 1 solr solr 111291706 Feb 16 04:11 _14o_Lucene50_0.pos
-rw-r--r-- 1 solr solr  18793856 Feb 16 04:11 _14o_Lucene50_0.tim
-rw-r--r-- 1 solr solr    360329 Feb 16 04:11 _14o_Lucene50_0.tip
-rw-r--r-- 1 solr solr  91972283 Feb 16 04:12 _14o_Lucene70_0.dvd
-rw-r--r-- 1 solr solr      4173 Feb 16 04:12 _14o_Lucene70_0.dvm
-rw-r--r-- 1 solr solr       405 Feb 16 04:20 _16l.cfe
-rw-r--r-- 1 solr solr  10956277 Feb 16 04:20 _16l.cfs
-rw-r--r-- 1 solr solr       455 Feb 16 04:20 _16l.si
-rw-r--r-- 1 solr solr       405 Feb 16 04:30 _18t.cfe
-rw-r--r-- 1 solr solr  11619394 Feb 16 04:30 _18t.cfs
-rw-r--r-- 1 solr solr       455 Feb 16 04:30 _18t.si
-rw-r--r-- 1 solr solr        97 Feb 16 04:34 _19e.dii
-rw-r--r-- 1 solr solr  39424990 Feb 16 04:34 _19e.dim
-rw-r--r-- 1 solr solr 188005197 Feb 16 04:33 _19e.fdt
-rw-r--r-- 1 solr solr    249160 Feb 16 04:33 _19e.fdx
-rw-r--r-- 1 solr solr      6370 Feb 16 04:34 _19e.fnm
-rw-r--r-- 1 solr solr  14660427 Feb 16 04:34 _19e.nvd
-rw-r--r-- 1 solr solr       463 Feb 16 04:34 _19e.nvm
-rw-r--r-- 1 solr solr       620 Feb 16 04:34 _19e.si
-rw-r--r-- 1 solr solr 131101691 Feb 16 04:33 _19e_Lucene50_0.doc
-rw-r--r-- 1 solr solr  97734855 Feb 16 04:33 _19e_Lucene50_0.pos
-rw-r--r-- 1 solr solr  16502289 Feb 16 04:33 _19e_Lucene50_0.tim
-rw-r--r-- 1 solr solr    320224 Feb 16 04:33 _19e_Lucene50_0.tip
-rw-r--r-- 1 solr solr  78801516 Feb 16 04:34 _19e_Lucene70_0.dvd
-rw-r--r-- 1 solr solr      2097 Feb 16 04:34 _19e_Lucene70_0.dvm
-rw-r--r-- 1 solr solr       405 Feb 16 04:35 _19y.cfe
-rw-r--r-- 1 solr solr  78051374 Feb 16 04:35 _19y.cfs
-rw-r--r-- 1 solr solr       455 Feb 16 04:35 _19y.si
-rw-r--r-- 1 solr solr       405 Feb 16 04:37 _1ai.cfe
-rw-r--r-- 1 solr solr  53311170 Feb 16 04:37 _1ai.cfs
-rw-r--r-- 1 solr solr       455 Feb 16 04:37 _1ai.si
-rw-r--r-- 1 solr solr       405 Feb 16 04:40 _1b2.cfe
-rw-r--r-- 1 solr solr  70986259 Feb 16 04:40 _1b2.cfs
-rw-r--r-- 1 solr solr       455 Feb 16 04:40 _1b2.si
-rw-r--r-- 1 solr solr       405 Feb 16 04:41 _1bc.cfe
-rw-r--r-- 1 solr solr  10338200 Feb 16 04:41 _1bc.cfs
-rw-r--r-- 1 solr solr       455 Feb 16 04:41 _1bc.si
-rw-r--r-- 1 solr solr       405 Feb 16 04:42 _1bm.cfe
-rw-r--r-- 1 solr solr  68074070 Feb 16 04:42 _1bm.cfs
-rw-r--r-- 1 solr solr       455 Feb 16 04:42 _1bm.si
-rw-r--r-- 1 solr solr       405 Feb 16 04:45 _1c5.cfe
-rw-r--r-- 1 solr solr  67766868 Feb 16 04:45 _1c5.cfs
-rw-r--r-- 1 solr solr       455 Feb 16 04:45 _1c5.si
-rw-r--r-- 1 solr solr        91 Feb 16 04:45 _1c6.dii
-rw-r--r-- 1 solr solr    666032 Feb 16 04:45 _1c6.dim
-rw-r--r-- 1 solr solr   2515129 Feb 16 04:45 _1c6.fdt

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Howe, David

Hi Alessandro,

There are 14,061,990 records in the staging table and that is how many 
documents that we end up with in Solr.  I would be surprised if we have a 
problem with the id, as we use the primary key of the table as the id in Solr 
so it must be unique.

The primary key of the staging table is a NUMBER(10, 0) in Oracle, and we set 
it to the row number when we are populating the table.  So the id's will start 
at 1 and go up to 14,061,990 and we load the records in id order.

Regards,

David



Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Alessandro Benedetti
It's a silly thing, but to confirm the direction that Erick is suggesting:
how many rows are there in the DB?
If updates were happening in Solr (causing the deletes), I would expect a
greater number of documents in the DB than in the Solr index.
Is the DB primary key (if any) the same as the uniqueKey field in Solr?

Regards

--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io



RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Howe, David

Hi Emir,

We have no copy field definitions.  To keep things simple, we have a one to one 
mapping between the columns in our staging table and the fields in our Solr 
index.

Regards,

David



Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Emir Arnautović
Hi David,
I skimmed through the thread and don’t see whether this has already been 
eliminated, so I will ask: can you check if there are any copyField rules that 
are triggered when the new field is added? You mentioned that ordering fixed 
the size of the index, but it might be worth checking.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 16 Feb 2018, at 05:05, Erick Erickson  wrote:
> 
> This isn't terribly useful without a similar dump of "the other" index
> directory. The point is to compare the different extensions for some
> segment where the sum of all the files in that segment is roughly
> equal. So if you have a listing of the old index around, that would
> help.
> 
> bq: We don't have any deleted docs in our index, as we always build it
> from a brand new virtual machine with a brand new installation of
> Solr.
> 
> Well, that's an assumption I want to check. Here's the problem. It's
> possible that the ordering bit you're talking about is really masking
> indexing the same <uniqueKey> multiple times. Since indexing a doc
> with the same <uniqueKey> just marks the old doc as deleted, the old
> doc will take up room in your index until it's purged during segment
> merging. This is a _really_ long shot mind you, I have a hard time
> believing that this is the root cause here. It's worth checking
> though. Even doing a q=*:* won't help since that doesn't count deleted
> docs. Take a quick glance at the admin overview page for a core and
> check, there is "maxDoc", "deletedDocs" and "numDocs". I expect
> deletedDocs will be zero and numDocs and maxDoc will be your 14M, but
> this problem is so odd that I'm covering as many  bases as I can think
> of ;)
> 
> Now, ordering may appear to change things, but that could simply be
> that the deleted docs don't happen to fall in segments that are
> merged. Again, this is unlikely but possible.
> 
> The shortcut here would be to optimize afterwards. In the usual course
> of events this should _not_ be necessary (or even desirable) unless
> you do it every time you build your index for arcane reasons, see:
> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
> But if you do optimize (forceMerge) and the size drops back to more
> reasonable levels it would be a clue.
> 
> Ordering simply should not affect the final index size except for,
> possibly, changing the number of deleted docs in the index largely
> through chance. If you do see a dramatic difference, try the optimize
> thing to check.
> 
> If simple ordering _does_ really make a difference (outside of number
> of deleted docs)  my understanding of Solr is going to undergo a
> revision. And we'll probably be raising a JIRA or two ;)
> 
> Now, what I really expect the issue is is one of two things:
> 1> you have some options turned on now that weren't before, either
> through some innocent-seeming change, a change in the internal
> defaults etc.
> 2> your SQL with the extra field is behaving unexpectedly.
> 
> The proof is of course in the pudding...
> 
> Best,
> Erick
> 
> 
> 
> On Thu, Feb 15, 2018 at 5:15 PM, Howe, David  
> wrote:
>> 
>> Hi Erick,
>> 
>> I have the full dump of the Solr index file sizes as well if that is of any 
>> help.  I have attached it below this message.
>> 
>> We don't have any deleted docs in our index, as we always build it from a 
>> brand new virtual machine with a brand new installation of Solr.
>> 
>> The ordering is definitely making a difference, as I can run the same 
>> indexing configuration over a table with the same data just in different 
>> orders and it produces these vastly different results.  I have been chasing 
>> this for a couple of weeks trying to work out what the difference is when we 
>> just add one extra field.  The difference that I have found is that the 
>> extra field causes the staging table population query to be optimised 
>> differently and to select the records in a different sequence.  When I force 
>> the records back to their original sequence, the index goes back to being 
>> small again.
>> 
>> I'm currently re-building my staging data to try and get it into the same 
>> order as before and including the extra field.  I will post the file sizes 
>> again when I have that result.
>> 
>> Regards,
>> 
>> David
>> 
>> total 14600404
>> -rw-r--r-- 1 solr solr         97 Feb 14 01:34 _7l.dii
>> -rw-r--r-- 1 solr solr   83831801 Feb 14 01:34 _7l.dim
>> -rw-r--r-- 1 solr solr 1431645451 Feb 14 01:33 _7l.fdt
>> -rw-r--r-- 1 solr solr     381994 Feb 14 01:33 _7l.fdx
>> -rw-r--r-- 1 solr solr       6370 Feb 14 01:34 _7l.fnm
>> -rw-r--r-- 1 solr solr   29353048 Feb 14 01:34 _7l.nvd
>> -rw-r--r-- 1 solr solr        463 Feb 14 01:34 _7l.nvm
>> -rw-r--r-- 1 solr solr        606 Feb 14 01:34 _7l.si
>> -rw-r--r-- 1 solr solr  734701117 Feb 14 01:34 _7l_Lucene50_0.doc
>> -rw-r--r-- 1 solr solr  335043096 Feb 14 01:34 

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Erick Erickson
This isn't terribly useful without a similar dump of "the other" index
directory. The point is to compare the different extensions for some
segment where the sum of all the files in that segment is roughly
equal. So if you have a listing of the old index around, that would
help.

bq: We don't have any deleted docs in our index, as we always build it
from a brand new virtual machine with a brand new installation of
Solr.

Well, that's an assumption I want to check. Here's the problem. It's
possible that the ordering bit you're talking about is really masking
indexing the same  multiple times. Since indexing a doc
with the same  just marks the old doc as deleted, the old
doc will take up room in your index until it's purged during segment
merging. This is a _really_ long shot mind you, I have a hard time
believing that this is the root cause here. It's worth checking
though. Even doing a q=*:* won't help since that doesn't count deleted
docs. Take a quick glance at the admin overview page for a core and
check, there is "maxDoc", "deletedDocs" and "numDocs". I expect
deletedDocs will be zero and numDocs and maxDoc will be your 14M, but
this problem is so odd that I'm covering as many bases as I can think
of ;)
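
The same three counters can be read without the admin UI. The sketch below is a minimal illustration, assuming the Luke request handler is enabled at its usual path and returns an "index" section with numDocs and maxDoc; the base URL and the core name "mycore" are placeholders, not values from this thread.

```python
import json
import urllib.request


def luke_url(base_url, core):
    """Build the Luke handler URL for a core (numTerms=0 skips per-field term stats)."""
    return f"{base_url.rstrip('/')}/{core}/admin/luke?numTerms=0&wt=json"


def doc_counts(luke_response):
    """Extract numDocs and maxDoc, and derive the number of deleted docs."""
    index = luke_response["index"]
    num_docs = index["numDocs"]
    max_doc = index["maxDoc"]
    # maxDoc still counts deleted docs that have not been merged away;
    # numDocs does not, so the difference is the deleted-doc count.
    return {"numDocs": num_docs, "maxDoc": max_doc, "deleted": max_doc - num_docs}


def fetch_doc_counts(base_url, core):
    """Query a running Solr instance (placeholder URL and core name)."""
    with urllib.request.urlopen(luke_url(base_url, core)) as resp:
        return doc_counts(json.load(resp))
```

If "deleted" comes back as zero for both builds, the deleted-docs theory can be ruled out.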

Now, ordering may appear to change things, but that could simply be
that the deleted docs don't happen to fall in segments that are
merged. Again, this is unlikely but possible.

The shortcut here would be to optimize afterwards. In the usual course
of events this should _not_ be necessary (or even desirable) unless
you do it every time you build your index for arcane reasons, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
But if you do optimize (forceMerge) and the size drops back to more
reasonable levels it would be a clue.
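
For a scripted build, the optimize can be triggered over HTTP through the update handler. This is only a sketch: the host, port and core name are placeholders, and it assumes the standard optimize and maxSegments update parameters.

```python
from urllib.parse import urlencode


def optimize_url(base_url, core, max_segments=1):
    """Build an update URL that asks Solr to optimize (forceMerge) the core."""
    params = urlencode({"optimize": "true", "maxSegments": max_segments})
    return f"{base_url.rstrip('/')}/{core}/update?{params}"


# Placeholder host and core name; issuing a GET against this URL starts the merge.
print(optimize_url("http://localhost:8983/solr", "mycore"))
```

Comparing the index directory size before and after the request shows whether merged-away deleted docs account for the difference.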

Ordering simply should not affect the final index size except for,
possibly, changing the number of deleted docs in the index largely
through chance. If you do see a dramatic difference, try the optimize
thing to check.

If simple ordering _does_ really make a difference (outside of number
of deleted docs)  my understanding of Solr is going to undergo a
revision. And we'll probably be raising a JIRA or two ;)

Now, what I really expect the issue is is one of two things:
1> you have some options turned on now that weren't before, either
through some innocent-seeming change, a change in the internal
defaults etc.
2> your SQL with the extra field is behaving unexpectedly.

The proof is of course in the pudding...

Best,
Erick



On Thu, Feb 15, 2018 at 5:15 PM, Howe, David  wrote:
>
> Hi Erick,
>
> I have the full dump of the Solr index file sizes as well if that is of any 
> help.  I have attached it below this message.
>
> We don't have any deleted docs in our index, as we always build it from a 
> brand new virtual machine with a brand new installation of Solr.
>
> The ordering is definitely making a difference, as I can run the same 
> indexing configuration over a table with the same data just in different 
> orders and it produces these vastly different results.  I have been chasing 
> this for a couple of weeks trying to work out what the difference is when we 
> just add one extra field.  The difference that I have found is that the extra 
> field causes the staging table population query to be optimised differently 
> and to select the records in a different sequence.  When I force the records 
> back to their original sequence, the index goes back to being small again.
>
> I'm currently re-building my staging data to try and get it into the same 
> order as before and including the extra field.  I will post the file sizes 
> again when I have that result.
>
> Regards,
>
> David
>
> total 14600404
> -rw-r--r-- 1 solr solr         97 Feb 14 01:34 _7l.dii
> -rw-r--r-- 1 solr solr   83831801 Feb 14 01:34 _7l.dim
> -rw-r--r-- 1 solr solr 1431645451 Feb 14 01:33 _7l.fdt
> -rw-r--r-- 1 solr solr     381994 Feb 14 01:33 _7l.fdx
> -rw-r--r-- 1 solr solr       6370 Feb 14 01:34 _7l.fnm
> -rw-r--r-- 1 solr solr   29353048 Feb 14 01:34 _7l.nvd
> -rw-r--r-- 1 solr solr        463 Feb 14 01:34 _7l.nvm
> -rw-r--r-- 1 solr solr        606 Feb 14 01:34 _7l.si
> -rw-r--r-- 1 solr solr  734701117 Feb 14 01:34 _7l_Lucene50_0.doc
> -rw-r--r-- 1 solr solr  335043096 Feb 14 01:34 _7l_Lucene50_0.pos
> -rw-r--r-- 1 solr solr   34248274 Feb 14 01:34 _7l_Lucene50_0.tim
> -rw-r--r-- 1 solr solr     624945 Feb 14 01:34 _7l_Lucene50_0.tip
> -rw-r--r-- 1 solr solr  165958502 Feb 14 01:34 _7l_Lucene70_0.dvd
> -rw-r--r-- 1 solr solr       2581 Feb 14 01:34 _7l_Lucene70_0.dvm
> -rw-r--r-- 1 solr solr        405 Feb 14 01:46 _9p.cfe
> -rw-r--r-- 1 solr solr   38776749 Feb 14 01:46 _9p.cfs
> -rw-r--r-- 1 solr solr        452 Feb 14 01:46 _9p.si
> -rw-r--r-- 1 solr solr         97 Feb 14 02:07 _cm.dii
> -rw-r--r-- 1 solr solr   83111509 Feb 14 02:07 _cm.dim
> -rw-r--r-- 1 solr solr 1419981112 Feb 14 02:02 _cm.fdt
> -rw-r--r-- 1 solr solr   

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Howe, David

Hi Erick,

I have the full dump of the Solr index file sizes as well if that is of any 
help.  I have attached it below this message.

We don't have any deleted docs in our index, as we always build it from a brand 
new virtual machine with a brand new installation of Solr.

The ordering is definitely making a difference, as I can run the same indexing 
configuration over a table with the same data just in different orders and it 
produces these vastly different results.  I have been chasing this for a couple 
of weeks trying to work out what the difference is when we just add one extra 
field.  The difference that I have found is that the extra field causes the 
staging table population query to be optimised differently and to select the 
records in a different sequence.  When I force the records back to their 
original sequence, the index goes back to being small again.

I'm currently re-building my staging data to try and get it into the same order 
as before and including the extra field.  I will post the file sizes again when 
I have that result.

Regards,

David

total 14600404
-rw-r--r-- 1 solr solr         97 Feb 14 01:34 _7l.dii
-rw-r--r-- 1 solr solr   83831801 Feb 14 01:34 _7l.dim
-rw-r--r-- 1 solr solr 1431645451 Feb 14 01:33 _7l.fdt
-rw-r--r-- 1 solr solr     381994 Feb 14 01:33 _7l.fdx
-rw-r--r-- 1 solr solr       6370 Feb 14 01:34 _7l.fnm
-rw-r--r-- 1 solr solr   29353048 Feb 14 01:34 _7l.nvd
-rw-r--r-- 1 solr solr        463 Feb 14 01:34 _7l.nvm
-rw-r--r-- 1 solr solr        606 Feb 14 01:34 _7l.si
-rw-r--r-- 1 solr solr  734701117 Feb 14 01:34 _7l_Lucene50_0.doc
-rw-r--r-- 1 solr solr  335043096 Feb 14 01:34 _7l_Lucene50_0.pos
-rw-r--r-- 1 solr solr   34248274 Feb 14 01:34 _7l_Lucene50_0.tim
-rw-r--r-- 1 solr solr     624945 Feb 14 01:34 _7l_Lucene50_0.tip
-rw-r--r-- 1 solr solr  165958502 Feb 14 01:34 _7l_Lucene70_0.dvd
-rw-r--r-- 1 solr solr       2581 Feb 14 01:34 _7l_Lucene70_0.dvm
-rw-r--r-- 1 solr solr        405 Feb 14 01:46 _9p.cfe
-rw-r--r-- 1 solr solr   38776749 Feb 14 01:46 _9p.cfs
-rw-r--r-- 1 solr solr        452 Feb 14 01:46 _9p.si
-rw-r--r-- 1 solr solr         97 Feb 14 02:07 _cm.dii
-rw-r--r-- 1 solr solr   83111509 Feb 14 02:07 _cm.dim
-rw-r--r-- 1 solr solr 1419981112 Feb 14 02:02 _cm.fdt
-rw-r--r-- 1 solr solr     379544 Feb 14 02:02 _cm.fdx
-rw-r--r-- 1 solr solr       6370 Feb 14 02:07 _cm.fnm
-rw-r--r-- 1 solr solr   29049434 Feb 14 02:07 _cm.nvd
-rw-r--r-- 1 solr solr        463 Feb 14 02:07 _cm.nvm
-rw-r--r-- 1 solr solr        606 Feb 14 02:07 _cm.si
-rw-r--r-- 1 solr solr  728509370 Feb 14 02:07 _cm_Lucene50_0.doc
-rw-r--r-- 1 solr solr  332343997 Feb 14 02:07 _cm_Lucene50_0.pos
-rw-r--r-- 1 solr solr   34361884 Feb 14 02:07 _cm_Lucene50_0.tim
-rw-r--r-- 1 solr solr     658404 Feb 14 02:07 _cm_Lucene50_0.tip
-rw-r--r-- 1 solr solr  164612509 Feb 14 02:07 _cm_Lucene70_0.dvd
-rw-r--r-- 1 solr solr       2581 Feb 14 02:07 _cm_Lucene70_0.dvm
-rw-r--r-- 1 solr solr        405 Feb 14 02:09 _fb.cfe
-rw-r--r-- 1 solr solr   44333425 Feb 14 02:09 _fb.cfs
-rw-r--r-- 1 solr solr        452 Feb 14 02:09 _fb.si
-rw-r--r-- 1 solr solr         97 Feb 14 02:24 _h2.dii
-rw-r--r-- 1 solr solr   77079684 Feb 14 02:24 _h2.dim
-rw-r--r-- 1 solr solr 1304390074 Feb 14 02:22 _h2.fdt
-rw-r--r-- 1 solr solr     347494 Feb 14 02:22 _h2.fdx
-rw-r--r-- 1 solr solr       6370 Feb 14 02:24 _h2.fnm
-rw-r--r-- 1 solr solr   26756876 Feb 14 02:24 _h2.nvd
-rw-r--r-- 1 solr solr        463 Feb 14 02:24 _h2.nvm
-rw-r--r-- 1 solr solr        606 Feb 14 02:24 _h2.si
-rw-r--r-- 1 solr solr  669875920 Feb 14 02:24 _h2_Lucene50_0.doc
-rw-r--r-- 1 solr solr  305954906 Feb 14 02:24 _h2_Lucene50_0.pos
-rw-r--r-- 1 solr solr   32019733 Feb 14 02:24 _h2_Lucene50_0.tim
-rw-r--r-- 1 solr solr     619562 Feb 14 02:24 _h2_Lucene50_0.tip
-rw-r--r-- 1 solr solr  151772808 Feb 14 02:24 _h2_Lucene70_0.dvd
-rw-r--r-- 1 solr solr       2497 Feb 14 02:24 _h2_Lucene70_0.dvm
-rw-r--r-- 1 solr solr        405 Feb 14 02:45 _mx.cfe
-rw-r--r-- 1 solr solr  277937779 Feb 14 02:45 _mx.cfs
-rw-r--r-- 1 solr solr        452 Feb 14 02:45 _mx.si
-rw-r--r-- 1 solr solr         97 Feb 14 02:47 _n9.dii
-rw-r--r-- 1 solr solr   82335510 Feb 14 02:47 _n9.dim
-rw-r--r-- 1 solr solr 1400595065 Feb 14 02:46 _n9.fdt
-rw-r--r-- 1 solr solr     374259 Feb 14 02:46 _n9.fdx
-rw-r--r-- 1 solr solr       6370 Feb 14 02:47 _n9.fnm
-rw-r--r-- 1 solr solr   28775974 Feb 14 02:47 _n9.nvd
-rw-r--r-- 1 solr solr        463 Feb 14 02:47 _n9.nvm
-rw-r--r-- 1 solr solr        606 Feb 14 02:47 _n9.si
-rw-r--r-- 1 solr solr  719183309 Feb 14 02:46 _n9_Lucene50_0.doc
-rw-r--r-- 1 solr solr  328214265 Feb 14 02:46 _n9_Lucene50_0.pos
-rw-r--r-- 1 solr solr   34098919 Feb 14 02:46 _n9_Lucene50_0.tim
-rw-r--r-- 1 solr solr     654313 Feb 14 02:46 _n9_Lucene50_0.tip
-rw-r--r-- 1 solr solr  163220960 Feb 14 02:46 _n9_Lucene70_0.dvd
-rw-r--r-- 1 solr solr       2560 Feb 14 02:46 _n9_Lucene70_0.dvm
-rw-r--r-- 1 solr solr        405 Feb 14 02:52 _ns.cfe

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Erick Erickson
David:

Rats, the cfs files make everything I'd hoped to understand with the
sizes ambiguous, since they conceal the underlying sizes of each of the
other extensions. We can approach it a bit differently though. Take one
segment that's _not_ in cfs format where the total size of all files
making up that segment is near 5GB (the default max segment size) and
compare the individual extensions for that segment only. What I'm hoping
to find out, of course, is which extensions vary dramatically. But
let's assume for the nonce that the numbers you already have are
comparable if we ignore the .cfs files.

.doc  1094.68  2767.53 - term frequencies
.fdt  1633.21  5387.92 - stored data
.pos   809.23  1272.70 - position information

So the file difference (if borne out) indicates the following

- doc you have more documents or more terms or different options on
your terms [1]
- fdt you're storing more fields than you used to. [1]
- pos you have more docs or more terms or have position information
turned on where you didn't before. [1]

[1] or lots of deleted docs that haven't been merged away. This
information should be on the admin page for any particular core. I
think this unlikely, but who knows? NOTE, just because you get 14M from
querying *:* does _not_ say anything about the deleted docs, which
take up space. This is highly unlikely to be your problem, but let's
eliminate the easy stuff ;)
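
To put numbers on "vary dramatically", the growth factor per extension can be computed from the figures quoted above. A small sketch using only those three rows (MB before and after the extra field was added):

```python
# (baseline MB, MB with the extra field), copied from the figures in this thread.
sizes_mb = {
    ".doc": (1094.68, 2767.53),
    ".fdt": (1633.21, 5387.92),
    ".pos": (809.23, 1272.70),
}


def growth_ratios(sizes):
    """Return the after/before size ratio for each extension."""
    return {ext: after / before for ext, (before, after) in sizes.items()}


# Print the extensions with the largest growth first.
for ext, ratio in sorted(growth_ratios(sizes_mb).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{ext}: x{ratio:.2f}")
```

On these figures the stored data (.fdt) grows the most, which is the opposite of what one tiny stored field would suggest.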

Where I'd go from here after checking that these ratios are true for a
single like-sized segment in both cases

1> the LukeRequestHandler can tell you information about exactly how
the index is defined, and using Luke itself can provide you a much
more detailed look at what's actually _in_ your index. You could also
have Luke reconstruct the same doc from your index in each case and
compare. Perhaps your SQL is doing something really unexpected. This
_should_ show you the realized meta-data for each field and let you
pinpoint any different options that have been enabled.

2> compare your Oracle intermediate tables, are they _really_
identical? The ordering shouldn't make any difference at all to Solr
assuming the same docs are being indexed (plus any expected delta).
There's an edge case I can imagine if you hit a "perfect storm" and
one version has a lot more deleted docs than the other that's possibly
the result of reordering, but that's unlikely. The edge case I'm
imagining would be easily verifiable by the two versions having a
radically different number of deleted docs

Best,
Erick




On Thu, Feb 15, 2018 at 7:13 AM, Pratik Patel  wrote:
> @Alessandro I will see if I can reproduce the same issue just by turning
> off omitNorms on field type. I'll open another mail thread if required.
> Thanks.
>
> On Thu, Feb 15, 2018 at 6:12 AM, Howe, David 
> wrote:
>
>>
>> Hi Alessandro,
>>
>> Some interesting testing today that seems to have gotten me closer to what
>> the issue is.  When I run the version of the index that is working
>> correctly against my database table that has the extra field in it, the
>> index suddenly increases in size.  This is even though the data importer is
>> running the same SELECT as before (which doesn't include the extra column)
>> and loads the same number of rows.
>>
>> After scratching my head for a bit and browsing through both versions of
>> the table I am loading from (with and without the extra field), I noticed
>> that the natural ordering of the tables is different.  These tables are
>> "staging" tables that I populate with another set of queries and inserts to
>> get the data into a format that is easy to ingest into Solr.  When I add
>> the extra field to these queries, it changes the Oracle query plan as the
>> field is contained in a different table that I need to join to.  As I don't
>> specify an "ORDER BY" on the query (as I didn't think it would make a
>> difference and would slow the query down), Oracle is free to choose how it
>> orders the result set.  Adding the extra field changes that natural
>> ordering, which affects the order things go into my staging table.  As I
>> don't specify an "ORDER BY" when I select things out of the staging table,
>> my data in the scenario that is working is being loaded in a different
>> order to the scenario which doesn't work.
>>
>> I am currently running full loads to verify this under each scenario, as I
>> have now forced the data in the scenario that doesn't work to be in the
>> same order as the scenario that does.  Will see how this load goes
>> overnight.
>>
>> This leads to the question of what difference does it make to Solr what
>> order I load the data in?
>>
>> I also noticed that the .cfs file is quite large in the second scenario,
>> even though this is supposed to be disabled by default in Solr.  I checked
>> my Solr config and there is no override of the default.
>>
>> In answer to your questions:
>>
>> 1) same number of documents - YES ~14,000,000 

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Pratik Patel
@Alessandro I will see if I can reproduce the same issue just by turning
off omitNorms on field type. I'll open another mail thread if required.
Thanks.

On Thu, Feb 15, 2018 at 6:12 AM, Howe, David 
wrote:

>
> Hi Alessandro,
>
> Some interesting testing today that seems to have gotten me closer to what
> the issue is.  When I run the version of the index that is working
> correctly against my database table that has the extra field in it, the
> index suddenly increases in size.  This is even though the data importer is
> running the same SELECT as before (which doesn't include the extra column)
> and loads the same number of rows.
>
> After scratching my head for a bit and browsing through both versions of
> the table I am loading from (with and without the extra field), I noticed
> that the natural ordering of the tables is different.  These tables are
> "staging" tables that I populate with another set of queries and inserts to
> get the data into a format that is easy to ingest into Solr.  When I add
> the extra field to these queries, it changes the Oracle query plan as the
> field is contained in a different table that I need to join to.  As I don't
> specify an "ORDER BY" on the query (as I didn't think it would make a
> difference and would slow the query down), Oracle is free to choose how it
> orders the result set.  Adding the extra field changes that natural
> ordering, which affects the order things go into my staging table.  As I
> don't specify an "ORDER BY" when I select things out of the staging table,
> my data in the scenario that is working is being loaded in a different
> order to the scenario which doesn't work.
>
> I am currently running full loads to verify this under each scenario, as I
> have now forced the data in the scenario that doesn't work to be in the
> same order as the scenario that does.  Will see how this load goes
> overnight.
>
> This leads to the question of what difference does it make to Solr what
> order I load the data in?
>
> I also noticed that the .cfs file is quite large in the second scenario,
> even though this is supposed to be disabled by default in Solr.  I checked
> my Solr config and there is no override of the default.
>
> In answer to your questions:
>
> 1) same number of documents - YES ~14,000,000 documents
> 2) identical documents ( + 1 new field each not indexed) - YES, the second
> scenario has one extra field that is stored but not indexed
> 3) same number of deleted documents - YES, there are zero deleted
> documents in both scenarios
> 4) they both were born from scratch ( an empty index) - YES, both start
> from a brand new virtual server with a brand new installation of Solr
>
> I am using the default auto commit, which I think is 15000.
>
> Thanks again for your assistance.
>
> Regards,
>
> David
>
> David Howe
> Java Domain Architect
> Postal Systems
> Level 16, 111 Bourke Street Melbourne VIC 3000
>
> T  0391067904
>
> M  0424036591
>
> E  david.h...@auspost.com.au
>
> W  auspost.com.au
> W  startrack.com.au
>
> Australia Post is committed to providing our customers with excellent
> service. If we can assist you in any way please telephone 13 13 18 or visit
> our website.
>
> The information contained in this email communication may be proprietary,
> confidential or legally professionally privileged. It is intended
> exclusively for the individual or entity to which it is addressed. You
> should only read, disclose, re-transmit, copy, distribute, act in reliance
> on or commercialise the information if you are authorised to do so.
> Australia Post does not represent, warrant or guarantee that the integrity
> of this email communication has been maintained nor that the communication
> is free of errors, virus or interference.
>
> If you are not the addressee or intended recipient please notify us by
> replying direct to the sender and then destroy any electronic or paper copy
> of this message. Any views expressed in this email communication are taken
> to be those of the individual sender, except where the sender specifically
> attributes those views to Australia Post and is authorised to do so.
>
> Please consider the environment before printing this email.
>


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Howe, David

Hi Alessandro,

Some interesting testing today that seems to have gotten me closer to what the 
issue is.  When I run the version of the index that is working correctly 
against my database table that has the extra field in it, the index suddenly 
increases in size.  This is even though the data importer is running the same 
SELECT as before (which doesn't include the extra column) and loads the same 
number of rows.

After scratching my head for a bit and browsing through both versions of the 
table I am loading from (with and without the extra field), I noticed that the 
natural ordering of the tables is different.  These tables are "staging" tables 
that I populate with another set of queries and inserts to get the data into a 
format that is easy to ingest into Solr.  When I add the extra field to these 
queries, it changes the Oracle query plan as the field is contained in a 
different table that I need to join to.  As I don't specify an "ORDER BY" on 
the query (as I didn't think it would make a difference and would slow the 
query down), Oracle is free to choose how it orders the result set.  Adding the 
extra field changes that natural ordering, which affects the order things go 
into my staging table.  As I don't specify an "ORDER BY" when I select things 
out of the staging table, my data in the scenario that is working is being 
loaded in a different order to the scenario which doesn't work.
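
The effect of pinning the order can be shown in miniature. The sketch below uses SQLite from the Python standard library as a stand-in for Oracle, and the table and column names (staging, id, address) are made up for illustration. Without ORDER BY the database may return rows in whatever order its plan produces; with ORDER BY the load sequence is the same however the staging table was populated.

```python
import sqlite3


def load_order(rows, order_by=None):
    """Insert rows into a throwaway staging table and return the ids in SELECT order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (id INTEGER, address TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    sql = "SELECT id FROM staging"
    if order_by:
        # order_by is a trusted column name here; never build SQL from user input like this.
        sql += f" ORDER BY {order_by}"
    ids = [row[0] for row in conn.execute(sql)]
    conn.close()
    return ids


rows = [(1, "1 Main St"), (2, "2 Main St"), (3, "3 Main St")]
# The same data inserted in two different sequences loads identically once ordered.
print(load_order(rows, "id") == load_order(list(reversed(rows)), "id"))
```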

I am currently running full loads to verify this under each scenario, as I have 
now forced the data in the scenario that doesn't work to be in the same order 
as the scenario that does.  Will see how this load goes overnight.

This leads to the question of what difference does it make to Solr what order I 
load the data in?

I also noticed that the .cfs file is quite large in the second scenario, even 
though this is supposed to be disabled by default in Solr.  I checked my Solr 
config and there is no override of the default.

In answer to your questions:

1) same number of documents - YES ~14,000,000 documents
2) identical documents ( + 1 new field each not indexed) - YES, the second 
scenario has one extra field that is stored but not indexed
3) same number of deleted documents - YES, there are zero deleted documents in 
both scenarios
4) they both were born from scratch ( an empty index) - YES, both start from a 
brand new virtual server with a brand new installation of Solr

I am using the default auto commit, which I think is 15000.

Thanks again for your assistance.

Regards,

David

David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au



RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Alessandro Benedetti
@Pratik: you should have investigated. I understand that it solved your issue,
but even if you needed norms, it doesn't make sense that they would cause your
index to grow by a factor of 30. You must have hit a nasty bug if it was just
the norms.

@Howe : 

*Compound File*  .cfs, .cfe  An optional "virtual" file consisting of all the
other index files for systems that frequently run out of file handles.

*Frequencies*  .doc  Contains the list of docs which contain each term along
with frequency

*Field Data*  .fdt  The stored fields for documents

*Positions*  .pos  Stores position information about where a term occurs in
the index

*Term Index*  .tip  The index into the Term Dictionary

So, David, can you confirm that those two indexes have:

1) same number of documents
2) identical documents ( + 1 new field each not indexed)
3) same number of deleted documents
4) they both were born from scratch ( an empty index)

The matter is still suspicious:
- The .cfs files seem to point to some sort of malfunction during
indexing/committing in relation to the OS. Which way of committing
were you using?

- .doc, .pos, .tip -> these shouldn't change. Assuming both indexes are
optimised and you are adding a non-indexed field, those data structures
shouldn't be affected.

- the stored content as well: too much of an increase

Can you send us the full configuration for the new field?
You don't want norms, positions and frequencies for it.
But if they are the issue, you may have found a real edge case,
because even with all of them enabled you shouldn't incur such a penalty for
just one additional tiny field.



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Howe, David

I have re-run both scenarios and captured the total size of each type of index 
file.  The MB (1) column is for the baseline scenario which has the smaller 
index and acceptable performance.  The MB(2) column is after I have added the 
extra field to the index.

Ext      MB (1)     MB (2)
.cfe       0.00       0.01
.cfs     335.01    3612.09
.dii       0.00       0.00
.dim     324.38     319.07
.doc    1094.68    2767.53
.dvd    1211.84     625.44
.dvm       0.14       0.08
.fdt    1633.21    5387.92
.fdx       2.12       1.44
.fnm       0.11       0.12
.loc       0.00       0.00
.nvd     127.84     110.67
.nvm       0.01       0.01
.pos     809.23    1272.70
.si        0.02       0.03
.tim     137.94     156.82
.tip       2.52       3.04
Total   5679.06   14256.98
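
A table like this can be produced with a few lines of script. The following is a minimal sketch, not the script used for the numbers above: it assumes a flat index directory, and the path passed to size_by_extension is whatever your core's data/index directory happens to be.

```python
import os
from collections import defaultdict


def size_by_extension(index_dir):
    """Sum file sizes in index_dir per extension and return the totals in MB."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if os.path.isfile(path):
            # Files without an extension (e.g. segments_N) are keyed by their full name.
            ext = os.path.splitext(name)[1] or name
            totals[ext] += os.path.getsize(path)
    return {ext: size / (1024 * 1024) for ext, size in totals.items()}
```

Running it against both index builds and diffing the two dictionaries gives the per-extension comparison directly.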


David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au

-Original Message-
From: Howe, David [mailto:david.h...@auspost.com.au]
Sent: Wednesday, 14 February 2018 12:49 PM
To: solr-user@lucene.apache.org
Subject: RE: Index size increases disproportionately to size of added field 
when indexed=false


I have set docValues=false on all of the string fields in our index that have 
indexed=false and stored=true.  This gave a small improvement in the index size 
from 13.3GB to 12.82GB.

I have also tried running an optimize, which then reduced the index to 12.6GB.

Next step is to dump the sizes of the Solr index files for the index version 
that is the correct size and the version that has the large size.

Regards,

David


David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au

-Original Message-
From: Howe, David [mailto:david.h...@auspost.com.au]
Sent: Wednesday, 14 February 2018 7:26 AM
To: solr-user@lucene.apache.org
Subject: RE: Index size increases disproportionately to size of added field 
when indexed=false


Thanks Hoss.  I will try setting docValues to false, as we only ever want to be 
able to retrieve the value of this field.

Regards,

David

David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Pratik Patel
You are right, in my case this field type was applied to many text fields.
These include many copy fields and dynamic fields as well. In my case,
just specifying omitNorms=true for the field type "text_general" fixed the
issue. I didn't change anything else, and I didn't hit any other bug.

On Wed, Feb 14, 2018 at 1:01 PM, Alessandro Benedetti 
wrote:

> Hi pratik,
> how is it possible that just the norms for a single field were causing such
> a massive index size increment in your case ?
>
> In your case I think it was for a field type used by multiple fields, but
> it's still suspicious in my opinion;
> norms shouldn't be that big.
> If I remember correctly, in old versions of Solr, before index-time
> boosts were dropped, norms contained both an approximation of the length of
> the field and the index-time boost.
> From your mailing list thread, you went from 10 GB to 300 GB.
> It can't be just the norms. Are you sure you didn't hit some bug?
>
> Regards
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Alessandro Benedetti
Hi pratik,
how is it possible that just the norms for a single field were causing such
a massive index size increment in your case ?

In your case I think it was for a field type used by multiple fields, but
it's still suspicious in my opinion;
norms shouldn't be that big.
If I remember correctly, in old versions of Solr, before index-time
boosts were dropped, norms contained both an approximation of the length of
the field and the index-time boost.
From your mailing list thread, you went from 10 GB to 300 GB.
It can't be just the norms. Are you sure you didn't hit some bug?

Regards





Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Erick Erickson
Pratik may have jumped right to the difference. We'd have gotten there
eventually by looking at file extensions, but just checking his
recommendation would be the first thing to do!

bq:  what would be the right scenarios to use docvalues='true'?

Whenever you want to facet, group or sort on the field. This _will_
increase the index size on disk, but it's almost always a good
tradeoff, here's why:

To facet, group or sort you need to "uninvert" the field. If you have
docValues=false, this uninversion is done at run-time into Java's heap.
If you have docValues=true, the uninversion is done at _index_ time
and the result stored on disk. Now when it's required, it can be
loaded in from disk efficiently (essentially de-serialized) and is
held in OS memory rather than the Java heap, thanks to MMapDirectory, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

bq:  In what situation would it make sense to have indexed=false and
docValues=true?

When you want to return _only_ fields that have docValues=true. If you
return fields with stored=true and docValues=false, Solr/Lucene has to
1> read the stored values from disk (minimum 16K block)
2> decompress it
3> extract the field

With docValues, since they're only simple field types, all that you
have to do is read the value from the docValues structure, which is much
more efficient. HOWEVER, there are two caveats:
1> The entire docValues field will be MMapped, so there's a time/space tradeoff.
2> docValues are stored in a sorted_set. This is relevant for
multiValued field because:
2a> values are returned in sorted order, not the order they were in the document
2b> identical values are collapsed.

So if the input values for a particular doc were 4, 3, 6, 4, 5, 2, 6,
5, 6, 5, 4, 3, 2 you'd get back 2, 3, 4, 5, 6
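
To make caveat 2 concrete, here is a minimal Python sketch of the effect.
It only mimics what comes back for a multiValued docValues field (sorted,
with duplicates collapsed); it says nothing about how Lucene stores the
values internally, and the function name is made up for illustration.

```python
def docvalues_view(values):
    """Mimic what a multiValued docValues field returns:
    values in sorted order with identical values collapsed."""
    return sorted(set(values))


# The input values from the example above, in document order.
original = [4, 3, 6, 4, 5, 2, 6, 5, 6, 5, 4, 3, 2]
print(docvalues_view(original))  # [2, 3, 4, 5, 6]
```

A stored=true field, by contrast, would hand back the original list
unchanged, which is why the two are not interchangeable if order or
duplicates matter to you.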

If you can live with those caveats, then returning field values would
involve much less work (both I/O and CPU), especially in
high-throughput situations. NOTE: there are a couple of JIRAs IIRC
that have to do with not storing the  though.

Best,
Erick

On Wed, Feb 14, 2018 at 7:01 AM, Pratik Patel <pra...@semandex.net> wrote:
> I had a similar issue with index size after upgrading to version 6.4.1 from
> 5.x. The issue for me was that the field which caused index size to be
> increased disproportionately had a field type("text_general") for which
> default value of omitNorms was not true. Turning it on explicitly on field
> fixed the problem. Following is the link to my related question.  You can
> verify value of omitNorms for your fields to check whether this is
> applicable in your case or not.
> http://search-lucene.com/m/Solr/eHNlagIB7209f1w1?subj=Fwd+Solr+dynamic+field+blowing+up+the+index+size
>
> On Tue, Feb 13, 2018 at 8:48 PM, Howe, David <david.h...@auspost.com.au>
> wrote:
>
>>
>> I have set docValues=false on all of the string fields in our index that
>> have indexed=false and stored=true.  This gave a small improvement in the
>> index size from 13.3GB to 12.82GB.
>>
>> I have also tried running an optimize, which then reduced the index to
>> 12.6GB.
>>
>> Next step is to dump the sizes of the Solr index files for the index
>> version that is the correct size and the version that has the large size.
>>
>> Regards,
>>
>> David
>>

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Pratik Patel
I had a similar issue with index size after upgrading to version 6.4.1 from
5.x. In my case, the field that caused the index size to increase
disproportionately had a field type ("text_general") for which the
default value of omitNorms was not true. Setting omitNorms=true explicitly
on the field fixed the problem. The link to my related question follows.
You can check the value of omitNorms for your fields to see whether this
applies in your case.
http://search-lucene.com/m/Solr/eHNlagIB7209f1w1?subj=Fwd+Solr+dynamic+field+blowing+up+the+index+size
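
One way to do that check in bulk is to fetch the field list with
showDefaults=true and look for fields that still keep norms. A rough
Python sketch, assuming you have already parsed the "fields" list from a
Schema API response into a list of dicts; the function name and the
sample data are made up for illustration.

```python
def fields_keeping_norms(fields):
    """Return the names of fields whose effective omitNorms is not true.

    `fields` is assumed to be a list of dicts shaped like the "field"
    objects the Schema API returns when showDefaults=true is set.
    """
    return [f["name"] for f in fields if not f.get("omitNorms", False)]


# Hypothetical sample data, not taken from a real schema.
sample = [
    {"name": "id", "type": "string", "omitNorms": True},
    {"name": "description", "type": "text_general", "omitNorms": False},
]
print(fields_keeping_norms(sample))  # ['description']
```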

On Tue, Feb 13, 2018 at 8:48 PM, Howe, David <david.h...@auspost.com.au>
wrote:

>
> I have set docValues=false on all of the string fields in our index that
> have indexed=false and stored=true.  This gave a small improvement in the
> index size from 13.3GB to 12.82GB.
>
> I have also tried running an optimize, which then reduced the index to
> 12.6GB.
>
> Next step is to dump the sizes of the Solr index files for the index
> version that is the correct size and the version that has the large size.
>
> Regards,
>
> David
>


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David

I have set docValues=false on all of the string fields in our index that have 
indexed=false and stored=true.  This gave a small improvement in the index size 
from 13.3GB to 12.82GB.

I have also tried running an optimize, which then reduced the index to 12.6GB.

Next step is to dump the sizes of the Solr index files for the index version 
that is the correct size and the version that has the large size.

Regards,

David




RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David

Thanks Hoss.  I will try setting docValues to false, as we only ever want to be 
able to retrieve the value of this field.

Regards,

David



RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David

Hi Erick,

Thanks for responding.  You are correct that we don't have any deleted docs.  
When we want to re-index (once a fortnight), we build a brand new installation 
of Solr from scratch and re-import the new data into an empty index.

I will try setting docValues to false and see if that makes a difference.  It 
sounds like we shouldn't have it on anyway, as we only ever want to be able to 
retrieve this field.  In what situation would it make sense to have 
indexed=false and docValues=true?

I will re-index and get a sizing for all of the different file extensions both 
with and without the problematic field.

Regards,

David



RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David

Hi Alessandro,

The docker image is like a disk image of the entire server, so it includes the 
operating system, the Solr installation and the data.  Because we run in the 
cloud and our index isn't that big, this is an easy and fast way for us to 
scale our Solr cluster without having to configure Solr clusters, replication 
etc.  When we create a new server and "run" the docker image, the server comes 
up all ready to go, with Solr installed and the data already in the index.

I will check out the different file extensions and how much space each is using.

Thanks,

David



Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread David Hastings
To piggyback on this, what would be the right scenarios to use
docValues="true"?



Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Chris Hostetter

: We are using Solr 7.1.0 to index a database of addresses.  We have found 
: that our index size increases massively when we add one extra field to 
: the index, even though that field is stored and not indexed, and doesn’t 

what about docValues?

: When we run an index load without the problematic field present, the 
: Solr index size is 5.5GB.  When we add the field into the index, the 
: size grows to 13.3GB.  The field itself is a maximum of 46 characters in 
: length and on average is 19 characters. We have ~14,000,000 rows in 
: total to index of which only ~200,000 have this field present at all 
: (i.e. not null in database).  Given that we don’t want to index the 
: field, only store it I would have thought (perhaps naively) that the 
: storage increase would be approximately 200,000 * 19 = 3.8M bytes = 
: 3.6MB rather than the 7.5GB we are seeing.

If the field has docValues enabled, then there will be some overhead for 
every doc in the index -- even the ones that don't have a value in this 
field.  (Although I'd still be very surprised if it accounted for 7G.)

: - The problematic field is created through the API as follows:
: 
:   curl -X POST -H 'Content-type:application/json' --data-binary '{
:     "add-field":{
:       "name":"buildingName",
:       "type":"string",
:       "stored":true,
:       "indexed":false
:     }
:   }' http://localhost:8983/solr/address/schema

...that's going to cause the field to inherit any (non-overridden) 
settings from the fieldType "string" -- in the 7.1 _default configset, 
"string" is defined with docValues="true"

You can see *all* properties set on a field -- regardless of whether they 
are set on the fieldType, or are implicit hardcoded defaults in the 
implementation of the fieldType -- via the 'showDefaults=true' Schema API 
option.
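
If you want to script that check, something along these lines should
work. This is only a sketch: it assumes a Solr instance at
localhost:8983 as in the curl command above, reuses the "address"
collection and "buildingName" field from the original post, and the
helper names are made up. The second helper builds a Schema API
"replace-field" body that sets docValues off explicitly instead of
inheriting it from the fieldType; you would still need to POST it to
the collection's /schema endpoint and re-index.

```python
import json
import urllib.request

SOLR = "http://localhost:8983/solr"  # assumption: local Solr, as in the curl examples


def field_url(collection, field, show_defaults=True):
    """Build the Schema API URL for a single field definition."""
    url = f"{SOLR}/{collection}/schema/fields/{field}"
    return url + "?showDefaults=true" if show_defaults else url


def fetch_field(collection, field):
    """Fetch the field definition, including inherited and implicit defaults.
    Needs a running Solr, so it is not called in this sketch."""
    with urllib.request.urlopen(field_url(collection, field)) as resp:
        return json.load(resp)["field"]


def replace_field_payload(name, field_type="string"):
    """JSON body for a 'replace-field' command that is stored-only:
    not indexed, and with docValues switched off explicitly."""
    return json.dumps({
        "replace-field": {
            "name": name,
            "type": field_type,
            "stored": True,
            "indexed": False,
            "docValues": False,
        }
    })


print(field_url("address", "buildingName"))
print(replace_field_payload("buildingName"))
```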

Consider these API examples from the techproducts demo...

$ curl 'http://localhost:8983/solr/techproducts/schema/fields/cat'
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"cat",
    "type":"string",
    "multiValued":true,
    "indexed":true,
    "stored":true}}

$ curl 'http://localhost:8983/solr/techproducts/schema/fields/cat?showDefaults=true'
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"cat",
    "type":"string",
    "indexed":true,
    "stored":true,
    "docValues":false,
    "termVectors":false,
    "termPositions":false,
    "termOffsets":false,
    "termPayloads":false,
    "omitNorms":true,
    "omitTermFreqAndPositions":true,
    "omitPositions":false,
    "storeOffsetsWithPositions":false,
    "multiValued":true,
    "large":false,
    "sortMissingLast":true,
    "required":false,
    "tokenized":false,
    "useDocValuesAsStored":true}}







-Hoss
http://www.lucidworks.com/

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Erick Erickson
David:

Right, Optimize Is Evil. Well, actually in your case it's not. In your
specific case you can optimize every time you build your index and be
OK, gory details here:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

But that's just for background. The key is how many deleted docs you
have, which you can see from the admin UI screen. If you have 0
deleted docs, you have 0 space that would be reclaimed by an optimize.
My bet is that you have no deleted docs, if so just forget the whole
optimize question as it's a red herring.
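
If you'd rather check this from a script than from the admin UI, the
Luke request handler (/admin/luke) reports numDocs and maxDoc for a
core. A rough Python sketch, assuming you have already parsed the
"index" section of that response into a dict; the function name is
made up for illustration.

```python
def deleted_fraction(index_info):
    """Fraction of documents in the index that are marked deleted.

    `index_info` is assumed to hold the numDocs and maxDoc values
    reported for the core. 0.0 means an optimize has nothing to reclaim.
    """
    max_doc = index_info["maxDoc"]
    num_docs = index_info["numDocs"]
    if max_doc == 0:
        return 0.0
    return (max_doc - num_docs) / max_doc


# A freshly built index with ~14M docs and no deletes.
print(deleted_fraction({"numDocs": 14_000_000, "maxDoc": 14_000_000}))  # 0.0
```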

"...storage increase would be approximately 200,000 * 19 = 3.8M bytes
= 3.6MB rather than the 7.5GB..."

Actually I'd expect it to only be half that  (1.9M). Stored fields are
compressed on disk and we usually see about a 2:1 compression ratio.
There'll be a little bit of fudge for metadata, but not enough to
measure probably.
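
For what it's worth, here is that back-of-the-envelope estimate as a
few lines of Python. It assumes one byte per character and the rough
2:1 compression ratio mentioned above; both are approximations, and
the function name is made up.

```python
def estimate_stored_bytes(docs_with_value, avg_chars, compression_ratio=2.0):
    """Return (raw_bytes, compressed_bytes) for a stored-only field.

    Assumes one byte per character and ignores per-document metadata.
    """
    raw = docs_with_value * avg_chars
    return raw, raw / compression_ratio


raw, compressed = estimate_stored_bytes(200_000, 19)
print(raw)                        # 3800000
print(round(raw / 1024**2, 2))    # 3.62
print(compressed)                 # 1900000.0
```

Either way the expected growth is a few megabytes, three orders of
magnitude below what David is seeing.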

So yes, this is totally weird. I think you'll also find that docValues
is set to true by default. This _still_ shouldn't be adding that much
to this index, but if you turn docValues off for that field what
happens?

Stored data is held in your *.fdt and *.fdx files. What's the total
space used in your index by these two extensions with and
without your field?

*.dvd files contain the docValues data. Again, what's the before/after
size of all these files with and without your field?

These are two specific places to look, but in general I'm asking what
the total size is by extension in your index directory with and
without your field. My guess is that one extension will be massively
bigger. That would still be totally surprising, but it'd give us a
clue where to look.

Here are the file extensions and what they contain BTW:
https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/codecs/lucene70/package-summary.html
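
A quick way to get those per-extension totals is a few lines of
Python. This is just a sketch: it assumes all segment files sit
directly in one index directory (typically something like
<core>/data/index, but check your own install), and the function
names are made up.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(index_dir):
    """Sum file sizes in bytes per extension for one index directory.

    Files without an extension (e.g. segments_N) are keyed by full name.
    """
    totals = defaultdict(int)
    for path in Path(index_dir).iterdir():
        if path.is_file():
            totals[path.suffix or path.name] += path.stat().st_size
    return dict(totals)


def report(index_dir):
    """Print extensions from largest to smallest total size."""
    sizes = size_by_extension(index_dir)
    for ext, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{ext:>14} {size / 1024**2:12.1f} MB")


# Usage (the path is a placeholder, substitute your own):
# report("/path/to/core/data/index")
```

Run it once against the index built without the field and once against
the index built with it, then compare the two listings line by line.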

Best,
Erick

On Tue, Feb 13, 2018 at 3:41 AM, Alessandro Benedetti
 wrote:
> Hi David,
> given the fact that you are actually building a new index from scratch, my
> shot in the dark didn't hit any target.
> When you say  : "Once the import finishes we save the docker image in the
> AWS docker repository.  We then build our cluster using that image as the
> base"
>
> Do you mean just configuraiton wise ?
> Will the new cluster have any starting index on disk?
> If i understood correctly your latest statements I expect a NO in here.
>
> So you are building a completely new index and comparing to the old index (
> which is completely separate) you denote such a big difference in size.
> This is extremely suspicious .
> Optimizing in the end is just a huge merge to force 1 ( or N) final
> segments.
> Given the additional information you gave me, it's not going to make much
> difference.
>
> I would recommend to check how the index space is divided in different file
> formats [1]
> ( i.e. list how much space is dedicated to a specific extension)
>
> Stored content is in the .fdt files.
>
>
> [1]
> https://lucene.apache.org/core/6_4_0/core/org/apache/lucene/codecs/lucene62/package-summary.html#file-names
>


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Alessandro Benedetti
Hi David,
given that you are actually building a new index from scratch, my
shot in the dark didn't hit any target.
When you say: "Once the import finishes we save the docker image in the
AWS docker repository.  We then build our cluster using that image as the
base"

Do you mean just configuration-wise?
Will the new cluster have any starting index on disk?
If I understood your latest statements correctly, I expect a NO here.

So you are building a completely new index, and comparing it to the old
index (which is completely separate) you see such a big difference in size.
This is extremely suspicious.
Optimizing in the end is just a huge merge to force 1 (or N) final
segments.
Given the additional information you gave me, it's not going to make much
difference.

I would recommend checking how the index space is divided among the
different file formats [1]
(i.e. list how much space is dedicated to each extension).

Stored content is in the .fdt files.


[1]
https://lucene.apache.org/core/6_4_0/core/org/apache/lucene/codecs/lucene62/package-summary.html#file-names





RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David

Hi Alessandro,

Thanks for responding.  We rebuild the index every time starting from a fresh 
installation of Solr.  Because we are running at AWS, we have automated our 
deployment so we start with the base docker image, configure Solr and then 
import our data every time the data changes (it only changes once a fortnight). 
 Once the import finishes we save the docker image in the AWS docker 
repository.  We then build our cluster using that image as the base.  So we 
never re-index an existing index, we just build another one from scratch.

We haven't configured anything special for segments and merges.

When I look in the console, the index is shown as being optimized.  There 
doesn't seem to be an option in the console anymore to optimize an index.  If I 
have only ever inserted new documents, should I need to optimize?  I will try 
an optimize when I am back in the office tomorrow.

Regards,

David



Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Alessandro Benedetti
I assume you re-index in full, right?
My shot in the dark is that this increase is temporary.
When you re-index, you effectively delete and re-add all documents (this means
that even if the new field is only stored, the entire index is rebuilt for all
the fields).
New segments are created and the old docs are marked as deleted.
Until the background merge happens, the index could reach those sizes.

The weird thing is why the merge didn't kick in...
Have you configured any special approach to segment merging?

What happens if you explicitly optimize?
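
For reference, an explicit optimize can be requested over HTTP through the
update handler. The following is only a sketch: the host, port and core name
are assumptions copied from the schema call in the original post, and the
function names are made up for illustration.

```shell
# Assumed values: host/port and core name as used in the schema call in this thread.
SOLR_BASE="http://localhost:8983/solr"
CORE="address"

# Build the update-handler URL that asks for a forced merge down to one segment
# and waits for the new searcher before returning.
optimize_url() {
  printf '%s/%s/update?optimize=true&maxSegments=1&waitSearcher=true' "$SOLR_BASE" "$CORE"
}

# Send the request; -f makes curl exit non-zero on an HTTP error status.
optimize_core() {
  curl -sf "$(optimize_url)"
}
```

Calling optimize_core against a running server sends the request. On a large
index the merge can take a while and temporarily needs extra disk space.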

Let us know ...




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io


Index size increases disproportionately to size of added field when indexed=false

2018-02-12 Thread Howe, David

Hi,

We are using Solr 7.1.0 to index a database of addresses.  We have found that 
our index size increases massively when we add one extra field to the index, 
even though that field is stored and not indexed, and doesn’t contain a lot of 
data.  When this occurs, we also observe a significant increase in response 
times and CPU usage on the Solr server.

When we run an index load without the problematic field present, the Solr index 
size is 5.5GB.  When we add the field into the index, the size grows to 13.3GB.  
The field itself is a maximum of 46 characters in length and on average is 19 
characters.  We have ~14,000,000 rows in total to index, of which only ~200,000 
have this field present at all (i.e. not null in the database).  Given that we 
don’t want to index the field, only store it, I would have thought (perhaps 
naively) that the storage increase would be approximately 200,000 * 19 = 3.8M 
bytes = 3.6MB rather than the 7.8GB we are seeing.
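
The back-of-envelope estimate above can be reproduced with a few lines of
shell. This is a sketch that assumes one byte per character and ignores any
per-document overhead or compression in the stored-field files; the variable
names are illustrative only.

```shell
rows_with_field=200000   # rows where buildingName is not null (figure from this post)
avg_chars=19             # average field length; assumes one byte per character

# Naive expected growth from storing the field.
expected_bytes=$((rows_with_field * avg_chars))
expected_mib=$(LC_ALL=C awk -v b="$expected_bytes" 'BEGIN { printf "%.1f", b / 1048576 }')

# Observed growth: difference between the two reported index sizes, in GB.
observed_growth_gb=$(LC_ALL=C awk 'BEGIN { printf "%.1f", 13.3 - 5.5 }')

echo "expected: ${expected_bytes} bytes (~${expected_mib} MiB)"
echo "observed: ~${observed_growth_gb} GB"
```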

Some further background on what we are doing:

- We are using the Solr 7.1.0 docker image for our Solr server
- We are importing the data from an Oracle table using JDBC and the standard 
dataimport request handler
- As we want to push the docker image to AWS ECR which only accepts docker 
layers of a maximum of 10GB, we load the index in four separate imports, 
stopping Solr gracefully in between each load
- Our index contains 48 fields in total
- The problematic field is created through the API as follows:

  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field":{
      "name":"buildingName",
      "type":"string",
      "stored":true,
      "indexed":false
    }
  }' http://localhost:8983/solr/address/schema

I have also tried using SolrText instead of string, but that doesn't make a 
noticeable difference.

It also makes a difference how many records are loaded.  If I only load 
1,000,000 records (that have a proportionate number of building names) then the 
size of the index with and without buildingName is about the same (~1GB).
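
Since the growth only shows up at larger record counts, a breakdown of the
index directory by file extension for each build may show which Lucene file
type accounts for the difference (for example, .fdt files hold stored field
data). A minimal sketch follows; the function name is hypothetical and the
index directory has to be supplied, as its location depends on the install.

```shell
# Sum file sizes in an index directory by extension, largest first.
# $1: path to the core's index directory (location depends on the install).
size_by_extension() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    name=${f##*/}
    case "$name" in
      *.*) ext=${name##*.} ;;   # text after the last dot
      *)   ext="(none)" ;;      # e.g. the segments_N file
    esac
    printf '%s\t%s\n' "$ext" "$(wc -c < "$f")"
  done |
    LC_ALL=C awk -F'\t' '{ bytes[$1] += $2 }
      END { for (e in bytes) printf "%s\t%.0f\n", e, bytes[e] }' |
    sort -t "$(printf '\t')" -k2,2nr
}
```

Running it once against the index built without buildingName and once against
the index built with it gives two short tables that can be compared line by
line.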

Is there some sort of limit that I'm not aware of that we are hitting, either 
number of fields or size of data?  Is there some kind of corrupt data that I 
need to look for in the buildingName field that could cause this (it's just a 
varchar2(46) field in Oracle)?

Thanks for your assistance,

David

David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au
