Did you check the "Last-Modified" header returned with the documents from 
Liferay? Some systems always set it to the current time, which could explain 
the behavior (this might even be a configuration option). You can see this in 
a browser's debug window when you reload the page a couple of times (Ctrl+F5 
to force a reload).
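
If the browser view is awkward, the same check can be scripted. Below is a 
minimal sketch that compares the response headers of two fetches of the same 
page and lists the ones that differ. The class and method names are made up 
for illustration, and the header values in main are sample data, not real 
Liferay output.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of ManifoldCF or Liferay.
final class HeaderDiff {
  /** Returns the lower-cased names of headers whose values differ between two fetches. */
  static Set<String> changedHeaders(Map<String, List<String>> first,
                                    Map<String, List<String>> second) {
    Map<String, List<String>> a = normalize(first);
    Map<String, List<String>> b = normalize(second);
    Set<String> names = new TreeSet<>(a.keySet());
    names.addAll(b.keySet());
    Set<String> changed = new TreeSet<>();
    for (String name : names) {
      // A header present in only one fetch also counts as changed.
      if (!Objects.equals(a.get(name), b.get(name))) changed.add(name);
    }
    return changed;
  }

  // Header names are case-insensitive, so compare them in lower case.
  private static Map<String, List<String>> normalize(Map<String, List<String>> headers) {
    Map<String, List<String>> out = new TreeMap<>();
    for (Map.Entry<String, List<String>> e : headers.entrySet()) {
      // HttpURLConnection.getHeaderFields() stores the status line under a null key.
      if (e.getKey() != null) out.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
    }
    return out;
  }

  public static void main(String[] args) {
    // Sample data standing in for two fetches of the same page.
    Map<String, List<String>> fetch1 = Map.of(
        "Content-Type", List.of("text/html"),
        "Last-Modified", List.of("Thu, 23 Aug 2018 10:00:00 GMT"));
    Map<String, List<String>> fetch2 = Map.of(
        "Content-Type", List.of("text/html"),
        "Last-Modified", List.of("Thu, 23 Aug 2018 10:05:00 GMT"));
    System.out.println(changedHeaders(fetch1, fetch2)); // prints [last-modified]
  }
}
```

Any header that shows up in this list on every run, although the page itself 
did not change, is worth a closer look.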


-Konrad

________________________________
From: Karl Wright <[email protected]>
Sent: Thursday, August 23, 2018 14:18
To: [email protected]
Subject: [External] Re: Documents that didn't change are reindexed

I would suggest downloading the pages using curl a couple of times and 
comparing content.
Headers also matter.  Here's the code:

>>>>>>
            // Calculate version from document data, which is presumed to be present.
            StringBuilder sb = new StringBuilder();

            // Acls
            packList(sb,acls,'+');
            if (acls.length > 0)
            {
              sb.append('+');
              pack(sb,defaultAuthorityDenyToken,'+');
            }
            else
              sb.append('-');

            // Now, do the metadata.
            Map<String,Set<String>> metaHash = new HashMap<String,Set<String>>();

            String[] fixedListStrings = new String[2];
            // They're all folded into the same part of the version string.
            int headerCount = 0;
            Iterator<String> headerIterator = fetchStatus.headerData.keySet().iterator();
            while (headerIterator.hasNext())
            {
              String headerName = headerIterator.next();
              String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
              if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
                headerCount += fetchStatus.headerData.get(headerName).size();
            }
            String[] fullMetadata = new String[headerCount];
            headerCount = 0;
            headerIterator = fetchStatus.headerData.keySet().iterator();
            while (headerIterator.hasNext())
            {
              String headerName = headerIterator.next();
              String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
              if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
              {
                Set<String> valueSet = metaHash.get(headerName);
                if (valueSet == null)
                {
                  valueSet = new HashSet<String>();
                  metaHash.put(headerName,valueSet);
                }
                List<String> headerValues = fetchStatus.headerData.get(headerName);
                for (String headerValue : headerValues)
                {
                  valueSet.add(headerValue);
                  fixedListStrings[0] = "header-"+headerName;
                  fixedListStrings[1] = headerValue;
                  StringBuilder newsb = new StringBuilder();
                  packFixedList(newsb,fixedListStrings,'=');
                  fullMetadata[headerCount++] = newsb.toString();
                }
              }
            }
            java.util.Arrays.sort(fullMetadata);

            packList(sb,fullMetadata,'+');
            // Done with the parseable part!  Add the checksum.
            sb.append(fetchStatus.checkSum);
            // Add the filter version
            sb.append("+");
            sb.append(filterVersion);

            String versionString = sb.toString();
<<<<<<

The "filter version" comes from your job specification and will change only if 
you change the job specification, but everything else should be 
self-explanatory.  Looks like every header that is not in the reserved or 
excluded sets goes into the version string, so a header whose value changes on 
each fetch could explain it.
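
To illustrate the effect, here is a simplified sketch, not the connector's 
actual code: it builds a version string from sorted header pairs plus a 
checksum, in the spirit of the snippet above, without reproducing the real 
pack/packList encoding. The class name, the sample headers, and the way the 
excluded set is passed in are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the version string logic; not ManifoldCF code.
final class VersionSketch {
  static String version(Map<String, List<String>> headers,
                        Set<String> excluded,
                        String checksum) {
    List<String> parts = new ArrayList<>();
    for (Map.Entry<String, List<String>> e : headers.entrySet()) {
      // Skip headers the caller has excluded (compared in lower case).
      if (excluded.contains(e.getKey().toLowerCase(Locale.ROOT))) continue;
      for (String value : e.getValue()) {
        parts.add("header-" + e.getKey() + "=" + value);
      }
    }
    // Sort so the result does not depend on map iteration order.
    String[] sorted = parts.toArray(new String[0]);
    Arrays.sort(sorted);
    return String.join("+", sorted) + "+" + checksum;
  }

  public static void main(String[] args) {
    // Two fetches of an unchanged page: same checksum, different Last-Modified.
    Map<String, List<String>> fetch1 = Map.of(
        "Content-Type", List.of("text/html"),
        "Last-Modified", List.of("Thu, 23 Aug 2018 10:00:00 GMT"));
    Map<String, List<String>> fetch2 = Map.of(
        "Content-Type", List.of("text/html"),
        "Last-Modified", List.of("Thu, 23 Aug 2018 10:05:00 GMT"));
    String checksum = "abc123"; // placeholder value

    Set<String> none = Set.of();
    Set<String> skipLastModified = Set.of("last-modified");

    // Versions differ while the volatile header is included ...
    System.out.println(
        version(fetch1, none, checksum).equals(version(fetch2, none, checksum)));
    // ... and match once it is excluded.
    System.out.println(
        version(fetch1, skipLastModified, checksum)
            .equals(version(fetch2, skipLastModified, checksum)));
  }
}
```

The first comparison prints false and the second prints true: a single header 
that changes on every request is enough to make an otherwise identical page 
look new.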

Karl


On Thu, Aug 23, 2018 at 5:56 AM Gustavo Beneitez <[email protected]> wrote:
Thanks Karl,

I've launched the job a couple of times with a small set of documents, and 
what I see is that Elasticsearch indexes every document each time, even though 
the size of each document is always the same and I don't notice any dynamic 
HTML content, such as the current time, that could cause the checksum to differ.

The "Simple history" menu option shows that the Elastic output connector is 
called:
"08-23-2018 06:27:19.274        Indexation (Elasticsearch 2.4.6)"

So I guess there is a misconfiguration somewhere...



On Thu, Aug 23, 2018 at 1:45 AM Karl Wright <[email protected]> wrote:
Hi Gustavo,

I take it from your question that you are using the Web Connector?

All connectors create a version string that is used to determine whether 
content needs to be reindexed or not.  The Web Connector's version string uses 
a checksum of the page contents; we found the "last modified" header to be 
unreliable, if I recall correctly.
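
As a rough illustration of the checksum idea, assuming SHA-256 (the algorithm 
the connector actually uses is not stated here) and a made-up class name, two 
downloads of a page can be compared like this:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: the hash choice and names are assumptions.
final class ContentChecksum {
  /** Returns the SHA-256 digest of the content as a lower-case hex string. */
  static String sha256Hex(byte[] content) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(content)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide SHA-256.
      throw new IllegalStateException(e);
    }
  }

  /** True when both downloads hash to the same value. */
  static boolean sameContent(byte[] first, byte[] second) {
    return sha256Hex(first).equals(sha256Hex(second));
  }

  public static void main(String[] args) {
    // Sample bodies standing in for two downloads of the same URL.
    byte[] run1 = "<html><body>Intranet page</body></html>"
        .getBytes(StandardCharsets.UTF_8);
    byte[] run2 = "<html><body>Intranet page</body></html>"
        .getBytes(StandardCharsets.UTF_8);
    byte[] run3 = "<html><body>Intranet page 06:27:19</body></html>"
        .getBytes(StandardCharsets.UTF_8);

    System.out.println(sameContent(run1, run2)); // prints true
    System.out.println(sameContent(run1, run3)); // prints false
  }
}
```

Even a small dynamic fragment in the page, such as an embedded timestamp, 
yields a different checksum and therefore a different version string.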

Thanks,
Karl


On Wed, Aug 22, 2018 at 12:35 PM Gustavo Beneitez <[email protected]> wrote:
Hi everyone,

I am currently creating a job that indexes part of our Liferay intranet content.
Every time the job runs, the documents are fully reindexed in Elastic, even 
though they didn't change.
I thought I had read somewhere that the crawler uses the "last-modified" HTTP 
header, but also that it saves a hash in the database.
I looked for the answer in the user's manual but had no luck, so could you 
please tell me which one is correct?

Thanks in advance!
