I would suggest downloading the pages with curl a couple of times and
comparing the content.
Headers also matter, since they are folded into the version string. Here's
the code that computes it:
>>>>>>
// Calculate version from document data, which is presumed to be present.
StringBuilder sb = new StringBuilder();
// Acls
packList(sb,acls,'+');
if (acls.length > 0)
{
  sb.append('+');
  pack(sb,defaultAuthorityDenyToken,'+');
}
else
  sb.append('-');
// Now, do the metadata.
Map<String,Set<String>> metaHash = new HashMap<String,Set<String>>();
String[] fixedListStrings = new String[2];
// They're all folded into the same part of the version string.
int headerCount = 0;
Iterator<String> headerIterator = fetchStatus.headerData.keySet().iterator();
while (headerIterator.hasNext())
{
  String headerName = headerIterator.next();
  String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
  if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
    headerCount += fetchStatus.headerData.get(headerName).size();
}
String[] fullMetadata = new String[headerCount];
headerCount = 0;
headerIterator = fetchStatus.headerData.keySet().iterator();
while (headerIterator.hasNext())
{
  String headerName = headerIterator.next();
  String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
  if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
  {
    Set<String> valueSet = metaHash.get(headerName);
    if (valueSet == null)
    {
      valueSet = new HashSet<String>();
      metaHash.put(headerName,valueSet);
    }
    List<String> headerValues = fetchStatus.headerData.get(headerName);
    for (String headerValue : headerValues)
    {
      valueSet.add(headerValue);
      fixedListStrings[0] = "header-"+headerName;
      fixedListStrings[1] = headerValue;
      StringBuilder newsb = new StringBuilder();
      packFixedList(newsb,fixedListStrings,'=');
      fullMetadata[headerCount++] = newsb.toString();
    }
  }
}
java.util.Arrays.sort(fullMetadata);
packList(sb,fullMetadata,'+');
// Done with the parseable part! Add the checksum.
sb.append(fetchStatus.checkSum);
// Add the filter version
sb.append("+");
sb.append(filterVersion);
String versionString = sb.toString();
<<<<<<
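For intuition, the pack/packFixedList/packList calls above serialize each
value into the version string with a delimiter and escaping, so any change in
an included header value produces a different version string. Here is a
minimal stand-in showing the general escape-and-delimit idea; this is an
illustration only, not the actual ManifoldCF framework methods:
>>>>>>
// Simplified sketch of escape-and-delimit packing (illustrative only;
// the real pack/packFixedList live in the ManifoldCF core framework).
static void pack(StringBuilder sb, String value, char delimiter)
{
  for (int i = 0; i < value.length(); i++)
  {
    char c = value.charAt(i);
    // Escape the delimiter and the escape character itself.
    if (c == delimiter || c == '\\')
      sb.append('\\');
    sb.append(c);
  }
  // Terminate the field.
  sb.append(delimiter);
}

static void packFixedList(StringBuilder sb, String[] values, char delimiter)
{
  for (String value : values)
    pack(sb, value, delimiter);
}
<<<<<<
With that scheme, a pair like {"header-X-Request-Id", "abc123"} packs to
header-X-Request-Id=abc123=; if the value differs on the next fetch, the
version string differs and the document is reindexed.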
The "filter version" comes from your job specification and will change only
if you change the job specification, but everything else should be
self-explanatory. Looks like all headers matter, so that could explain it.
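Along the lines of the curl suggestion above, you could also fetch the page
twice from a small standalone program and diff both the body checksum and the
headers. Below is a minimal sketch using the stock JDK 11+ HttpClient; the
URL and class name are placeholders, and this is separate from any connector
code:
>>>>>>
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class FetchDiff
{
  public static void main(String[] args) throws Exception
  {
    // Placeholder URL; substitute one of the pages that keeps reindexing.
    String url = "https://intranet.example.com/some/page";
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

    // Fetch the same page twice.
    HttpResponse<byte[]> first = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
    HttpResponse<byte[]> second = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

    // Compare the bodies, roughly what the connector's checksum captures.
    System.out.println("bodies equal: " + Arrays.equals(first.body(), second.body()));
    System.out.println("md5 #1: " + md5(first.body()));
    System.out.println("md5 #2: " + md5(second.body()));

    // Print every header whose values differ between the two fetches.
    Map<String,List<String>> h1 = first.headers().map();
    Map<String,List<String>> h2 = second.headers().map();
    for (String name : h1.keySet())
    {
      if (!h1.get(name).equals(h2.get(name)))
        System.out.println("header changed: " + name + ": " + h1.get(name) + " -> " + h2.get(name));
    }
  }

  static String md5(byte[] data) throws Exception
  {
    StringBuilder sb = new StringBuilder();
    for (byte b : MessageDigest.getInstance("MD5").digest(data))
      sb.append(String.format("%02x", b));
    return sb.toString();
  }
}
<<<<<<
Any header that shows up as changed there, and that isn't reserved or
excluded by the connector, will change the version string on every crawl.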
Karl
On Thu, Aug 23, 2018 at 5:56 AM Gustavo Beneitez <[email protected]>
wrote:
> Thanks Karl,
>
> I've been launching the job a couple of times with a small set of
> documents, and what I see is that Elastic indexes every document each
> time, even though the size of each document is always the same and I
> don't notice any dynamic HTML content (like the current time) that could
> cause the checksum to differ.
>
> Consulting the "Simple History" menu option shows that the Elasticsearch
> output connector is called:
> "08-23-2018 06:27:19.274 Indexation (Elasticsearch 2.4.6)"
> So I guess there is a misconfiguration somewhere...
>
>
>
> On Thu, Aug 23, 2018 at 1:45 AM, Karl Wright (<[email protected]>)
> wrote:
>
>> Hi Gustavo,
>>
>> I take it from your question that you are using the Web Connector?
>>
>> All connectors create a version string that is used to determine whether
>> content needs to be reindexed or not. The Web Connector's version string
>> uses a checksum of the page contents; we found the "last modified" header
>> to be unreliable, if I recall correctly.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Aug 22, 2018 at 12:35 PM Gustavo Beneitez <
>> [email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> I am currently creating a job that indexes part of Liferay intranet
>>> content.
>>> Every time the job is executed, the documents are fully reindexed in
>>> Elastic, even though they didn't change.
>>> I thought I had read somewhere that the crawler uses the "last-modified"
>>> HTTP header, but also that it saves a hash into the database.
>>> I looked for the right answer in the user's manual with no luck, so
>>> could you please tell me which is correct?
>>>
>>> Thanks in advance!
>>>
>>