Hi, thanks everyone.
@Karl, many thanks. I am going to write a little test and see what happens.
@Konrad, yes, you are right. I think Liferay is generating something wrong that might confuse the crawler. Let me write the test and see what it is.

Thanks!

On Thu, Aug 23, 2018 at 14:24, Holl, Konrad (<[email protected]>) wrote:

> Did you check the "modified" header returned with the documents from
> Liferay? Some systems tend to always use "now", which could explain the
> behavior (this might even be a configuration option). You can see this
> in a browser's debug window when you reload the page a couple of times
> (Ctrl+F5 to force reloading).
>
> -Konrad
>
> ------------------------------
> *From:* Karl Wright <[email protected]>
> *Sent:* Thursday, August 23, 2018 14:18
> *To:* [email protected]
> *Subject:* [External] Re: Documents that didn't change are reindexed
>
> I would suggest downloading the pages using curl a couple of times and
> comparing content.
> Headers also matter. Here's the code:
>
> >>>>>>
> // Calculate version from document data, which is presumed to be present.
> StringBuilder sb = new StringBuilder();
>
> // Acls
> packList(sb,acls,'+');
> if (acls.length > 0)
> {
>   sb.append('+');
>   pack(sb,defaultAuthorityDenyToken,'+');
> }
> else
>   sb.append('-');
>
> // Now, do the metadata.
> Map<String,Set<String>> metaHash = new HashMap<String,Set<String>>();
>
> String[] fixedListStrings = new String[2];
> // They're all folded into the same part of the version string.
> int headerCount = 0;
> Iterator<String> headerIterator = fetchStatus.headerData.keySet().iterator();
> while (headerIterator.hasNext())
> {
>   String headerName = headerIterator.next();
>   String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
>   if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
>     headerCount += fetchStatus.headerData.get(headerName).size();
> }
> String[] fullMetadata = new String[headerCount];
> headerCount = 0;
> headerIterator = fetchStatus.headerData.keySet().iterator();
> while (headerIterator.hasNext())
> {
>   String headerName = headerIterator.next();
>   String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
>   if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
>   {
>     Set<String> valueSet = metaHash.get(headerName);
>     if (valueSet == null)
>     {
>       valueSet = new HashSet<String>();
>       metaHash.put(headerName,valueSet);
>     }
>     List<String> headerValues = fetchStatus.headerData.get(headerName);
>     for (String headerValue : headerValues)
>     {
>       valueSet.add(headerValue);
>       fixedListStrings[0] = "header-"+headerName;
>       fixedListStrings[1] = headerValue;
>       StringBuilder newsb = new StringBuilder();
>       packFixedList(newsb,fixedListStrings,'=');
>       fullMetadata[headerCount++] = newsb.toString();
>     }
>   }
> }
> java.util.Arrays.sort(fullMetadata);
>
> packList(sb,fullMetadata,'+');
> // Done with the parseable part! Add the checksum.
> sb.append(fetchStatus.checkSum);
> // Add the filter version
> sb.append("+");
> sb.append(filterVersion);
>
> String versionString = sb.toString();
> <<<<<<
>
> The "filter version" comes from your job specification and will change
> only if you change the job specification, but everything else should be
> self-explanatory. Looks like all headers matter, so that could explain it.
>
> Karl
>
>
> On Thu, Aug 23, 2018 at 5:56 AM Gustavo Beneitez <[email protected]> wrote:
>
> Thanks Karl,
>
> I've been launching the job a couple of times with a small set of
> documents, and what I see is that Elasticsearch indexes each document
> every time, even though the size of each document is always the same and
> I don't notice any "HTML dynamic content", like the current time, that
> could cause the checksum to differ.
>
> Consulting the "Simple history" menu option shows that the Elastic
> output connector is called:
> "08-23-2018 06:27:19.274 Indexation (Elasticsearch 2.4.6)"
> So I guess there is a misconfiguration somewhere...
>
>
> On Thu, Aug 23, 2018 at 1:45, Karl Wright (<[email protected]>) wrote:
>
> Hi Gustavo,
>
> I take it from your question that you are using the Web Connector?
>
> All connectors create a version string that is used to determine whether
> content needs to be reindexed or not. The Web Connector's version string
> uses a checksum of the page contents; we found the "last modified" header
> to be unreliable, if I recall correctly.
>
> Thanks,
> Karl
>
>
> On Wed, Aug 22, 2018 at 12:35 PM Gustavo Beneitez <[email protected]> wrote:
>
> Hi everyone,
>
> I am currently creating a job that indexes part of the Liferay intranet
> content.
> Every time the job is executed, the documents are fully reindexed in
> Elastic, even though they didn't change.
> I thought I had read somewhere that the crawler uses the "last-modified"
> HTTP header, but also that it saves a hash in the database.
> I looked for the right answer in the user's manual but had no luck, so
> could you please tell me which is correct?
>
> Thanks in advance!
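
For anyone following along, here is a minimal sketch of the kind of test discussed in this thread: take the headers and body from two fetches of the same URL and report which headers differ and whether the body checksum changed. Everything in it is an assumption for illustration. The class `FetchDiff` and its methods are hypothetical and not part of ManifoldCF, SHA-256 is only a stand-in for whatever checksum the connector computes, and the sketch does not apply the `reservedHeaders` / `excludedHeaders` filters from the quoted code because their contents are not shown here. The sample data is made up and held in memory, so nothing is fetched over the network.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of ManifoldCF: compares two fetches of one URL.
class FetchDiff
{
  // Fold header names to lower case and collect each header's values as a set.
  static Map<String,Set<String>> normalize(Map<String,List<String>> headers)
  {
    Map<String,Set<String>> result = new TreeMap<>();
    for (Map.Entry<String,List<String>> e : headers.entrySet())
    {
      // HttpURLConnection.getHeaderFields() reports the status line under a null key.
      if (e.getKey() == null)
        continue;
      String name = e.getKey().toLowerCase(Locale.ROOT);
      result.computeIfAbsent(name, k -> new HashSet<>()).addAll(e.getValue());
    }
    return result;
  }

  // Names of headers that were added, removed, or changed value between fetches.
  static Set<String> changedHeaders(Map<String,List<String>> first, Map<String,List<String>> second)
  {
    Map<String,Set<String>> a = normalize(first);
    Map<String,Set<String>> b = normalize(second);
    Set<String> names = new TreeSet<>(a.keySet());
    names.addAll(b.keySet());
    Set<String> changed = new TreeSet<>();
    for (String name : names)
    {
      if (!Objects.equals(a.get(name), b.get(name)))
        changed.add(name);
    }
    return changed;
  }

  // Stand-in checksum (assumption): SHA-256 of the body, as lower-case hex.
  static String sha256Hex(byte[] body)
  {
    try
    {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      StringBuilder sb = new StringBuilder();
      for (byte x : md.digest(body))
        sb.append(String.format("%02x", x));
      return sb.toString();
    }
    catch (NoSuchAlgorithmException e)
    {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args)
  {
    // Made-up values standing in for two fetches of the same page.
    Map<String,List<String>> first = Map.of(
      "Content-Type", List.of("text/html"),
      "Last-Modified", List.of("Thu, 23 Aug 2018 12:00:00 GMT"));
    Map<String,List<String>> second = Map.of(
      "Content-Type", List.of("text/html"),
      "Last-Modified", List.of("Thu, 23 Aug 2018 12:05:00 GMT"));
    byte[] body = "<html>same</html>".getBytes(StandardCharsets.UTF_8);

    System.out.println("Headers that differ: " + changedHeaders(first, second));
    System.out.println("Body unchanged: " + sha256Hex(body).equals(sha256Hex(body)));
  }
}
```

With real data, the two header maps could come from `HttpURLConnection.getHeaderFields()` on two successive requests. Any header name that shows up in the result and is not excluded in the job would change the version string on every crawl, which would match the behavior described above.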
