Hi again, I managed to review the code and also to get the headers. I saw one
that looks most suspicious to me and I would like to exclude it, but I couldn't
find out how.
The code seems to look for "config properties", but the user interface does not
seem to allow setting that. Do you know where it is placed?


protected static Set<String> findExcludedHeaders(Specification spec)
    throws ManifoldCFException
  {
    Set<String> rval = new HashSet<String>();
    int i = 0;
    // Walk the job specification and collect the value attribute of every
    // exclude-header node.
    while (i < spec.getChildCount())
    {
      SpecificationNode n = spec.getChild(i++);
      if (n.getType().equals(WebcrawlerConfig.NODE_EXCLUDEHEADER))
      {
        String value = n.getAttributeValue(WebcrawlerConfig.ATTR_VALUE);
        rval.add(value);
      }
    }
    return rval;
  }
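
If I read that correctly, the exclusions live in the job specification document
itself as child nodes. This is a minimal sketch of how such a node might be
added programmatically, assuming the standard ManifoldCF Specification API (the
lowercasing is my assumption, since the version-string code compares lowercased
header names against this set):

// Sketch only: adds one exclude-header node to a web job's Specification.
import java.util.Locale;
import org.apache.manifoldcf.core.interfaces.Specification;
import org.apache.manifoldcf.core.interfaces.SpecificationNode;
import org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConfig;

public class ExcludeHeaderSketch
{
  public static void addExcludedHeader(Specification spec, String headerName)
  {
    SpecificationNode node = new SpecificationNode(WebcrawlerConfig.NODE_EXCLUDEHEADER);
    // The crawler lowercases header names before checking the excluded set,
    // so store the value lowercased as well (my assumption).
    node.setAttribute(WebcrawlerConfig.ATTR_VALUE, headerName.toLowerCase(Locale.ROOT));
    spec.addChild(spec.getChildCount(), node);
  }
}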


Thanks again!

On Thu, Aug 23, 2018 at 14:33, Gustavo Beneitez (<
[email protected]>) wrote:

> Hi,
>
> thanks everyone.
>
> @Karl, many thanks, I am going to write a little test and see what happens.
>
> @Konrad, yes, you are right, I think Liferay is generating something odd
> that might confuse the crawler. Let me write the test and see what it is.
>
> Thanks!
>
> On Thu, Aug 23, 2018 at 14:24, Holl, Konrad (<
> [email protected]>) wrote:
>
>> Did you check the "modified" header returned with the documents from
>> Liferay? Some systems tend to always use "now", which could explain the
>> behavior (this might even be a configuration option). You can see this in a
>> browser's debug window when you reload the page a couple of times (Ctrl+F5
>> to force reloading).
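>>
>> If it helps, here is a minimal sketch (Java 11+, placeholder URL) that does
>> the same check programmatically: it fetches the page twice and prints the
>> Last-Modified header plus a checksum of the body. If either value differs
>> between the two runs, the crawler will also see a "changed" document:
>>
>> import java.math.BigInteger;
>> import java.net.URI;
>> import java.net.http.HttpClient;
>> import java.net.http.HttpRequest;
>> import java.net.http.HttpResponse;
>> import java.security.MessageDigest;
>>
>> public class RecrawlCheck
>> {
>>   public static void main(String[] args) throws Exception
>>   {
>>     HttpClient client = HttpClient.newHttpClient();
>>     HttpRequest request = HttpRequest.newBuilder(
>>         URI.create("http://intranet.example.com/somepage")).GET().build();
>>     for (int i = 0; i < 2; i++)
>>     {
>>       HttpResponse<byte[]> response =
>>           client.send(request, HttpResponse.BodyHandlers.ofByteArray());
>>       // A server that always answers "now" shows a new value here each time.
>>       String lastModified =
>>           response.headers().firstValue("Last-Modified").orElse("(absent)");
>>       // A page with dynamic content shows a new checksum here each time.
>>       String checksum = new BigInteger(1, MessageDigest.getInstance("SHA-256")
>>           .digest(response.body())).toString(16);
>>       System.out.println("Last-Modified: " + lastModified + " / SHA-256: " + checksum);
>>     }
>>   }
>> }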
>>
>>
>> -Konrad
>>
>> ------------------------------
>> *From:* Karl Wright <[email protected]>
>> *Sent:* Thursday, August 23, 2018 14:18
>> *To:* [email protected]
>> *Subject:* [External] Re: Documents that didn't change are reindexed
>>
>> I would suggest downloading the pages using curl a couple of times and
>> comparing content.
>> Headers also matter.  Here's the code:
>>
>> >>>>>>
>>             // Calculate version from document data, which is presumed to be present.
>>             StringBuilder sb = new StringBuilder();
>>
>>             // Acls
>>             packList(sb,acls,'+');
>>             if (acls.length > 0)
>>             {
>>               sb.append('+');
>>               pack(sb,defaultAuthorityDenyToken,'+');
>>             }
>>             else
>>               sb.append('-');
>>
>>             // Now, do the metadata.
>>             Map<String,Set<String>> metaHash = new HashMap<String,Set<String>>();
>>
>>             String[] fixedListStrings = new String[2];
>>             // They're all folded into the same part of the version string.
>>             int headerCount = 0;
>>             Iterator<String> headerIterator = fetchStatus.headerData.keySet().iterator();
>>             while (headerIterator.hasNext())
>>             {
>>               String headerName = headerIterator.next();
>>               String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
>>               if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
>>                 headerCount += fetchStatus.headerData.get(headerName).size();
>>             }
>>             String[] fullMetadata = new String[headerCount];
>>             headerCount = 0;
>>             headerIterator = fetchStatus.headerData.keySet().iterator();
>>             while (headerIterator.hasNext())
>>             {
>>               String headerName = headerIterator.next();
>>               String lowerHeaderName = headerName.toLowerCase(Locale.ROOT);
>>               if (!reservedHeaders.contains(lowerHeaderName) && !excludedHeaders.contains(lowerHeaderName))
>>               {
>>                 Set<String> valueSet = metaHash.get(headerName);
>>                 if (valueSet == null)
>>                 {
>>                   valueSet = new HashSet<String>();
>>                   metaHash.put(headerName,valueSet);
>>                 }
>>                 List<String> headerValues = fetchStatus.headerData.get(headerName);
>>                 for (String headerValue : headerValues)
>>                 {
>>                   valueSet.add(headerValue);
>>                   fixedListStrings[0] = "header-"+headerName;
>>                   fixedListStrings[1] = headerValue;
>>                   StringBuilder newsb = new StringBuilder();
>>                   packFixedList(newsb,fixedListStrings,'=');
>>                   fullMetadata[headerCount++] = newsb.toString();
>>                 }
>>               }
>>             }
>>             java.util.Arrays.sort(fullMetadata);
>>
>>             packList(sb,fullMetadata,'+');
>>             // Done with the parseable part!  Add the checksum.
>>             sb.append(fetchStatus.checkSum);
>>             // Add the filter version
>>             sb.append("+");
>>             sb.append(filterVersion);
>>
>>             String versionString = sb.toString();
>> <<<<<<
>>
>> The "filter version" comes from your job specification and will change
>> only if you change the job specification, but everything else should be
>> self-explanatory.  Looks like all headers matter, so that could explain it.
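>>
>> As a toy illustration (not the real pack() escaping, and X-Request-Id is
>> just a hypothetical volatile header), one header value that changes per
>> request is enough to change the version string even when the checksum part
>> is identical:
>>
>> String crawl1 = "...+header-X-Request-Id=abc123+<checksum>+<filterVersion>";
>> String crawl2 = "...+header-X-Request-Id=def456+<checksum>+<filterVersion>";
>> System.out.println(crawl1.equals(crawl2)); // false -> document is reindexed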
>>
>> Karl
>>
>>
>> On Thu, Aug 23, 2018 at 5:56 AM Gustavo Beneitez <
>> [email protected]> wrote:
>>
>> Thanks Karl,
>>
>> I've been launching the job a couple of times with a small set of
>> documents, and what I see is that Elastic reindexes every document each
>> time, even though the size of each document is always the same and I don't
>> notice any dynamic HTML content (like the current time) that could cause
>> the checksum to be different.
>>
>> Consulting the "Simple history" menu option shows that Elastic output
>> connector is called
>> "08-23-2018 06:27:19.274 Indexation (Elasticsearch 2.4.6)"
>> So I guess there is a miss-configuration somewhere...
>>
>>
>>
>> On Thu, Aug 23, 2018 at 1:45, Karl Wright (<[email protected]>)
>> wrote:
>>
>> Hi Gustavo,
>>
>> I take it from your question that you are using the Web Connector?
>>
>> All connectors create a version string that is used to determine whether
>> content needs to be reindexed or not.  The Web Connector's version string
>> uses a checksum of the page contents; we found the "last modified" header
>> to be unreliable, if I recall correctly.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Aug 22, 2018 at 12:35 PM Gustavo Beneitez <
>> [email protected]> wrote:
>>
>> Hi everyone,
>>
>> I am currently creating a job that indexes part of a Liferay intranet.
>> Every time the job is executed, the documents are fully reindexed in
>> Elastic, even though they didn't change.
>> I thought I had read somewhere that the crawler uses the "last-modified"
>> HTTP header, but also that it saves a hash into the database.
>> I was looking for the right one in the user's manual but had no luck, so
>> could you please tell me which one is correct?
>>
>> Thanks in advance!
>>
>>
>
