Re: PostgreSQL version to support MCF v2.10

Steph van Schalkwyk Wed, 05 Sep 2018 08:56:04 -0700

Thank you Karl.
You are of course correct: the incremental crawl is now broken, in that
it does a full crawl every time.
I'll jump on the Web Connector and add that functionality.
Thanks for this excellent application and all the help over the years.
Steph





*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896    [email protected]    http://remcam.net
Skype: svanschalkwyk    http://linkedin.com/in/vanschalkwyk

On Wed, Sep 5, 2018 at 6:33 AM, Karl Wright <[email protected]> wrote:

> The patch I uploaded doesn't work because the entire tab is broken; looks
> like the UI refactoring broke it and it was never reported.  Fixing now.
> Karl
>
>
> On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <[email protected]> wrote:
>
>> I coded up the web connector feature I think we need.  See
>> CONNECTORS-1528; I've attached a patch.  Please apply and test it out to
>> see if it solves the case problem for your IIS site.
>>
>> For the "//" issue, can you be more specific about the mapping you need
>> to do?
>>
>> Karl
>>
>>
>> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <[email protected]> wrote:
>>
>>> Hi Steph,
>>>
>>> Right, you wouldn't want to touch the framework.
>>>
>>> The effect of lower-casing the documentURI parameter in the
>>> addOrReplaceDocumentWithException method in an output connector would
>>> be to map multiple independently fetched documents that differ only by
>>> the case of the URL together into one document in the index.  The
>>> ManifoldCF assumption is that a document with a certain URI can be tracked
>>> in the index using exactly that URI.  Mapping the URI to lower case would
>>> break that assumption so the framework would make the wrong decision in
>>> many cases.
>>>
>>> If you are picking up documents using the web connector, and you are
>>> getting duplicate documents because the document URLs are sloppy, it is
>>> therefore essential that INSTEAD of mapping the document URI to lower case
>>> in the output connector, you map to lower case in the repository
>>> connector.  Otherwise the framework will not work right.
>>>
>>> There is a tab in the web connector that allows you to configure URL
>>> normalization, called "Canonicalization".  This would be a very appropriate
>>> place to add URL mapping to lower case.  It should be as simple as adding
>>> one more checkbox column in the table, and modifying the method that does
>>> the URL processing to include lower-casing.
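A minimal sketch of the repository-side mapping described above. The class name `UrlCaseNormalizer` and the `lowercase` flag (standing in for the proposed checkbox) are hypothetical and do not come from the web connector's code; the real canonicalization method may be structured quite differently. Leaving the query string and fragment untouched is also an assumption, since the thread does not say how far the lower-casing should reach.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
final class UrlCaseNormalizer {

    /**
     * Lower-cases everything before the first '?' or '#' when the flag is set.
     * The query string and fragment are left as-is (an assumption).
     */
    static String normalize(String url, boolean lowercase) {
        if (!lowercase) {
            return url; // checkbox off: keep the URL exactly as fetched
        }
        int cut = url.length();
        int q = url.indexOf('?');
        int h = url.indexOf('#');
        if (q >= 0) cut = Math.min(cut, q);
        if (h >= 0) cut = Math.min(cut, h);
        // Locale.ROOT avoids locale-specific case rules (e.g. Turkish dotless i).
        return url.substring(0, cut).toLowerCase(Locale.ROOT) + url.substring(cut);
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://Example.com/Docs/Page.ASPX?Id=AbC", true));
        System.out.println(normalize("http://Example.com/Docs/Page.ASPX?Id=AbC", false));
    }
}
```

Because the mapping happens before the document identifier is handed to the framework, every later step (version tracking, output) sees one consistent URI.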
>>>
>>> Karl
>>>
>>>
>>>
>>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <[email protected]>
>>> wrote:
>>>
>>>> Unless I have a massive misunderstanding somewhere...
>>>>
>>>>
>>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Karl
>>>>> I'm addressing it in the ES Output Connector.
>>>>> Not touching the framework :)
>>>>> S
>>>>>
>>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Let's make sure we're talking about the same thing.
>>>>>>
>>>>>> Here is the output connector method that receives the ID (as the
>>>>>> documentURI parameter):
>>>>>>
>>>>>>   public int addOrReplaceDocumentWithException(String documentURI,
>>>>>>     VersionContext pipelineDescription, RepositoryDocument document,
>>>>>>     String authorityNameString, IOutputAddActivity activities)
>>>>>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>>
>>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.  If
>>>>>> you make it case insensitive in an output connector, this will potentially
>>>>>> break a lot of things, for example incremental indexing (which organizes
>>>>>> the last indexed version by document ID).
>>>>>>
>>>>>> I therefore highly recommend that any "sloppiness" in this parameter
>>>>>> be addressed in the Repository Connector that constructs it.  If the
>>>>>> connector is crawling a repository that believes that URLs are case
>>>>>> insensitive then it should map these IDs to lower case.  If not, then it
>>>>>> shouldn't.
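The failure mode can be illustrated with a toy model. This is not ManifoldCF code: `tracked` is a hypothetical stand-in for the framework's record of the last indexed version per document URI, and `index` stands in for an index whose `_id` is the lower-cased URI, as an output connector that folds case would produce.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy model for illustration only; not how ManifoldCF is implemented. */
final class CaseFoldingDemo {
    // What the framework believes is indexed: exact URI -> version.
    final Map<String, String> tracked = new HashMap<>();
    // What the index actually holds: lower-cased _id -> version.
    final Map<String, String> index = new HashMap<>();

    void add(String uri, String version) {
        tracked.put(uri, version);
        index.put(uri.toLowerCase(Locale.ROOT), version); // case folded on output
    }

    void remove(String uri) {
        tracked.remove(uri);
        index.remove(uri.toLowerCase(Locale.ROOT)); // removes the shared entry
    }

    public static void main(String[] args) {
        CaseFoldingDemo demo = new CaseFoldingDemo();
        demo.add("http://host/Page.aspx", "v1");
        demo.add("http://host/page.aspx", "v2");
        demo.remove("http://host/Page.aspx");
        System.out.println("framework still tracks: " + demo.tracked.keySet());
        System.out.println("index actually holds:   " + demo.index.keySet());
    }
}
```

After the removal, the tracking map still lists `http://host/page.aspx` while the index is empty, so the two views have drifted apart. Folding case where the URI is constructed avoids this, because only one URI ever reaches the tracking layer.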
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl.
>>>>>>> The issue is that the ES Output Connector uses the URI to create the
>>>>>>> _id. When used with IIS, which allows case variation in the URI, it
>>>>>>> creates multiple documents. Clients on Windows IIS are rarely cognizant
>>>>>>> of that issue, as IIS is so lax in policing that out of the box.
>>>>>>> Currently, every case variation in the URI results in a new doc in the
>>>>>>> index. This is only in the ES output connector.
>>>>>>> I can add an optional checkbox to control that particular
>>>>>>> action if that would help?
>>>>>>> Regards,
>>>>>>> Steph
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the update.
>>>>>>>> Lower-casing the ID would be fine except there are some connectors
>>>>>>>> that care about case.  The web connector is one such because it's up
>>>>>>>> to the web service to decide if case matters, so the web connector does
>>>>>>>> not view URLs with case differences as being the same.  Other connectors
>>>>>>>> will likely care as well.  So I don't think lower-casing the document ID
>>>>>>>> is a smart thing to do.
>>>>>>>>
>>>>>>>> You could add this bit of configuration to the web connector, if
>>>>>>>> that's what you are using, or to whatever other connector constructs
>>>>>>>> the ID.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks Karl.
>>>>>>>>>
>>>>>>>>> I'll look into that.
>>>>>>>>>
>>>>>>>>> Another note:
>>>>>>>>> Regarding the ES connector - I have made three additions to it and
>>>>>>>>> should probably diff them for inclusion after approval:
>>>>>>>>> 1. Lowercased the _id (the doc URI).
>>>>>>>>> 2. Removed doubled "/", e.g. "//", in the _id (I have sloppy sources,
>>>>>>>>> particularly IIS...)
>>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x
>>>>>>>>> does not allow access to _id in the schema anymore, so no copy_field
>>>>>>>>> etc. from _id). Hence "url".
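One plausible reading of item 2, as a sketch only: the exact mapping is not spelled out in this thread, so the rule below (collapse runs of "/" in the host-and-path part of an absolute URL, leaving the "://" after the scheme and anything from "?" or "#" onward untouched) is an assumption, and `SlashCollapser` is a hypothetical name.

```java
/** Hypothetical helper for illustration only; assumes absolute URLs. */
final class SlashCollapser {

    static String collapse(String url) {
        // Skip past "scheme://" so the separator itself is not collapsed.
        int schemeEnd = url.indexOf("://");
        int start = schemeEnd >= 0 ? schemeEnd + 3 : 0;

        // Stop at the query string or fragment, whichever comes first.
        int cut = url.length();
        for (char c : new char[] {'?', '#'}) {
            int i = url.indexOf(c, start);
            if (i >= 0) cut = Math.min(cut, i);
        }

        String head = url.substring(0, start);
        String body = url.substring(start, cut).replaceAll("/{2,}", "/");
        return head + body + url.substring(cut);
    }

    public static void main(String[] args) {
        System.out.println(collapse("http://host//a///b.html?x=//y"));
    }
}
```

As with lower-casing, applying a mapping like this where the document URI is constructed, rather than in the output connector, keeps the identifier the framework tracks and the identifier in the index the same.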
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Steph
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we
>>>>>>>>>> may need to upgrade it.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Olivier
>>>>>>>>>>> By all means.
>>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>>>> which has to be restarted about once a week. Still trying to find
>>>>>>>>>>> the issue.
>>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with
>>>>>>>>>>> Postgres10 may be a bit slower. I have no empirical evidence at the
>>>>>>>>>>> moment as I'm still delivering the project to UAT. Will keep you posted.
>>>>>>>>>>> Regards,
>>>>>>>>>>> Steph
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry
>>>>>>>>>>>> for the late answer). I will test it soon.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 23 Aug 2018, at 19:20, Steph van Schalkwyk <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>
>>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>>>   - postgresql10-libs
>>>>>>>>>>>>   - postgresql10
>>>>>>>>>>>>   - postgresql10-server
>>>>>>>>>>>>   - postgresql10-contrib
>>>>>>>>>>>>   # - postgresql10-devel
>>>>>>>>>>>>
>>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>>
>>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>>>   - option: unix_socket_directories
>>>>>>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: standard_conforming_strings
>>>>>>>>>>>>     value: 'on'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: shared_buffers
>>>>>>>>>>>>     value: '1024MB'
>>>>>>>>>>>>
>>>>>>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>>>   # checkpoint_segments=300
>>>>>>>>>>>>   - option: max_wal_size
>>>>>>>>>>>>     value: '14400MB'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: min_wal_size
>>>>>>>>>>>>     value: '80MB'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: maintenance_work_mem
>>>>>>>>>>>>     value: '2MB'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: listen_addresses
>>>>>>>>>>>>     value: '*'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: max_connections
>>>>>>>>>>>>     value: '400'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: checkpoint_timeout
>>>>>>>>>>>>     value: '900'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: datestyle
>>>>>>>>>>>>     value: "iso, mdy"
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: autovacuum
>>>>>>>>>>>>     value: 'off'
>>>>>>>>>>>>
>>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy vacuum every night)
>>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>>   cron:
>>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>>     hour: 8
>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>>   cron:
>>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>     hour: 10
>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>>   cron:
>>>>>>>>>>>>     name: reindex
>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>     hour: 12
>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
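The `max_wal_size` value in the options above follows from the conversion formula quoted in the comment next to it. A small check of that arithmetic, with `WalSizeCheck` and `maxWalSizeMb` as hypothetical names used only for this illustration:

```java
/** Illustration of the commented formula; names are hypothetical. */
final class WalSizeCheck {

    /** max_wal_size = (3 * checkpoint_segments) * 16MB, returned in MB. */
    static long maxWalSizeMb(int checkpointSegments) {
        final int segmentSizeMb = 16; // segment size used in the quoted formula
        return 3L * checkpointSegments * segmentSizeMb;
    }

    public static void main(String[] args) {
        // checkpoint_segments=300, as stated in the config comment
        System.out.println(maxWalSizeMb(300) + "MB");
    }
}
```

With `checkpoint_segments=300` this gives 3 * 300 * 16 = 14400, matching the configured `'14400MB'`.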
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>>> Steph
