The patch I uploaded doesn't work because the entire tab is broken; looks like the UI refactoring broke it and it was never reported. Fixing now. Karl
On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <[email protected]> wrote: > I coded up the web connector feature I think we need. See > CONNECTORS-1528; I've attached a patch. Please apply and test it out to > see if it solves the case problem for your IIS site. > > For the "//" issue, can you be more specific about the mapping you need to > do? > > Karl > > > On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <[email protected]> wrote: > >> Hi Steph, >> >> Right, you wouldn't want to touch the framework. >> >> The effect of lower-casing the documentURI parameter in the >> addOrReplaceDocumentWithException method in an output connector would be to >> map multiple, independently-fetched, documents that differ only by the case >> of the URL together into one document in the index. The ManifoldCF >> assumption is that a document with a certain URI can be tracked in the >> index using exactly that URI. Mapping the URI to lower case would break >> that assumption so the framework would make the wrong decision in many >> cases. >> >> If you are picking up documents using the web connector, therefore, and >> you are getting duplicate documents because the document URLs are sloppy, >> it is therefore essential that INSTEAD of mapping the document URI to lower >> case in the output connector, you map to lower case in the repository >> connector. Otherwise the framework will not work right. >> >> There is a tab in the web connector that allows you to configure URL >> normalization, called "Canonicalization". This would be a very appropriate >> place to add URL mapping to lower case. It should be as simple as adding >> one more checkbox column in the table, and modifying the method that does >> the URL processing to include lower-casing. >> >> Karl >> >> >> >> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <[email protected]> >> wrote: >> >>> Unless I have a massive misunderstanding somewhere... 
>>> >>> >>> >>> >>> *Steph van Schalkwyk* >>> Principal, Remcam Search Engines >>> +1.314.452. <+1+314+452+2896>2896 [email protected] >>> http://remcam.net <http://www.remcam.net/> Skype: svanschalkwyk >>> <https://mail.google.com/mail/u/0/#> >>> <http://linkedin.com/in/vanschalkwyk> >>> >>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <[email protected]> >>> wrote: >>> >>>> Hi Karl >>>> I'm addressing it in the ES Output Connector. >>>> Not touching the framework :) >>>> S >>>> >>>> >>>> >>>> *Steph van Schalkwyk* >>>> Principal, Remcam Search Engines >>>> +1.314.452. <+1+314+452+2896>2896 [email protected] >>>> http://remcam.net <http://www.remcam.net/> Skype: svanschalkwyk >>>> <https://mail.google.com/mail/u/0/#> >>>> <http://linkedin.com/in/vanschalkwyk> >>>> >>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <[email protected]> wrote: >>>> >>>>> Let's make sure we're talking about the same thing. >>>>> >>>>> Here is the output connector method that receives the ID (as the >>>>> documentURI parameter): >>>>> >>>>> public int addOrReplaceDocumentWithException(String documentURI, >>>>> VersionContext pipelineDescription, RepositoryDocument document, String >>>>> authorityNameString, IOutputAddActivity activities) >>>>> throws ManifoldCFException, ServiceInterruption, IOException; >>>>> >>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive. If >>>>> you make it case insensitive in an output connector, this will potentially >>>>> break a lot of things, for example incremental indexing (which organizes >>>>> the last indexed version by document ID). >>>>> >>>>> I therefore highly recommend that any "sloppiness" in this parameter >>>>> be addressed in the Repository Connector that constructs it. If the >>>>> connector is crawling a repository that believes that URLs are case >>>>> insensitive then it should map these IDs to lower case. If not, then it >>>>> shouldn't. 
>>>>> >>>>> Karl >>>>> >>>>> >>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <[email protected]> >>>>> wrote: >>>>> >>>>>> Hi Karl. >>>>>> The issue is that the ES Output Connector uses the uri to create the >>>>>> _id. When used with IIS which allows case variation in the URI, it >>>>>> creates >>>>>> multiple documents. Clients on Windows IIS are rarely cognizant of that >>>>>> issue as IIS is so lax in policing that OTB. >>>>>> Currently, every case variation in URI results in a new doc in the >>>>>> index. This is only in the ES output connector. >>>>>> I can add an optional checkbox to determine that particular action >>>>>> if that would help? >>>>>> Regards, >>>>>> Steph >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> *Steph van Schalkwyk* >>>>>> Principal, Remcam Search Engines >>>>>> +1.314.452. <+1+314+452+2896>2896 [email protected] >>>>>> http://remcam.net <http://www.remcam.net/> Skype: svanschalkwyk >>>>>> <https://mail.google.com/mail/u/0/#> >>>>>> <http://linkedin.com/in/vanschalkwyk> >>>>>> >>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Thanks for the update. >>>>>>> Lower-casing the ID would be fine except there are some connectors >>>>>>> that care about case. The web connector is one such because it's up to >>>>>>> the >>>>>>> web service to decide if case matters, so the web connector does not >>>>>>> view >>>>>>> urls with case differences as being the same. Other connectors also >>>>>>> will >>>>>>> likely care as well. So I don't think lower-casing the document id is a >>>>>>> smart thing to do. >>>>>>> >>>>>>> You could add this bit of configuration to the web connector, if >>>>>>> that's what you are using, or to whatever other connector constructs >>>>>>> the ID. >>>>>>> >>>>>>> Karl >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Thanks Karl. >>>>>>>> >>>>>>>> I'll look into that. 
>>>>>>>> >>>>>>>> Another note: >>>>>>>> Regarding the ES connector - I have made three additions to it and >>>>>>>> should probably diff them for inclusion after approval: >>>>>>>> 1. lowercased _id (the doc URI). >>>>>>>> 2. Removed dual "/", e.g. "//" in the _id (I have sloppy sources, >>>>>>>> particularly IIS...) >>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does >>>>>>>> not allow access to _id in the schema anymore, so no copy_field etc. >>>>>>>> from >>>>>>>> _id). Hence "url". >>>>>>>> >>>>>>>> Regards, >>>>>>>> Steph >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> *Steph van Schalkwyk* >>>>>>>> Principal, Remcam Search Engines >>>>>>>> +1.314.452. <+1+314+452+2896>2896 [email protected] >>>>>>>> http://remcam.net <http://www.remcam.net/> Skype: svanschalkwyk >>>>>>>> <https://mail.google.com/mail/u/0/#> >>>>>>>> <http://linkedin.com/in/vanschalkwyk> >>>>>>>> >>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we >>>>>>>>> may need to upgrade it. >>>>>>>>> >>>>>>>>> Karl >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> Olivier >>>>>>>>>> By all means. >>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty, >>>>>>>>>> which has to be restarted about once a week. Still trying to find >>>>>>>>>> the issue. >>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres10 >>>>>>>>>> may be a bit slower. I have no empiric evidence at the moment as I'm >>>>>>>>>> still >>>>>>>>>> delivering the project to UAT. Will keep you posted. >>>>>>>>>> Regards, >>>>>>>>>> Steph >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> *Steph van Schalkwyk* >>>>>>>>>> Principal, Remcam Search Engines >>>>>>>>>> +1.314.452. 
<+1+314+452+2896>2896 [email protected] >>>>>>>>>> http://remcam.net <http://www.remcam.net/> Skype: svanschalkwyk >>>>>>>>>> <https://mail.google.com/mail/u/0/#> >>>>>>>>>> <http://linkedin.com/in/vanschalkwyk> >>>>>>>>>> >>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> Hello, >>>>>>>>>>> >>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry >>>>>>>>>>> for the late answer). I will test it soon. >>>>>>>>>>> >>>>>>>>>>> Best regards, >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Olivier TAVARD >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Le 23 août 2018 à 19:20, Steph van Schalkwyk <[email protected]> >>>>>>>>>>> a écrit : >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> These are the rpm installs: >>>>>>>>>>> - >>>>>>>>>>> file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm >>>>>>>>>>> - >>>>>>>>>>> file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm >>>>>>>>>>> - >>>>>>>>>>> file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm >>>>>>>>>>> - >>>>>>>>>>> file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm >>>>>>>>>>> - >>>>>>>>>>> file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm >>>>>>>>>>> >>>>>>>>>>> postgresql_version: 10 >>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data >>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin >>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data >>>>>>>>>>> postgresql_daemon: postgresql-10.service >>>>>>>>>>> postgresql_packages: >>>>>>>>>>> - postgresql10-libs >>>>>>>>>>> - postgresql10 >>>>>>>>>>> - postgresql10-server >>>>>>>>>>> - postgresql10-contrib >>>>>>>>>>> # - postgresql10-devel >>>>>>>>>>> >>>>>>>>>>> postgresql_hba_entries: >>>>>>>>>>> - { type: local, database: all, user: postgres, auth_method: >>>>>>>>>>> peer } >>>>>>>>>>> - { type: local, database: all, user: all, auth_method: peer } >>>>>>>>>>> - { type: host, database: all, user: all, address: 
'127.0.0.1/32 >>>>>>>>>>> ', auth_method: md5 } >>>>>>>>>>> - { type: host, database: all, user: all, address: '::1/128', >>>>>>>>>>> auth_method: md5 } >>>>>>>>>>> - { type: host, database: all, user: all, address: '0.0.0.0/0', >>>>>>>>>>> auth_method: md5 } >>>>>>>>>>> - { type: host, database: all, user: all, address: '::0/0', >>>>>>>>>>> auth_method: md5 } >>>>>>>>>>> >>>>>>>>>>> postgresql_global_config_options: >>>>>>>>>>> - option: unix_socket_directories >>>>>>>>>>> value: '{{ postgresql_unix_socket_directories | join(",") }}' >>>>>>>>>>> >>>>>>>>>>> - option: standard_conforming_strings >>>>>>>>>>> value: 'on' >>>>>>>>>>> >>>>>>>>>>> - option: shared_buffers >>>>>>>>>>> value: '1024MB' >>>>>>>>>>> >>>>>>>>>>> # max_wal_size = (3 * checkpoint_segments) * 16MB >>>>>>>>>>> # checkpoint_segments=300 >>>>>>>>>>> - option: max_wal_size >>>>>>>>>>> value: '14400MB' >>>>>>>>>>> >>>>>>>>>>> - option: min_wal_size >>>>>>>>>>> value: '80MB' >>>>>>>>>>> >>>>>>>>>>> - option: maintenance_work_mem >>>>>>>>>>> value: '2MB' >>>>>>>>>>> >>>>>>>>>>> - option: listen_addresses >>>>>>>>>>> value: '*' >>>>>>>>>>> >>>>>>>>>>> - option: max_connections >>>>>>>>>>> value: '400' >>>>>>>>>>> >>>>>>>>>>> - option: checkpoint_timeout >>>>>>>>>>> value: '900' >>>>>>>>>>> >>>>>>>>>>> - option: datestyle >>>>>>>>>>> value: "iso, mdy" >>>>>>>>>>> >>>>>>>>>>> - option: autovacuum >>>>>>>>>>> value: 'off' >>>>>>>>>>> >>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night, >>>>>>>>>>> lazy vacuum every night) >>>>>>>>>>> - name: add postgresql cron lazy vacuum >>>>>>>>>>> cron: >>>>>>>>>>> name: lazy_vacuum >>>>>>>>>>> hour: 8 >>>>>>>>>>> minute: 0 >>>>>>>>>>> job: "su - postgres -c 'vacuumdb --all --analyze --quiet'" >>>>>>>>>>> - name: add postgresql cron full vacuum >>>>>>>>>>> cron: >>>>>>>>>>> name: full_vacuum >>>>>>>>>>> weekday: 0 >>>>>>>>>>> hour: 10 >>>>>>>>>>> minute: 0 >>>>>>>>>>> job: "su - postgres -c 'vacuumdb --all --full --analyze >>>>>>>>>>> --quiet'" 
>>>>>>>>>>> # re-index all databases once a week >>>>>>>>>>> - name: add postgresql cron reindex >>>>>>>>>>> cron: >>>>>>>>>>> name: reindex >>>>>>>>>>> weekday: 0 >>>>>>>>>>> hour: 12 >>>>>>>>>>> minute: 0 >>>>>>>>>>> job: "su - postgres -c 'psql -t -c \"select datname from >>>>>>>>>>> pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U >>>>>>>>>>> postgres >>>>>>>>>>> {} -c \"reindex database {};\"' " >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> This is how I run 2.10. >>>>>>>>>>> Been running fine for some weeks without user intervention. >>>>>>>>>>> @Karl: Any comments please? >>>>>>>>>>> Steph >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>> >>>> >>>
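
For reference, the URL mapping discussed in this thread (lower-casing, and collapsing "//" in the path, done on the repository connector side before the document URI is formed) might look roughly like the sketch below. This is only an illustration: the class UrlCaseNormalizer and its normalize method are hypothetical names, not part of ManifoldCF or of the CONNECTORS-1528 patch. It assumes the query string should be left alone, that the fragment can be dropped, and that the URL carries no user-info in its authority.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
public final class UrlCaseNormalizer {
    private UrlCaseNormalizer() {}

    /**
     * Returns a normalized form of an absolute, hierarchical URL.
     * Relative or opaque URIs (no scheme or no authority) are returned unchanged.
     */
    public static String normalize(String url, boolean lowerCase, boolean collapseSlashes) {
        // URI.create throws IllegalArgumentException on malformed input.
        URI uri = URI.create(url);
        String scheme = uri.getScheme();
        String authority = uri.getRawAuthority();
        if (scheme == null || authority == null) {
            return url;
        }
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();

        if (collapseSlashes) {
            // Collapse any run of two or more slashes in the path into one.
            path = path.replaceAll("/{2,}", "/");
        }
        if (lowerCase) {
            // Assumption: the site (e.g. IIS) treats paths as case-insensitive.
            // Lower-casing the authority would also lower-case user-info,
            // which this sketch assumes is absent.
            scheme = scheme.toLowerCase(Locale.ROOT);
            authority = authority.toLowerCase(Locale.ROOT);
            path = path.toLowerCase(Locale.ROOT);
        }

        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(authority).append(path);
        // Assumption: the query string is kept as-is; the fragment is dropped.
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }
}
```

With both options enabled, "http://Example.com/Docs//Page.ASPX?Id=5" and "http://example.com/docs/page.aspx?Id=5" map to the same string, so the framework tracks a single document URI instead of one per case variation.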
