Thank you, Karl. You are of course correct: the incremental crawl is now broken, in that it does a full crawl every time. I'll jump on the Web Connector and add that functionality. Thanks for this excellent application and all the help over the years.
Steph
*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896 [email protected]
http://remcam.net <http://www.remcam.net/> Skype: svanschalkwyk
<http://linkedin.com/in/vanschalkwyk>

On Wed, Sep 5, 2018 at 6:33 AM, Karl Wright <[email protected]> wrote:

> The patch I uploaded doesn't work because the entire tab is broken; it looks
> like the UI refactoring broke it and it was never reported. Fixing now.
> Karl
>
>
> On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <[email protected]> wrote:
>
>> I coded up the web connector feature I think we need. See
>> CONNECTORS-1528; I've attached a patch. Please apply and test it out to
>> see if it solves the case problem for your IIS site.
>>
>> For the "//" issue, can you be more specific about the mapping you need
>> to do?
>>
>> Karl
>>
>>
>> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <[email protected]> wrote:
>>
>>> Hi Steph,
>>>
>>> Right, you wouldn't want to touch the framework.
>>>
>>> The effect of lower-casing the documentURI parameter in the
>>> addOrReplaceDocumentWithException method in an output connector would
>>> be to map multiple, independently-fetched documents that differ only by
>>> the case of the URL together into one document in the index. The
>>> ManifoldCF assumption is that a document with a certain URI can be tracked
>>> in the index using exactly that URI. Mapping the URI to lower case would
>>> break that assumption, so the framework would make the wrong decision in
>>> many cases.
>>>
>>> If you are picking up documents using the web connector, and
>>> you are getting duplicate documents because the document URLs are sloppy,
>>> it is therefore essential that INSTEAD of mapping the document URI to lower
>>> case in the output connector, you map to lower case in the repository
>>> connector. Otherwise the framework will not work right.
>>>
>>> There is a tab in the web connector that allows you to configure URL
>>> normalization, called "Canonicalization". This would be a very appropriate
>>> place to add URL mapping to lower case. It should be as simple as adding
>>> one more checkbox column in the table, and modifying the method that does
>>> the URL processing to include lower-casing.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <[email protected]>
>>> wrote:
>>>
>>>> Unless I have a massive misunderstanding somewhere...
>>>>
>>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Karl
>>>>> I'm addressing it in the ES Output Connector.
>>>>> Not touching the framework :)
>>>>> S
>>>>>
>>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Let's make sure we're talking about the same thing.
>>>>>>
>>>>>> Here is the output connector method that receives the ID (as the
>>>>>> documentURI parameter):
>>>>>>
>>>>>> public int addOrReplaceDocumentWithException(String documentURI,
>>>>>>     VersionContext pipelineDescription, RepositoryDocument document,
>>>>>>     String authorityNameString, IOutputAddActivity activities)
>>>>>>   throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>>
>>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.
>>>>>> If you make it case insensitive in an output connector, this will
>>>>>> potentially break a lot of things, for example incremental indexing
>>>>>> (which organizes the last indexed version by document ID).
>>>>>>
>>>>>> I therefore highly recommend that any "sloppiness" in this parameter
>>>>>> be addressed in the Repository Connector that constructs it. If the
>>>>>> connector is crawling a repository that believes that URLs are case
>>>>>> insensitive then it should map these IDs to lower case. If not, then it
>>>>>> shouldn't.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl.
>>>>>>> The issue is that the ES Output Connector uses the URI to create the
>>>>>>> _id. When used with IIS, which allows case variation in the URI, it
>>>>>>> creates multiple documents. Clients on Windows IIS are rarely cognizant
>>>>>>> of that issue as IIS is so lax in policing that OTB.
>>>>>>> Currently, every case variation in URI results in a new doc in the
>>>>>>> index. This is only in the ES output connector.
>>>>>>> I can add an optional checkbox to determine that particular
>>>>>>> action if that would help?
>>>>>>> Regards,
>>>>>>> Steph
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the update.
>>>>>>>> Lower-casing the ID would be fine except there are some connectors
>>>>>>>> that care about case.
>>>>>>>> The web connector is one such because it's up to the web service
>>>>>>>> to decide if case matters, so the web connector does not view
>>>>>>>> URLs with case differences as being the same. Other connectors
>>>>>>>> will likely care as well. So I don't think lower-casing the
>>>>>>>> document ID is a smart thing to do.
>>>>>>>>
>>>>>>>> You could add this bit of configuration to the web connector, if
>>>>>>>> that's what you are using, or to whatever other connector constructs
>>>>>>>> the ID.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks Karl.
>>>>>>>>>
>>>>>>>>> I'll look into that.
>>>>>>>>>
>>>>>>>>> Another note:
>>>>>>>>> Regarding the ES connector - I have made three changes to it and
>>>>>>>>> should probably diff them for inclusion after approval:
>>>>>>>>> 1. Lowercased the _id (the doc URI).
>>>>>>>>> 2. Removed dual "/", e.g. "//", in the _id (I have sloppy sources,
>>>>>>>>> particularly IIS...)
>>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x
>>>>>>>>> does not allow access to _id in the schema anymore, so no
>>>>>>>>> copy_field etc. from _id). Hence "url".
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Steph
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we
>>>>>>>>>> may need to upgrade it.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Olivier
>>>>>>>>>>> By all means.
>>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>>>> which has to be restarted about once a week. Still trying to find
>>>>>>>>>>> the issue.
>>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with
>>>>>>>>>>> Postgres10 may be a bit slower. I have no empirical evidence at
>>>>>>>>>>> the moment as I'm still delivering the project to UAT. Will keep
>>>>>>>>>>> you posted.
>>>>>>>>>>> Regards,
>>>>>>>>>>> Steph
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry
>>>>>>>>>>>> for the late answer). I will test it soon.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Aug 23, 2018, at 19:20, Steph van Schalkwyk <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>
>>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>>>   - postgresql10-libs
>>>>>>>>>>>>   - postgresql10
>>>>>>>>>>>>   - postgresql10-server
>>>>>>>>>>>>   - postgresql10-contrib
>>>>>>>>>>>>   # - postgresql10-devel
>>>>>>>>>>>>
>>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>>
>>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>>>   - option: unix_socket_directories
>>>>>>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: standard_conforming_strings
>>>>>>>>>>>>     value: 'on'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: shared_buffers
>>>>>>>>>>>>     value: '1024MB'
>>>>>>>>>>>>
>>>>>>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>>>   # checkpoint_segments=300
>>>>>>>>>>>>   - option: max_wal_size
>>>>>>>>>>>>     value: '14400MB'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: min_wal_size
>>>>>>>>>>>>     value: '80MB'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: maintenance_work_mem
>>>>>>>>>>>>     value: '2MB'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: listen_addresses
>>>>>>>>>>>>     value: '*'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: max_connections
>>>>>>>>>>>>     value: '400'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: checkpoint_timeout
>>>>>>>>>>>>     value: '900'
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: datestyle
>>>>>>>>>>>>     value: "iso, mdy"
>>>>>>>>>>>>
>>>>>>>>>>>>   - option: autovacuum
>>>>>>>>>>>>     value: 'off'
>>>>>>>>>>>>
>>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday
>>>>>>>>>>>> # night, lazy vacuum every night)
>>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>>   cron:
>>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>>     hour: 8
>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>>   cron:
>>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>     hour: 10
>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>>   cron:
>>>>>>>>>>>>     name: reindex
>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>     hour: 12
>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>>> Steph
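
For anyone following the thread, here is a rough sketch of the kind of URL mapping discussed above (lower-casing and collapsing "//"), done before the document ID is handed to the framework rather than in the output connector. It is plain Java and is not the ManifoldCF web connector code or the CONNECTORS-1528 patch; the class name UrlCaseNormalizer, the method normalize and both flags are hypothetical. It also assumes that only the scheme, authority and path should be lower-cased and that the query string is left alone, which may not suit every site.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
final class UrlCaseNormalizer {

  /**
   * Returns a normalized form of the URL.
   * lowerCase: lower-case scheme, authority and path (assumption: the server,
   *            e.g. IIS, treats these as case insensitive).
   * collapseSlashes: replace runs of "/" in the path with a single "/".
   * The query string and fragment are passed through unchanged.
   */
  static String normalize(String url, boolean lowerCase, boolean collapseSlashes) {
    URI uri = URI.create(url);  // throws IllegalArgumentException on bad input
    if (uri.isOpaque()) {
      return url;               // e.g. mailto: links have no path to normalize
    }
    String scheme = uri.getScheme();
    String authority = uri.getRawAuthority();
    String path = uri.getRawPath();
    if (collapseSlashes) {
      path = path.replaceAll("/{2,}", "/");
    }
    if (lowerCase) {
      if (scheme != null) scheme = scheme.toLowerCase(Locale.ROOT);
      if (authority != null) authority = authority.toLowerCase(Locale.ROOT);
      path = path.toLowerCase(Locale.ROOT);
    }
    // Rebuild by hand so that existing percent-escapes are not re-encoded.
    StringBuilder sb = new StringBuilder();
    if (scheme != null) sb.append(scheme).append(':');
    if (authority != null) sb.append("//").append(authority);
    sb.append(path);
    if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
    if (uri.getRawFragment() != null) sb.append('#').append(uri.getRawFragment());
    return sb.toString();
  }

  public static void main(String[] args) {
    // Two case variants of the same IIS page end up with one document ID.
    System.out.println(normalize("http://Example.com/Docs//Page.ASPX?Id=A", true, true));
    System.out.println(normalize("http://example.com/docs/page.aspx?Id=A", true, true));
  }
}
```

Because the mapping happens where the ID is constructed, the framework and the index see the same URI for every case variant, which is what incremental indexing relies on.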
