Thank you. So I'll stop for now? Steph
*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896
[email protected]
http://remcam.net
Skype: svanschalkwyk
http://linkedin.com/in/vanschalkwyk

On Wed, Sep 5, 2018 at 11:05 AM, Karl Wright <[email protected]> wrote:

> I'm already working on the Web Connector. The UI has problems that
> predate this change and I've alerted Kishore about them -- he'll look into
> them later today.
>
> Karl
>
>
> On Wed, Sep 5, 2018 at 11:55 AM Steph van Schalkwyk <[email protected]>
> wrote:
>
>> Thank you Karl.
>> You are of course correct in that the incremental crawl is now broken in
>> that it does a full crawl every time.
>> I'll jump on the Web Connector and add that functionality.
>> Thanks for this excellent application and all the help over the years.
>> Steph
>>
>> On Wed, Sep 5, 2018 at 6:33 AM, Karl Wright <[email protected]> wrote:
>>
>>> The patch I uploaded doesn't work because the entire tab is broken;
>>> looks like the UI refactoring broke it and it was never reported. Fixing
>>> now.
>>> Karl
>>>
>>>
>>> On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <[email protected]> wrote:
>>>
>>>> I coded up the web connector feature I think we need. See
>>>> CONNECTORS-1528; I've attached a patch. Please apply and test it out to
>>>> see if it solves the case problem for your IIS site.
>>>>
>>>> For the "//" issue, can you be more specific about the mapping you need
>>>> to do?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Steph,
>>>>>
>>>>> Right, you wouldn't want to touch the framework.
>>>>>
>>>>> The effect of lower-casing the documentURI parameter in the
>>>>> addOrReplaceDocumentWithException method in an output connector would
>>>>> be to map multiple, independently-fetched documents that differ only by
>>>>> the case of the URL together into one document in the index. The
>>>>> ManifoldCF assumption is that a document with a certain URI can be tracked
>>>>> in the index using exactly that URI. Mapping the URI to lower case would
>>>>> break that assumption, so the framework would make the wrong decision in
>>>>> many cases.
>>>>>
>>>>> If you are picking up documents using the web connector
>>>>> and you are getting duplicate documents because the document URLs are
>>>>> sloppy, it is therefore essential that INSTEAD of mapping the document URI
>>>>> to lower case in the output connector, you map to lower case in the
>>>>> repository connector. Otherwise the framework will not work right.
>>>>>
>>>>> There is a tab in the web connector that allows you to configure URL
>>>>> normalization, called "Canonicalization". This would be a very appropriate
>>>>> place to add URL mapping to lower case. It should be as simple as adding
>>>>> one more checkbox column in the table, and modifying the method that does
>>>>> the URL processing to include lower-casing.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Unless I have a massive misunderstanding somewhere...
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl
>>>>>>> I'm addressing it in the ES Output Connector.
>>>>>>> Not touching the framework :)
>>>>>>> S
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Let's make sure we're talking about the same thing.
>>>>>>>>
>>>>>>>> Here is the output connector method that receives the ID (as the
>>>>>>>> documentURI parameter):
>>>>>>>>
>>>>>>>> public int addOrReplaceDocumentWithException(String documentURI,
>>>>>>>> VersionContext pipelineDescription, RepositoryDocument document, String
>>>>>>>> authorityNameString, IOutputAddActivity activities)
>>>>>>>> throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>>>>
>>>>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.
>>>>>>>> If you make it case insensitive in an output connector, this will
>>>>>>>> potentially break a lot of things, for example incremental indexing
>>>>>>>> (which organizes the last indexed version by document ID).
>>>>>>>>
>>>>>>>> I therefore highly recommend that any "sloppiness" in this
>>>>>>>> parameter be addressed in the Repository Connector that constructs it.
>>>>>>>> If the connector is crawling a repository that believes that URLs are case
>>>>>>>> insensitive then it should map these IDs to lower case. If not, then it
>>>>>>>> shouldn't.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl.
>>>>>>>>> The issue is that the ES Output Connector uses the URI to create
>>>>>>>>> the _id. When used with IIS, which allows case variation in the URI, it
>>>>>>>>> creates multiple documents. Clients on Windows IIS are rarely cognizant of
>>>>>>>>> that issue as IIS is so lax in policing that OTB.
>>>>>>>>> Currently, every case variation in URI results in a new doc in the
>>>>>>>>> index. This is only in the ES output connector.
>>>>>>>>> I can add an optional checkbox to determine that particular
>>>>>>>>> action if that would help?
>>>>>>>>> Regards,
>>>>>>>>> Steph
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the update.
>>>>>>>>>> Lower-casing the ID would be fine except there are some
>>>>>>>>>> connectors that care about case. The web connector is one such because
>>>>>>>>>> it's up to the web service to decide if case matters, so the web connector
>>>>>>>>>> does not view URLs with case differences as being the same. Other
>>>>>>>>>> connectors will likely care as well. So I don't think lower-casing the
>>>>>>>>>> document ID is a smart thing to do.
>>>>>>>>>>
>>>>>>>>>> You could add this bit of configuration to the web connector, if
>>>>>>>>>> that's what you are using, or to whatever other connector constructs
>>>>>>>>>> the ID.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Karl.
>>>>>>>>>>>
>>>>>>>>>>> I'll look into that.
>>>>>>>>>>>
>>>>>>>>>>> Another note:
>>>>>>>>>>> Regarding the ES connector - I have made three additions to it and
>>>>>>>>>>> should probably diff them for inclusion after approval:
>>>>>>>>>>> 1. Lowercased _id (the doc URI).
>>>>>>>>>>> 2. Removed dual "/", e.g. "//", in the _id (I have sloppy
>>>>>>>>>>> sources, particularly IIS...)
>>>>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x
>>>>>>>>>>> does not allow access to _id in the schema anymore, so no copy_field etc.
>>>>>>>>>>> from _id). Hence "url".
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Steph
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we
>>>>>>>>>>>> may need to upgrade it.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Olivier
>>>>>>>>>>>>> By all means.
>>>>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>>>>>> which has to be restarted about once a week. Still trying to find
>>>>>>>>>>>>> the issue.
>>>>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with
>>>>>>>>>>>>> Postgres10 may be a bit slower.
>>>>>>>>>>>>> I have no empirical evidence at the moment as
>>>>>>>>>>>>> I'm still delivering the project to UAT. Will keep you posted.
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Steph
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry
>>>>>>>>>>>>>> for the late answer). I will test it soon.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>>>>>   - postgresql10-libs
>>>>>>>>>>>>>>   - postgresql10
>>>>>>>>>>>>>>   - postgresql10-server
>>>>>>>>>>>>>>   - postgresql10-contrib
>>>>>>>>>>>>>> # - postgresql10-devel
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>>>>>   - option: unix_socket_directories
>>>>>>>>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   - option: standard_conforming_strings
>>>>>>>>>>>>>>     value: 'on'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   - option: shared_buffers
>>>>>>>>>>>>>>     value: '1024MB'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>>>>>   # checkpoint_segments=300
>>>>>>>>>>>>>>   - option: max_wal_size
>>>>>>>>>>>>>>     value: '14400MB'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   - option: min_wal_size
>>>>>>>>>>>>>>     value: '80MB'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   - option: maintenance_work_mem
>>>>>>>>>>>>>>     value: '2MB'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   - option: listen_addresses
>>>>>>>>>>>>>>     value: '*'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   - option: max_connections
>>>>>>>>>>>>>>     value: '400'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   - option: checkpoint_timeout
>>>>>>>>>>>>>>     value: '900'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   - option: datestyle
>>>>>>>>>>>>>>     value: "iso, mdy"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   - option: autovacuum
>>>>>>>>>>>>>>     value: 'off'
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy vacuum every night)
>>>>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>>>>     hour: 8
>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>>     hour: 10
>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>     name: reindex
>>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>>     hour: 12
>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>>>>> Steph
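
On the lower-casing and "//" points discussed earlier in this thread: Karl's recommendation is to normalize the URL where the document identifier is constructed (the repository connector) rather than in the output connector. A minimal sketch of such a normalization step in plain Java might look like the following. The class name UrlNormalizer, the method normalize, and its two flags are hypothetical illustrations; they are not part of the ManifoldCF API or of the CONNECTORS-1528 patch. Lower-casing the whole URL, query string included, is an assumption that only suits a server that treats URLs as case insensitive, such as the IIS site described.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not ManifoldCF code: normalizes a URL before it is
// used as a document identifier.
final class UrlNormalizer {

    private UrlNormalizer() {
    }

    /**
     * @param url             absolute URL as discovered by the crawler
     * @param lowerCase       map the whole URL to lower case (assumes a
     *                        case-insensitive server)
     * @param collapseSlashes replace runs of "/" in the path with a single "/"
     * @return the normalized URL; any fragment is dropped
     */
    static String normalize(String url, boolean lowerCase, boolean collapseSlashes) {
        URI uri = URI.create(url);

        // Opaque URIs (for example mailto:) have no path to clean up.
        if (uri.isOpaque()) {
            return lowerCase ? url.toLowerCase(Locale.ROOT) : url;
        }

        String path = uri.getRawPath();
        if (path != null && collapseSlashes) {
            // Only the path is touched, so the "//" after the scheme survives.
            path = path.replaceAll("/{2,}", "/");
        }

        StringBuilder sb = new StringBuilder();
        if (uri.getScheme() != null) {
            sb.append(uri.getScheme()).append(':');
        }
        if (uri.getRawAuthority() != null) {
            sb.append("//").append(uri.getRawAuthority());
        }
        if (path != null) {
            sb.append(path);
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }

        String result = sb.toString();
        return lowerCase ? result.toLowerCase(Locale.ROOT) : result;
    }
}
```

With both flags set, normalize("http://Example.com/Docs//Page.ASPX?Id=1#top", true, true) yields http://example.com/docs/page.aspx?id=1. A stricter variant could leave the query string untouched, since query parameters are often case sensitive even when paths are not.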
