I'm already working on the Web Connector. The UI has problems that predate this change and I've alerted Kishore about them -- he'll look into them later today.
Karl

On Wed, Sep 5, 2018 at 11:55 AM Steph van Schalkwyk <[email protected]> wrote:

> Thank you Karl.
> You are of course correct in that the incremental crawl is now broken in
> that it does a full crawl every time.
> I'll jump on the Web Connector and add that functionality.
> Thanks for this excellent application and all the help over the years.
> Steph
>
> *Steph van Schalkwyk*
> Principal, Remcam Search Engines
> +1.314.452. <+1+314+452+2896>2896 [email protected] http://remcam.net
> <http://www.remcam.net/> Skype: svanschalkwyk
> <https://mail.google.com/mail/u/0/#>
> <http://linkedin.com/in/vanschalkwyk>
>
> On Wed, Sep 5, 2018 at 6:33 AM, Karl Wright <[email protected]> wrote:
>
>> The patch I uploaded doesn't work because the entire tab is broken; looks
>> like the UI refactoring broke it and it was never reported. Fixing now.
>> Karl
>>
>> On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <[email protected]> wrote:
>>
>>> I coded up the web connector feature I think we need. See
>>> CONNECTORS-1528; I've attached a patch. Please apply and test it out to
>>> see if it solves the case problem for your IIS site.
>>>
>>> For the "//" issue, can you be more specific about the mapping you need
>>> to do?
>>>
>>> Karl
>>>
>>> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Steph,
>>>>
>>>> Right, you wouldn't want to touch the framework.
>>>>
>>>> The effect of lower-casing the documentURI parameter in the
>>>> addOrReplaceDocumentWithException method in an output connector would be
>>>> to map multiple, independently-fetched documents that differ only by the
>>>> case of the URL together into one document in the index. The ManifoldCF
>>>> assumption is that a document with a certain URI can be tracked in the
>>>> index using exactly that URI. Mapping the URI to lower case would break
>>>> that assumption so the framework would make the wrong decision in many
>>>> cases.
>>>>
>>>> If you are picking up documents using the web connector, and you are
>>>> getting duplicate documents because the document URLs are sloppy, it is
>>>> therefore essential that INSTEAD of mapping the document URI to lower
>>>> case in the output connector, you map to lower case in the repository
>>>> connector. Otherwise the framework will not work right.
>>>>
>>>> There is a tab in the web connector that allows you to configure URL
>>>> normalization, called "Canonicalization". This would be a very
>>>> appropriate place to add URL mapping to lower case. It should be as
>>>> simple as adding one more checkbox column in the table, and modifying the
>>>> method that does the URL processing to include lower-casing.
>>>>
>>>> Karl
>>>>
>>>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <[email protected]> wrote:
>>>>
>>>>> Unless I have a massive misunderstanding somewhere...
>>>>>
>>>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <[email protected]> wrote:
>>>>>
>>>>>> Hi Karl
>>>>>> I'm addressing it in the ES Output Connector.
>>>>>> Not touching the framework :)
>>>>>> S
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Let's make sure we're talking about the same thing.
>>>>>>>
>>>>>>> Here is the output connector method that receives the ID (as the
>>>>>>> documentURI parameter):
>>>>>>>
>>>>>>> public int addOrReplaceDocumentWithException(String documentURI,
>>>>>>>     VersionContext pipelineDescription, RepositoryDocument document,
>>>>>>>     String authorityNameString, IOutputAddActivity activities)
>>>>>>>   throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>>>
>>>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.
>>>>>>> If you make it case insensitive in an output connector, this will
>>>>>>> potentially break a lot of things, for example incremental indexing
>>>>>>> (which organizes the last indexed version by document ID).
>>>>>>>
>>>>>>> I therefore highly recommend that any "sloppiness" in this parameter
>>>>>>> be addressed in the Repository Connector that constructs it. If the
>>>>>>> connector is crawling a repository that believes that URLs are case
>>>>>>> insensitive then it should map these IDs to lower case. If not, then
>>>>>>> it shouldn't.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Karl.
>>>>>>>> The issue is that the ES Output Connector uses the URI to create
>>>>>>>> the _id. When used with IIS, which allows case variation in the URI,
>>>>>>>> it creates multiple documents. Clients on Windows IIS are rarely
>>>>>>>> cognizant of that issue as IIS is so lax in policing that OTB.
>>>>>>>> Currently, every case variation in URI results in a new doc in the
>>>>>>>> index. This is only in the ES output connector.
>>>>>>>> I can add an optional checkbox to determine that particular
>>>>>>>> action if that would help?
>>>>>>>> Regards,
>>>>>>>> Steph
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the update.
>>>>>>>>> Lower-casing the ID would be fine except there are some connectors
>>>>>>>>> that care about case. The web connector is one such because it's up
>>>>>>>>> to the web service to decide if case matters, so the web connector
>>>>>>>>> does not view urls with case differences as being the same. Other
>>>>>>>>> connectors will likely care as well. So I don't think lower-casing
>>>>>>>>> the document id is a smart thing to do.
>>>>>>>>>
>>>>>>>>> You could add this bit of configuration to the web connector, if
>>>>>>>>> that's what you are using, or to whatever other connector constructs
>>>>>>>>> the ID.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Karl.
>>>>>>>>>>
>>>>>>>>>> I'll look into that.
>>>>>>>>>>
>>>>>>>>>> Another note:
>>>>>>>>>> Regarding the ES connector - I have made three additions to it and
>>>>>>>>>> should probably diff them for inclusion after approval:
>>>>>>>>>> 1. lowercased _id (the doc URI).
>>>>>>>>>> 2. Removed dual "/", e.g. "//", in the _id (I have sloppy
>>>>>>>>>> sources, particularly IIS...)
>>>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x
>>>>>>>>>> does not allow access to _id in the schema anymore, so no
>>>>>>>>>> copy_field etc. from _id). Hence "url".
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Steph
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we
>>>>>>>>>>> may need to upgrade it.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Olivier,
>>>>>>>>>>>> By all means.
>>>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>>>>> which has to be restarted about once a week. Still trying to find
>>>>>>>>>>>> the issue.
>>>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with
>>>>>>>>>>>> Postgres10 may be a bit slower. I have no empirical evidence at
>>>>>>>>>>>> the moment as I'm still delivering the project to UAT. Will keep
>>>>>>>>>>>> you posted.
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Steph
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry
>>>>>>>>>>>>> for the late answer). I will test it soon.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>
>>>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>>>>   - postgresql10-libs
>>>>>>>>>>>>>   - postgresql10
>>>>>>>>>>>>>   - postgresql10-server
>>>>>>>>>>>>>   - postgresql10-contrib
>>>>>>>>>>>>> # - postgresql10-devel
>>>>>>>>>>>>>
>>>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>>>
>>>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>>>>   - option: unix_socket_directories
>>>>>>>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: standard_conforming_strings
>>>>>>>>>>>>>     value: 'on'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: shared_buffers
>>>>>>>>>>>>>     value: '1024MB'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>>>>   # checkpoint_segments=300
>>>>>>>>>>>>>   - option: max_wal_size
>>>>>>>>>>>>>     value: '14400MB'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: min_wal_size
>>>>>>>>>>>>>     value: '80MB'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: maintenance_work_mem
>>>>>>>>>>>>>     value: '2MB'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: listen_addresses
>>>>>>>>>>>>>     value: '*'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: max_connections
>>>>>>>>>>>>>     value: '400'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: checkpoint_timeout
>>>>>>>>>>>>>     value: '900'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: datestyle
>>>>>>>>>>>>>     value: "iso, mdy"
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: autovacuum
>>>>>>>>>>>>>     value: 'off'
>>>>>>>>>>>>>
>>>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy vacuum every night)
>>>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>>>     hour: 8
>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>     hour: 10
>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>     name: reindex
>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>     hour: 12
>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>>>> Steph
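
For reference, here is a minimal sketch of the kind of repository-side URL normalization discussed in this thread: lower-casing and collapsing of "//" applied before the URL becomes a document identifier, rather than in the output connector. The class name UrlCanonicalizer, the method canonicalize and both flags are hypothetical; this is not the ManifoldCF API and not the CONNECTORS-1528 patch. Leaving the query string untouched and dropping the fragment are assumptions as well.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of ManifoldCF.
final class UrlCanonicalizer {
    private UrlCanonicalizer() {}

    /**
     * Rebuilds a URL with optional normalization steps.
     * lowerCase: fold scheme, authority and path to lower case. Only
     *   appropriate when the crawled server ignores case (as IIS does).
     * collapseSlashes: replace runs of '/' in the path with a single '/'.
     * The query string is kept as-is and any fragment is dropped
     * (both are assumptions of this sketch).
     */
    static String canonicalize(String url, boolean lowerCase, boolean collapseSlashes) {
        URI uri = URI.create(url);
        if (uri.isOpaque()) {
            // e.g. mailto: links have no path to normalize
            return url;
        }
        String scheme = uri.getScheme();
        String authority = uri.getRawAuthority();
        String path = uri.getRawPath();
        String query = uri.getRawQuery();

        if (path != null && collapseSlashes) {
            path = path.replaceAll("/{2,}", "/");
        }
        if (lowerCase) {
            if (scheme != null) scheme = scheme.toLowerCase(Locale.ROOT);
            // Note: this also lower-cases any user-info in the authority.
            if (authority != null) authority = authority.toLowerCase(Locale.ROOT);
            if (path != null) path = path.toLowerCase(Locale.ROOT);
        }

        StringBuilder out = new StringBuilder();
        if (scheme != null) out.append(scheme).append(':');
        if (authority != null) out.append("//").append(authority);
        if (path != null) out.append(path);
        if (query != null) out.append('?').append(query);
        return out.toString();
    }
}
```

Because the identifier is normalized once, where it is constructed, every later stage (including incremental indexing) sees the same string for the same document, which is the property the output-connector approach would break.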
