CONNECTORS-1313 Karl
On Fri, May 6, 2016 at 10:02 AM, Karl Wright <[email protected]> wrote:

Hi Luca,

This approach causes each document's binary data to be read more than once. I think that is expensive, especially if there are a lot of values for a row.

Instead I think something more like ACLs will be needed -- that is, a separate query for each multi-valued field. This is more work but it would work much better.

I will create a ticket to add this to the JDBC connector, but it won't happen for a while.

Karl

On Fri, May 6, 2016 at 9:40 AM, Luca Alicata <[email protected]> wrote:

I've decompiled the Java connector and modified the code in this way:

In the processDocuments method I can see that all rows of the query result currently arrive (including the multi-value rows), but in the loop that parses documents, after the first document with a given ID, all the others with the same ID are skipped. So I removed the check that prevents processing further rows with the same ID, and I modified the method that stores metadata so that multi-value data is stored as an array in the metadata mapping.

I attached the code to this e-mail. You can find a comment starting with "---" that I inserted now for you.

Thanks,
L. Alicata

2016-05-06 15:25 GMT+02:00 Karl Wright <[email protected]>:

Ok, it's now clear what you are looking for, but it is still not clear how we'd integrate that in the JDBC connector. How did you do this when you modified the connector for 1.8?

Karl

On Fri, May 6, 2016 at 9:21 AM, Luca Alicata <[email protected]> wrote:

Hi Karl,
sorry for my English :).
I mean that when I extract values with a query joining two tables in a one-to-many relationship, the dataset returned by the connector contains only one pair from the two tables.
For example:
Table A with persons
Table B with eyes

As the result of the join, I expect two rows:
person 1, eye left
person 1, eye right

but the connector returns only one row:
person 1, eye left

I hope it's clearer now.

P.S. Here is the sentence in the ManifoldCF documentation that explains this (https://manifoldcf.apache.org/release/release-2.3/en_US/end-user-documentation.html#jdbcrepository):
------
There is currently no support in the JDBC connection type for natively handling multi-valued metadata.
------

Thanks,
L. Alicata

2016-05-06 15:10 GMT+02:00 Karl Wright <[email protected]>:

Hi Luca,

It is not clear what you mean by "multi value extraction" using the JDBC connector. The JDBC connector allows collection of primary binary content as well as metadata from a database row. So maybe if you can explain what you need beyond that, it would help.

Thanks,
Karl

On Fri, May 6, 2016 at 9:04 AM, Luca Alicata <[email protected]> wrote:

Hi Karl,
thanks for the information. Fortunately, on another JBoss instance I have an old single-process Manifold configuration that I had dismissed. For the moment I am testing these jobs with it, and if it works fine I can use it just for this job, in production as well. Maybe afterwards, if I can, I will try to track down the problem that stops the agent.

I'll take advantage of this discussion to ask whether multi-value extraction from the database is being considered as possible future work, because I've used the Generic connector to work around this limitation of the JDBC connector.
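Luca's person/eyes example, and the first-row-wins behavior he describes, can be reproduced with a small sketch (illustrative only; this uses sqlite3 for a self-contained demo, and the dicts are invented stand-ins for the connector's internal metadata mapping, not ManifoldCF code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE eyes (person_id INTEGER, eye TEXT);
    INSERT INTO persons VALUES (1, 'person 1');
    INSERT INTO eyes VALUES (1, 'eye left'), (1, 'eye right');
""")

rows = conn.execute("""
    SELECT p.id, p.name, e.eye
    FROM persons p JOIN eyes e ON e.person_id = p.id
    ORDER BY e.eye
""").fetchall()

# The one-to-many join really does return two rows for person 1 ...
assert len(rows) == 2

# ... but deduplicating on document ID, first row wins: 'eye right' is lost.
first_only = {}
for pid, name, eye in rows:
    if pid not in first_only:
        first_only[pid] = eye
assert first_only == {1: "eye left"}

# Luca's modification, roughly: accumulate repeated IDs into an array
# instead of skipping them.
multi = {}
for pid, name, eye in rows:
    multi.setdefault(pid, []).append(eye)
assert multi == {1: ["eye left", "eye right"]}
```

The downside Karl points out applies here too: if the query also selected a large binary column, every extra value row would re-read that column.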
In fact, with Manifold 1.8 I had modified the connector to support this behavior (in addition to parsing BLOB files), but when upgrading the Manifold version, to avoid rewriting the new connector I decided to use the Generic connector with an application that does the work of extracting the data from the database.

Thanks,
L. Alicata

2016-05-06 14:42 GMT+02:00 Karl Wright <[email protected]>:

Hi Luca,

If you do a lock clean and the process still stops, then the locks are not the problem.

One way we can drill down into the problem is to get a thread dump of the agents process after it stops. The thread dump must be of the agents process, not any of the others.

FWIW, the generic connector is not well supported; the person who wrote it is still a committer but is not actively involved in MCF development at this time. I suspect that the problem may have to do with how that connector deals with exceptions or errors, but I am not sure.

Thanks,
Karl

On Fri, May 6, 2016 at 8:38 AM, Luca Alicata <[email protected]> wrote:

Hi Karl,
I just tried running lock-clean after the agents stopped working, obviously after stopping the process first. After that, the job started correctly, but the second time I started a job with a lot of data (or sometimes the third time), the agent stopped again.

Unfortunately it is difficult, for the moment, to start using Zookeeper in this environment. But would that fix the agents stopping while they are working, or would it only help with cleaning agent locks when I restart the process?

Thanks,
L. Alicata

2016-05-06 14:15 GMT+02:00 Karl Wright <[email protected]>:

Hi Luca,

With file-based synchronization, if you kill any of the processes involved, you will need to execute the lock-clean procedure to make sure you have no dangling locks in the file system:

- shut down all MCF processes (except the database)
- run the lock-clean script
- start your MCF processes back up

I suspect what you are seeing is related to this.

Also, please consider using Zookeeper instead, since it is more robust about cleaning out dangling locks.

Thanks,
Karl

On Fri, May 6, 2016 at 8:06 AM, Luca Alicata <[email protected]> wrote:

Hi Karl,
thanks for the help.
In my case I have only one instance of MCF running, with both types of job (SP and Generic), and so I have only one properties file (attached). For your information, I am using the multiprocess-file configuration with Postgres.

Do you have other suggestions? Do you need any more information that I can provide?

Thanks,
L. Alicata

2016-05-06 12:55 GMT+02:00 Karl Wright <[email protected]>:

Hi Luca,

Do you have multiple independent MCF clusters running at the same time? It sounds like you do: you have SP on one, and Generic on another. If so, you will need to be sure that the synchronization you are using (either Zookeeper or file-based) does not overlap. Each cluster needs its own synchronization.
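The reasoning behind the lock-clean procedure can be shown with a toy sketch of file-based locking (this is not ManifoldCF's actual lock manager; the file name and acquire logic are invented for illustration): a lock file left behind by a killed process blocks every later acquisition until someone deletes it while all processes are down.

```python
import os
import tempfile

lock_dir = tempfile.mkdtemp()
lock_path = os.path.join(lock_dir, "agents.lock")

def try_acquire(path):
    # O_CREAT | O_EXCL succeeds only if the lock file does not already exist.
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

assert try_acquire(lock_path)       # first process takes the lock ...
# ... and is then killed without ever removing its lock file.

assert not try_acquire(lock_path)   # a restarted process can never acquire it

# The "lock-clean" step: with every process shut down, delete dangling locks.
os.remove(lock_path)
assert try_acquire(lock_path)       # acquisition succeeds again
os.remove(lock_path)                # tidy up
```

This also suggests why two clusters sharing the same synchronization directory could interfere with each other: each would see, and wait on, the other's lock files.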
If there is overlap, then doing things with one cluster may cause the other cluster to hang. This also means you have to have different properties files for the two clusters, of course.

Thanks,
Karl

On Fri, May 6, 2016 at 4:32 AM, Luca Alicata <[email protected]> wrote:

Hi,
I'm using Manifold 2.2 with a multi-process configuration in a JBoss instance on Windows Server 2012, and I have a set of jobs that work with SharePoint (SP) or the Generic Connector (GC), which gets files from a database. With SP I have no problems, while with GC and a lot of documents (one job with 47k and another with 60k) the seeding process sometimes does not finish, because the agents seem to stop (although the Java process is still alive). After this, if I try to start any other job, it does not start either, as if the agents were stopped.

At other times these jobs work correctly, and once they even both worked correctly running at the same time.

For information:

- On JBoss there are only the Manifold and Generic Repository applications.
- On the same virtual server there is another JBoss instance, with a Solr instance and a web application.
- I've checked whether it was some kind of memory problem, but that is not the case.
- GC with about 23k seeds always works, at least in the tests I've done.
- On a local JBoss instance with Manifold and the Generic Repository application, I have not had this problem.
This is the only recurring information that I've seen in manifold.log:
---------------
Connection 0.0.0.0:62755<-><ip-address>:<port> shut down
Releasing connection org.apache.http.impl.conn.ManagedClientConnectionImpl@6c98c1bd
---------------

Thanks,
L. Alicata
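Karl's eventual suggestion at the top of this thread -- a separate query per multi-valued field, analogous to the connector's ACL queries, so that each document's binary content is read only once -- might look roughly like the following sketch (sqlite3 is used only for a self-contained demo; all table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body BLOB);
    CREATE TABLE doc_tags (doc_id INTEGER, tag TEXT);
    INSERT INTO docs VALUES (1, 'person 1', X'DEADBEEF');
    INSERT INTO doc_tags VALUES (1, 'eye left'), (1, 'eye right');
""")

documents = {}
# Main query: exactly one row per document, so the potentially large
# binary column is read only once per document.
for doc_id, title, body in conn.execute("SELECT id, title, body FROM docs"):
    meta = {"title": title}
    # One secondary query per multi-valued field, keyed by document ID,
    # much as the ACL queries are keyed.
    meta["tags"] = [t for (t,) in conn.execute(
        "SELECT tag FROM doc_tags WHERE doc_id = ? ORDER BY tag",
        (doc_id,))]
    documents[doc_id] = (body, meta)
```

Compared with the join approach, this trades one extra (cheap) query per document per multi-valued field for never re-reading the binary column.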
