Hi Luca,

I've put together code that should allow multivalued attributes to be crawled. In order to try it, you will need to check out the CONNECTORS-1313 branch:

svn checkout https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1313

Then, build:

ant make-core-deps
ant build

Please give this a try and see if it works for you.

Thanks,
Karl

On Fri, May 6, 2016 at 10:15 AM, Luca Alicata wrote:

> Hi Karl,
> I can confirm that it is a little expensive, but at that time I didn't have
> much time, and I stopped working on it once I found the solution.
> Thanks for creating the ticket; for the moment, I will try to use the
> generic connector.
>
> Another question: is there another connector that can use an application
> to receive data, like GenericConnector?
>
> Thanks,
> L. Alicata
>
> 2016-05-06 16:02 GMT+02:00 Karl Wright:
>
>> Hi Luca,
>>
>> This approach causes each document's binary data to be read more than
>> once. I think that is expensive, especially if there are a lot of values
>> for a row.
>>
>> Instead I think something more like ACLs will be needed -- that is, a
>> separate query for each multi-valued field. This is more work but it would
>> work much better.
>>
>> I will create a ticket to add this to the JDBC connector, but it won't
>> happen for a while.
>>
>> Karl
>>
>>
>> On Fri, May 6, 2016 at 9:40 AM, Luca Alicata wrote:
>>
>>> I've decompiled the Java connector and modified the code in this way:
>>>
>>> In process document, I see that all rows of the query result already
>>> arrive (including the multi-value rows), but in the loop that parses
>>> documents, after the first document with an ID, all the others with the
>>> same ID are skipped.
>>> So I removed the check that prevents processing other documents with
>>> the same ID, and I modified the method that stores metadata so that it
>>> can store multi-value data as an array in the metadata mapping.
>>>
>>> I attached the code to this e-mail. You can find a comment that starts
>>> with "---", which I have inserted now for you.
>>>
>>> Thanks,
>>> L. Alicata
>>>
>>> 2016-05-06 15:25 GMT+02:00 Karl Wright:
>>>
>>>> Ok, it's now clear what you are looking for, but it is still not clear
>>>> how we'd integrate that in the JDBC connector. How did you do this when
>>>> you modified the connector for 1.8?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, May 6, 2016 at 9:21 AM, Luca Alicata wrote:
>>>>
>>>>> Hi Karl,
>>>>> sorry for my English :).
>>>>> I mean that when I extract values with a query that joins two tables
>>>>> in a one-to-many relationship, the dataset returned by the connector
>>>>> contains only one pair from the two tables.
>>>>>
>>>>> For example:
>>>>> Table A with persons
>>>>> Table B with eyes
>>>>>
>>>>> As the result of the join, I expect to have two rows like:
>>>>> person 1, eye left
>>>>> person 1, eye right
>>>>>
>>>>> but the connector returns only one row:
>>>>> person 1, eye left
>>>>>
>>>>> I hope it's clearer now.
>>>>>
>>>>> P.S. Here is the sentence in the Manifold documentation that explains this (
>>>>> https://manifoldcf.apache.org/release/release-2.3/en_US/end-user-documentation.html#jdbcrepository
>>>>> ):
>>>>> ------
>>>>> There is currently no support in the JDBC connection type for natively
>>>>> handling multi-valued metadata.
>>>>> ------
>>>>>
>>>>> Thanks,
>>>>> L. Alicata
>>>>>
>>>>>
>>>>> 2016-05-06 15:10 GMT+02:00 Karl Wright:
>>>>>
>>>>>> Hi Luca,
>>>>>>
>>>>>> It is not clear what you mean by "multi value extraction" using the
>>>>>> JDBC connector. The JDBC connector allows collection of primary binary
>>>>>> content as well as metadata from a database row. So maybe if you can
>>>>>> explain what you need beyond that it would help.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, May 6, 2016 at 9:04 AM, Luca Alicata wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>> thanks for the information. Fortunately, in another JBoss instance I
>>>>>>> have an old Manifold configuration with a single process, which I had
>>>>>>> retired. For now, I am starting to test these jobs with it, and if it
>>>>>>> works fine, I can use it only for this job and also use it in
>>>>>>> production. Maybe afterwards, if I can, I will try to investigate the
>>>>>>> problem that stops the agent.
>>>>>>>
>>>>>>> I take advantage of this discussion to ask you whether multi-value
>>>>>>> extraction from a DB is considered possible future work or not, because
>>>>>>> I have used this generic connector to work around this gap in the JDBC
>>>>>>> connector. In fact, with Manifold 1.8 I modified the connector to
>>>>>>> support this behavior (in addition to parsing blob files), but when
>>>>>>> upgrading the Manifold version, to avoid rewriting the new connector, I
>>>>>>> decided to use the Generic Connector with an application that does the
>>>>>>> work of extracting data from the DB.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> L. Alicata
>>>>>>>
>>>>>>> 2016-05-06 14:42 GMT+02:00 Karl Wright:
>>>>>>>
>>>>>>>> Hi Luca,
>>>>>>>>
>>>>>>>> If you do a lock clean and the process still stops, then the locks
>>>>>>>> are not the problem.
>>>>>>>>
>>>>>>>> One way we can drill down into the problem is to get a thread dump
>>>>>>>> of the agents process after it stops. The thread dump must be of the
>>>>>>>> agents process, not any of the others.
>>>>>>>>
>>>>>>>> FWIW, the generic connector is not well supported; the person who
>>>>>>>> wrote it is still a committer but is not actively involved in MCF
>>>>>>>> development at this time. I suspect that the problem may have to do
>>>>>>>> with how that connector deals with exceptions or errors, but I am not
>>>>>>>> sure.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 8:38 AM, Luca Alicata wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>> I've just tried lock-clean after the agents stopped working,
>>>>>>>>> obviously after stopping the process. After this, the job starts
>>>>>>>>> correctly, but the second time that I start a job with a lot of data
>>>>>>>>> (or sometimes the third time), the agent stops again.
>>>>>>>>>
>>>>>>>>> Unfortunately, it's difficult for the moment to start using
>>>>>>>>> Zookeeper in this environment, but would it resolve the fact that the
>>>>>>>>> agents stop working while running? Or does it help only with cleaning
>>>>>>>>> agent locks when I restart the process?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> L. Alicata
>>>>>>>>>
>>>>>>>>> 2016-05-06 14:15 GMT+02:00 Karl Wright:
>>>>>>>>>
>>>>>>>>>> Hi Luca,
>>>>>>>>>>
>>>>>>>>>> With file-based synchronization, if you kill any of the processes
>>>>>>>>>> involved, you will need to execute the lock-clean procedure to make
>>>>>>>>>> sure you have no dangling locks in the file system.
>>>>>>>>>>
>>>>>>>>>> - shut down all MCF processes (except the database)
>>>>>>>>>> - run the lock-clean script
>>>>>>>>>> - start your MCF processes back up
>>>>>>>>>>
>>>>>>>>>> I suspect what you are seeing is related to this.
>>>>>>>>>>
>>>>>>>>>> Also, please consider using Zookeeper instead, since it is more
>>>>>>>>>> robust about cleaning out dangling locks.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 8:06 AM, Luca Alicata wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>> thanks for the help.
>>>>>>>>>>> In my case I have only one instance of MCF running, with both types
>>>>>>>>>>> of job (SP and Generic), and so I have only one properties file
>>>>>>>>>>> (which I have attached).
>>>>>>>>>>> For information, I used the multiprocess-file configuration with
>>>>>>>>>>> Postgres.
>>>>>>>>>>>
>>>>>>>>>>> Do you have other suggestions? Do you need more information
>>>>>>>>>>> that I can give you?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> L. Alicata
>>>>>>>>>>>
>>>>>>>>>>> 2016-05-06 12:55 GMT+02:00 Karl Wright:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Luca,
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have multiple independent MCF clusters running at the
>>>>>>>>>>>> same time? It sounds like you do: you have SP on one, and Generic
>>>>>>>>>>>> on another. If so, you will need to be sure that the synchronization
>>>>>>>>>>>> you are using (either zookeeper or file-based) does not overlap. Each
>>>>>>>>>>>> cluster needs its own synchronization. If there is overlap, then
>>>>>>>>>>>> doing things with one cluster may cause the other cluster to hang.
>>>>>>>>>>>> This also means you have to have different properties files for the
>>>>>>>>>>>> two clusters, of course.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 6, 2016 at 4:32 AM, Luca Alicata wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> I'm using Manifold 2.2 with a multi-process configuration in a
>>>>>>>>>>>>> JBoss instance on Windows Server 2012, and I have a set of jobs
>>>>>>>>>>>>> that work with SharePoint (SP) or the Generic Connector (GC), which
>>>>>>>>>>>>> gets files from a DB.
>>>>>>>>>>>>> With SP I have no problem, while with GC with a lot of documents
>>>>>>>>>>>>> (one job with 47k and another with 60k), the seeding process
>>>>>>>>>>>>> sometimes does not finish, because the agents seem to stop (although
>>>>>>>>>>>>> the Java process is still alive).
>>>>>>>>>>>>> After this, if I try to start any other job, it does not start,
>>>>>>>>>>>>> as if the agents were stopped.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Other times, these jobs work correctly, and one time they both
>>>>>>>>>>>>> worked correctly while running at the same moment.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For information:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - On JBoss there are only the Manifold and Generic Repository
>>>>>>>>>>>>>   applications.
>>>>>>>>>>>>> - On the same virtual server, there is another JBoss instance,
>>>>>>>>>>>>>   with a Solr instance and a web application.
>>>>>>>>>>>>> - I've checked whether it was some kind of memory problem, but
>>>>>>>>>>>>>   it's not the case.
>>>>>>>>>>>>> - GC with almost 23k seeds always works, at least in the tests
>>>>>>>>>>>>>   that I've done.
>>>>>>>>>>>>> - In a local instance of JBoss with Manifold and the Generic
>>>>>>>>>>>>>   Repository application, I have not seen this problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is the only recurrent information that I've seen in
>>>>>>>>>>>>> manifold.log:
>>>>>>>>>>>>> ---------------
>>>>>>>>>>>>> Connection 0.0.0.0:62755<-><ip-address>:<port> shut down
>>>>>>>>>>>>> Releasing connection
>>>>>>>>>>>>> org.apache.http.impl.conn.ManagedClientConnectionImpl@6c98c1bd
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> L. Alicata
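
The multi-valued case discussed in this thread (one person row joined to several eye rows) amounts to folding the rows of a one-to-many join into one record per ID, with a list of values for the repeated field. Here is a minimal sketch of that folding in plain Java. It is only an illustration: the class `MultiValueGrouping` and the method `groupById` are hypothetical names, the rows are hard-coded rather than read over JDBC, and none of this is taken from the ManifoldCF JDBC connector or the CONNECTORS-1313 branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValueGrouping {

    /**
     * Hypothetical helper: folds (id, value) rows, as produced by a
     * one-to-many join, into one list of values per id.
     * Insertion order of ids and values is preserved.
     */
    public static Map<String, List<String>> groupById(List<String[]> rows) {
        Map<String, List<String>> byId = new LinkedHashMap<>();
        for (String[] row : rows) {
            String id = row[0];
            String value = row[1];
            // Create the list the first time an id is seen, then append.
            byId.computeIfAbsent(id, k -> new ArrayList<>()).add(value);
        }
        return byId;
    }

    public static void main(String[] args) {
        // The two join rows from the example in the thread.
        List<String[]> rows = Arrays.asList(
            new String[] {"person 1", "eye left"},
            new String[] {"person 1", "eye right"});

        // Prints {person 1=[eye left, eye right]}
        System.out.println(groupById(rows));
    }
}
```

With the example rows from the thread, person 1 ends up with both "eye left" and "eye right" instead of only the first row.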
