I decompiled the Java connector and modified the code in this way:
in processDocuments I can see that all rows of the query result now arrive
(including the multi-value rows), but in the loop that parses the documents,
after the first document with a given ID, all the others with the same ID are
skipped.
So I removed the check that prevents processing further documents with the
same ID, and I modified the method that stores metadata so that multi-value
data can be stored as an array in the metadata mapping.
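
To make this concrete, here is a small self-contained sketch (not the connector
code itself; the class name, helper method and sample values are only
illustrative) of what the grouping does: rows of a one-to-many join that share
the same ID are merged into one document whose fields can hold several values.

    import java.util.*;

    // Sketch only: every row's non-id columns are collected per id, so one document
    // can carry multi-valued fields.  (The attached connector code does this for
    // consecutive rows sharing the same id.)
    public class MultiValueGroupingSketch {

        static Map<String, Map<String, List<String>>> group(List<Map<String, String>> rows, String idColumn) {
            Map<String, Map<String, List<String>>> docs = new LinkedHashMap<String, Map<String, List<String>>>();
            for (Map<String, String> row : rows) {
                String id = row.get(idColumn);
                Map<String, List<String>> fields = docs.get(id);
                if (fields == null) {
                    fields = new LinkedHashMap<String, List<String>>();
                    docs.put(id, fields);
                }
                for (Map.Entry<String, String> e : row.entrySet()) {
                    if (e.getKey().equals(idColumn))
                        continue;
                    List<String> values = fields.get(e.getKey());
                    if (values == null) {
                        values = new ArrayList<String>();
                        fields.put(e.getKey(), values);
                    }
                    values.add(e.getValue());
                }
            }
            return docs;
        }

        public static void main(String[] args) {
            Map<String, String> r1 = new LinkedHashMap<String, String>();
            r1.put("id", "person 1");
            r1.put("eye", "eye left");
            Map<String, String> r2 = new LinkedHashMap<String, String>();
            r2.put("id", "person 1");
            r2.put("eye", "eye right");
            // Prints: {person 1={eye=[eye left, eye right]}}
            System.out.println(group(Arrays.asList(r1, r2), "id"));
        }
    }
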
I attached the code to this e-mail. You can find the comments that start with
"---", which I inserted now for you.
Thanks,
L. Alicata
2016-05-06 15:25 GMT+02:00 Karl Wright <[email protected]>:
> Ok, it's now clear what you are looking for, but it is still not clear how
> we'd integrate that in the JDBC connector. How did you do this when you
> modified the connector for 1.8?
>
> Karl
>
>
> On Fri, May 6, 2016 at 9:21 AM, Luca Alicata <[email protected]>
> wrote:
>
>> Hi Karl,
>> sorry for my english :).
>> I mean that I have to extract values from a query with a join between
>> two tables in a one-to-many relationship, but the dataset returned by the
>> connector contains only one pair from the two tables.
>>
>> For example:
>> Table A with persons
>> Table B with eyes
>>
>> As a result of the join, I expect to get two rows like:
>> person 1, eye left
>> person 1, eye right
>>
>> but the connector returns only one row:
>> person 1, eye left
>>
>> I hope now it's more clear.
>>
>> P.S. I quote the sentence from the ManifoldCF documentation that explains this (
>> https://manifoldcf.apache.org/release/release-2.3/en_US/end-user-documentation.html#jdbcrepository
>> ):
>> ------
>> There is currently no support in the JDBC connection type for natively
>> handling multi-valued metadata.
>> ------
>>
>> Thanks,
>> L. Alicata
>>
>>
>> 2016-05-06 15:10 GMT+02:00 Karl Wright <[email protected]>:
>>
>>> Hi Luca,
>>>
>>> It is not clear what you mean by "multi value extraction" using the JDBC
>>> connector. The JDBC connector allows collection of primary binary content
>>> as well as metadata from a database row. So maybe if you can explain what
>>> you need beyond that it would help.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, May 6, 2016 at 9:04 AM, Luca Alicata <[email protected]>
>>> wrote:
>>>
>>>> Hi Karl,
>>>> thanks for the information. Fortunately, on another JBoss instance I have an
>>>> old single-process Manifold configuration that I had dismissed. For the moment
>>>> I am starting to test these jobs with it, and if it works fine I can use it
>>>> just for this job and use it in production as well. Maybe later, if I can, I
>>>> will try to track down the problem that stops the agents.
>>>>
>>>> I take advantage of this discussion to ask whether multi-value extraction
>>>> from the DB is being considered as possible future work. I have used the
>>>> Generic connector to work around this limitation of the JDBC connector. In
>>>> fact, with Manifold 1.8 I had modified the connector to support this behavior
>>>> (in addition to parsing blob files), but after upgrading the Manifold version,
>>>> rather than rewriting the connector again, I decided to use the Generic
>>>> connector together with an application that extracts the data from the DB.
>>>>
>>>> Thanks,
>>>> L. Alicata
>>>>
>>>> 2016-05-06 14:42 GMT+02:00 Karl Wright <[email protected]>:
>>>>
>>>>> Hi Luca,
>>>>>
>>>>> If you do a lock clean and the process still stops, then the locks are
>>>>> not the problem.
>>>>>
>>>>> One way we can drill down into the problem is to get a thread dump of
>>>>> the agents process after it stops. The thread dump must be of the agents
>>>>> process, not any of the others.
>>>>>
>>>>> FWIW, the generic connector is not well supported; the person who
>>>>> wrote it is still a committer but is not actively involved in MCF
>>>>> development at this time. I suspect that the problem may have to do with
>>>>> how that connector deals with exceptions or errors, but I am not sure.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, May 6, 2016 at 8:38 AM, Luca Alicata <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>> I have just tried running lock-clean after the agents stopped working,
>>>>>> obviously after stopping the process. After this, jobs start correctly, but
>>>>>> the second time I start a job with a lot of data (or sometimes the third
>>>>>> time), the agents stop again.
>>>>>>
>>>>>> Unfortunately, for the moment it is difficult to start using Zookeeper in
>>>>>> this environment. But would that fix the fact that the agents stop while
>>>>>> working, or would it only help with cleaning up the agents' locks when I
>>>>>> restart the process?
>>>>>>
>>>>>> Thanks,
>>>>>> L. Alicata
>>>>>>
>>>>>> 2016-05-06 14:15 GMT+02:00 Karl Wright <[email protected]>:
>>>>>>
>>>>>>> Hi Luca,
>>>>>>>
>>>>>>> With file-based synchronization, if you kill any of the processes
>>>>>>> involved, you will need to execute the lock-clean procedure to make sure
>>>>>>> you have no dangling locks in the file system.
>>>>>>>
>>>>>>> - shut down all MCF processes (except the database)
>>>>>>> - run the lock-clean script
>>>>>>> - start your MCF processes back up
>>>>>>>
>>>>>>> I suspect what you are seeing is related to this.
>>>>>>>
>>>>>>> Also, please consider using Zookeeper instead, since it is more
>>>>>>> robust about cleaning out dangling locks.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 8:06 AM, Luca Alicata <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>> thanks for the help.
>>>>>>>> In my case I have only one instance of MCF running, with both types of
>>>>>>>> jobs (SP and Generic), and so I have only one properties file (which I
>>>>>>>> have attached).
>>>>>>>> For information, I use the multiprocess-file configuration with
>>>>>>>> Postgres.
>>>>>>>>
>>>>>>>> Do you have other suggestions? Do you need more information that I
>>>>>>>> can give you?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> L.Alicata
>>>>>>>>
>>>>>>>> 2016-05-06 12:55 GMT+02:00 Karl Wright <[email protected]>:
>>>>>>>>
>>>>>>>>> Hi Luca,
>>>>>>>>>
>>>>>>>>> Do you have multiple independent MCF clusters running at the same
>>>>>>>>> time? It sounds like you do: you have SP on one, and Generic on
>>>>>>>>> another.
>>>>>>>>> If so, you will need to be sure that the synchronization you are using
>>>>>>>>> (either zookeeper or file-based) does not overlap. Each cluster
>>>>>>>>> needs its
>>>>>>>>> own synchronization. If there is overlap, then doing things with one
>>>>>>>>> cluster may cause the other cluster to hang. This also means you
>>>>>>>>> have to
>>>>>>>>> have different properties files for the two clusters, of course.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 4:32 AM, Luca Alicata <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I'm using Manifold 2.2 with the multi-process configuration in a JBoss
>>>>>>>>>> instance on Windows Server 2012, and I have a set of jobs that work with
>>>>>>>>>> SharePoint (SP) or the Generic Connector (GC), which gets files from a DB.
>>>>>>>>>> With SP I have no problem, while with GC and a lot of documents (one job
>>>>>>>>>> with 47k and another with 60k), the seeding process sometimes does not
>>>>>>>>>> finish, because the agents seem to stop (although the Java process is
>>>>>>>>>> still alive).
>>>>>>>>>> After this, if I try to start any other job, it does not start, as if
>>>>>>>>>> the agents were stopped.
>>>>>>>>>>
>>>>>>>>>> Other times these jobs work correctly, and once they even worked
>>>>>>>>>> correctly while running at the same time.
>>>>>>>>>>
>>>>>>>>>> For information:
>>>>>>>>>>
>>>>>>>>>>    - On JBoss there are only Manifold and the Generic Repository
>>>>>>>>>>    application.
>>>>>>>>>>
>>>>>>>>>>    - On the same virtual server there is another JBoss instance, with a
>>>>>>>>>>    Solr instance and a web application.
>>>>>>>>>>
>>>>>>>>>>    - I've checked whether it was some kind of memory problem, but that's
>>>>>>>>>>    not the case.
>>>>>>>>>>
>>>>>>>>>>    - GC with about 23k seeds always works, at least in the tests I've
>>>>>>>>>>    done.
>>>>>>>>>>
>>>>>>>>>>    - In a local JBoss instance with Manifold and the Generic Repository
>>>>>>>>>>    application, I have not had this problem.
>>>>>>>>>>
>>>>>>>>>> This is the only recurrent information I've seen in manifold.log:
>>>>>>>>>> ---------------
>>>>>>>>>> Connection 0.0.0.0:62755<-><ip-address>:<port> shut down
>>>>>>>>>> Releasing connection
>>>>>>>>>> org.apache.http.impl.conn.ManagedClientConnectionImpl@6c98c1bd
>>>>>>>>>>
>>>>>>>>>> ---------------
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> L. Alicata
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
@Override
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses,
  Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
  throws ManifoldCFException, ServiceInterruption
{
  TableSpec ts = new TableSpec(spec);
  Set<String> acls = ts.getAcls();
  String[] versionsReturned = new String[documentIdentifiers.length];
  // If there is no version query, then always return empty string for all documents.
  // This will mean that processDocuments will be called for all.  ProcessDocuments
  // will then be responsible for doing document deletes itself, based on the query results.
  Map<String, String> documentVersions = new HashMap<String, String>();
  if (ts.versionQuery != null && ts.versionQuery.length() > 0)
  {
    // If there IS a versions query, do it.  First set up the variables, then do the substitution.
    VariableMap vm = new VariableMap();
    addConstant(vm, JDBCConstants.idReturnVariable, JDBCConstants.idReturnColumnName);
    addConstant(vm, JDBCConstants.versionReturnVariable, JDBCConstants.versionReturnColumnName);
    if (addIDList(vm, JDBCConstants.idListVariable, documentIdentifiers, null))
    {
      // Do the substitution
      ArrayList paramList = new ArrayList();
      StringBuilder sb = new StringBuilder();
      substituteQuery(ts.versionQuery, vm, sb, paramList);
      // Now, build a result return, and a hash table so we can correlate the returned
      // values with the place to put them.
      // We presume that if the row is missing, the document is gone.
      // Fire off the query!
      getSession();
      IDynamicResultSet result;
      String queryText = sb.toString();
      long startTime = System.currentTimeMillis();
      // Get a dynamic resultset.  Contract for dynamic resultset is that if
      // one is returned, it MUST be closed, or a connection will leak.
      try
      {
        result = connection.executeUncachedQuery(queryText, paramList, -1);
      }
      catch (ManifoldCFException e)
      {
        // If failure, record the failure.
        if (e.getErrorCode() != ManifoldCFException.INTERRUPTED)
          activities.recordActivity(new Long(startTime), ACTIVITY_EXTERNAL_QUERY, null,
            createQueryString(queryText, paramList), "ERROR", e.getMessage(), null);
        throw e;
      }
      try
      {
        // If success, record that too.
        activities.recordActivity(new Long(startTime), ACTIVITY_EXTERNAL_QUERY, null,
          createQueryString(queryText, paramList), "OK", null, null);
        // Now, go through resultset
        while (true)
        {
          IDynamicResultRow row = result.getNextRow();
          if (row == null)
            break;
          try
          {
            Object o = row.getValue(JDBCConstants.idReturnColumnName);
            if (o == null)
              throw new ManifoldCFException("Bad version query; doesn't return $(IDCOLUMN) column.  Try using quotes around $(IDCOLUMN) variable, e.g. \"$(IDCOLUMN)\", or, for MySQL, select \"by label\" in your repository connection.");
            String idValue = JDBCConnection.readAsString(o);
            o = row.getValue(JDBCConstants.versionReturnColumnName);
            String versionValue;
            // Null version is OK; make it a ""
            if (o == null)
              versionValue = "";
            else
              versionValue = JDBCConnection.readAsString(o);
            documentVersions.put(idValue, versionValue);
          }
          finally
          {
            row.close();
          }
        }
      }
      finally
      {
        result.close();
      }
    }
  }
  else
  {
    for (String documentIdentifier : documentIdentifiers)
    {
      // Logging.connectors.warn("String documentIdentifier : documentIdentifiers: " + documentIdentifier);
      documentVersions.put(documentIdentifier, "");
    }
  }
  // Delete the documents that had no version, and work only on ones that did
  Set<String> fetchDocuments = documentVersions.keySet();
  for (String documentIdentifier : documentIdentifiers)
  {
    String documentVersion = documentVersions.get(documentIdentifier);
    if (documentVersion == null)
    {
      // Logging.connectors.warn("deleteDocument : documentIdentifiers: " + documentIdentifier);
      activities.deleteDocument(documentIdentifier);
      continue;
    }
  }
  // Pick up document acls
  Map<String, Set<String>> documentAcls = new HashMap<String, Set<String>>();
  if (ts.securityOn)
  {
    if (acls.size() == 0 && ts.aclQuery != null && ts.aclQuery.length() > 0)
    {
      // If there IS an acls query, do it.  First set up the variables, then do the substitution.
      VariableMap vm = new VariableMap();
      addConstant(vm, JDBCConstants.idReturnVariable, JDBCConstants.idReturnColumnName);
      addConstant(vm, JDBCConstants.tokenReturnVariable, JDBCConstants.tokenReturnColumnName);
      if (addIDList(vm, JDBCConstants.idListVariable, documentIdentifiers, fetchDocuments))
      {
        // Do the substitution
        ArrayList paramList = new ArrayList();
        StringBuilder sb = new StringBuilder();
        substituteQuery(ts.aclQuery, vm, sb, paramList);
        // Fire off the query!
        getSession();
        IDynamicResultSet result;
        String queryText = sb.toString();
        long startTime = System.currentTimeMillis();
        // Get a dynamic resultset.  Contract for dynamic resultset is that if
        // one is returned, it MUST be closed, or a connection will leak.
        try
        {
          result = connection.executeUncachedQuery(queryText, paramList, -1);
        }
        catch (ManifoldCFException e)
        {
          // If failure, record the failure.
          if (e.getErrorCode() != ManifoldCFException.INTERRUPTED)
            activities.recordActivity(new Long(startTime), ACTIVITY_EXTERNAL_QUERY, null,
              createQueryString(queryText, paramList), "ERROR", e.getMessage(), null);
          throw e;
        }
        try
        {
          // If success, record that too.
          activities.recordActivity(new Long(startTime), ACTIVITY_EXTERNAL_QUERY, null,
            createQueryString(queryText, paramList), "OK", null, null);
          // Now, go through resultset
          while (true)
          {
            IDynamicResultRow row = result.getNextRow();
            if (row == null)
              break;
            try
            {
              Object o = row.getValue(JDBCConstants.idReturnColumnName);
              if (o == null)
                throw new ManifoldCFException("Bad acl query; doesn't return $(IDCOLUMN) column.  Try using quotes around $(IDCOLUMN) variable, e.g. \"$(IDCOLUMN)\", or, for MySQL, select \"by label\" in your repository connection.");
              String idValue = JDBCConnection.readAsString(o);
              o = row.getValue(JDBCConstants.tokenReturnColumnName);
              String tokenValue;
              if (o == null)
                tokenValue = "";
              else
                tokenValue = JDBCConnection.readAsString(o);
              // Versions that are "", when processed, will have their acls fetched at that time...
              Set<String> dcs = documentAcls.get(idValue);
              if (dcs == null)
              {
                dcs = new HashSet<String>();
                documentAcls.put(idValue, dcs);
              }
              dcs.add(tokenValue);
            }
            finally
            {
              row.close();
            }
          }
        }
        finally
        {
          result.close();
        }
      }
    }
    else
    {
      for (String documentIdentifier : fetchDocuments)
      {
        documentAcls.put(documentIdentifier, acls);
      }
    }
  }
  Map<String, String> map = new HashMap<String, String>();
  for (String documentIdentifier : fetchDocuments)
  {
    String documentVersion = documentVersions.get(documentIdentifier);
    if (documentVersion.length() == 0)
    {
      map.put(documentIdentifier, documentVersion);
    }
    else
    {
      // Compute a full version string
      StringBuilder sb = new StringBuilder();
      Set<String> dAcls = documentAcls.get(documentIdentifier);
      if (dAcls == null)
        sb.append('-');
      else
      {
        sb.append('+');
        String[] aclValues = new String[dAcls.size()];
        int k = 0;
        for (String acl : dAcls)
        {
          aclValues[k++] = acl;
        }
        java.util.Arrays.sort(aclValues);
        packList(sb, aclValues, '+');
      }
      sb.append(documentVersion).append("=").append(ts.dataQuery);
      String versionValue = sb.toString();
      if (activities.checkDocumentNeedsReindexing(documentIdentifier, versionValue))
      {
        map.put(documentIdentifier, versionValue);
      }
    }
  }
  // For all the documents not marked "scan only", form a query and pick up the contents.
  // If the contents is not found, then explicitly call the delete action method.
  VariableMap vm = new VariableMap();
  addConstant(vm, JDBCConstants.idReturnVariable, JDBCConstants.idReturnColumnName);
  addConstant(vm, JDBCConstants.urlReturnVariable, JDBCConstants.urlReturnColumnName);
  addConstant(vm, JDBCConstants.dataReturnVariable, JDBCConstants.dataReturnColumnName);
  addConstant(vm, JDBCConstants.contentTypeReturnVariable, JDBCConstants.contentTypeReturnColumnName);
  if (!addIDList(vm, JDBCConstants.idListVariable, documentIdentifiers, map.keySet()))
    return;
  // Do the substitution
  ArrayList paramList = new ArrayList();
  StringBuilder sb = new StringBuilder();
  substituteQuery(ts.dataQuery, vm, sb, paramList);
  // Execute the query
  getSession();
  IDynamicResultSet result;
  String queryText = sb.toString();
  Logging.connectors.warn("queryText: " + queryText);
  long startTime = System.currentTimeMillis();
  // Get a dynamic resultset.  Contract for dynamic resultset is that if
  // one is returned, it MUST be closed, or a connection will leak.
  try
  {
    result = connection.executeUncachedQuery(queryText, paramList, -1);
  }
  catch (ManifoldCFException e)
  {
    // If failure, record the failure.
    activities.recordActivity(new Long(startTime), ACTIVITY_EXTERNAL_QUERY, null,
      createQueryString(queryText, paramList), "ERROR", e.getMessage(), null);
    throw e;
  }
  try
  {
    // If success, record that too.
    activities.recordActivity(new Long(startTime), ACTIVITY_EXTERNAL_QUERY, null,
      createQueryString(queryText, paramList), "OK", null, null);
    IDynamicResultRow row = result.getNextRow();
    oldId = "";
    metadataMap.clear();
    Object o;
    String id = null;
    // Guard against an empty result set before reading the first id value.
    if (row != null)
    {
      o = row.getValue(JDBCConstants.idReturnColumnName);
      if (o == null)
        throw new ManifoldCFException("Bad document query; doesn't return $(IDCOLUMN) column.  Try using quotes around $(IDCOLUMN) variable, e.g. \"$(IDCOLUMN)\", or, for MySQL, select \"by label\" in your repository connection.");
      id = JDBCConnection.readAsString(o);
    }
    while (true)
    {
      if (row == null)
        break;
      try
      {
        Logging.connectors.warn("id: " + id);
        String errorCode = null;
        String errorDesc = null;
        Long fileLengthLong = null;
        long fetchStartTime = System.currentTimeMillis();
        try
        {
          // Logging.connectors.warn("**************** CHECK ******************");
          // Logging.connectors.warn("**************** oldId ******************: " + oldId);
          // Logging.connectors.warn("**************** id ******************: " + id);
          // --- here I removed the check that blocked processing of further rows with the
          // same id; consecutive rows sharing the same id are accumulated as multi-value data.
          do
          {
            storeMetadataInArray(row);
            oldId = id;
            oldRow = row;
            row = result.getNextRow();
            if (row == null)
            {
              break;
            }
            Object otmp = row.getValue(JDBCConstants.idReturnColumnName);
            if (otmp == null)
              throw new ManifoldCFException("Bad document query; doesn't return $(IDCOLUMN) column.  Try using quotes around $(IDCOLUMN) variable, e.g. \"$(IDCOLUMN)\", or, for MySQL, select \"by label\" in your repository connection.");
            id = JDBCConnection.readAsString(otmp);
          } while (oldId.equals(id));
          String version = map.get(oldId);
          RepositoryDocument rd = new RepositoryDocument();
          o = oldRow.getValue(JDBCConstants.urlReturnColumnName);
          if (o == null)
          {
            Logging.connectors.warn("JDBC: Document '" + oldId + "' has a null url - skipping");
            errorCode = activities.NULL_URL;
            errorDesc = "Excluded because document had a null URL";
            activities.noDocument(oldId, version);
            continue;
          }
          // This is not right - url can apparently be a BinaryInput
          String url = JDBCConnection.readAsString(o);
          boolean validURL;
          try
          {
            // Check to be sure url is valid
            new java.net.URI(url);
            validURL = true;
          }
          catch (java.net.URISyntaxException e)
          {
            validURL = false;
          }
          if (!validURL)
          {
            Logging.connectors.warn("JDBC: Document '" + oldId + "' has an illegal url: '" + url + "' - skipping");
            errorCode = activities.BAD_URL;
            errorDesc = "Excluded because document had illegal URL ('" + url + "')";
            activities.noDocument(oldId, version);
            continue;
          }
          // Process the document itself
          Object contents = oldRow.getValue(JDBCConstants.dataReturnColumnName);
          // Logging.connectors.warn("JDBC: Document contents this time: " + contents);
          // Null data is allowed; we just ignore these
          if (contents == null)
          {
            Logging.connectors.warn("JDBC: Document '" + oldId + "' seems to have null data - skipping");
            errorCode = "NULLDATA";
            errorDesc = "Excluded because document had null data";
            activities.noDocument(id, version);
            continue;
          }
          String contentType;
          o = oldRow.getValue(JDBCConstants.contentTypeReturnColumnName);
          if (o != null)
            contentType = JDBCConnection.readAsString(o);
          else
          {
            if (contents instanceof BinaryInput)
              contentType = "application/octet-stream";
            else if (contents instanceof CharacterInput)
              contentType = "text/plain; charset=utf-8";
            else
              contentType = "text/plain";
          }
          if (!activities.checkMimeTypeIndexable(contentType))
          {
            Logging.connectors.warn("JDBC: Document '" + oldId + "' excluded because of mime type - skipping");
            errorCode = activities.EXCLUDED_MIMETYPE;
            errorDesc = "Excluded because of mime type (" + contentType + ")";
            activities.noDocument(id, version);
            continue;
          }
          if (!activities.checkURLIndexable(url))
          {
            Logging.connectors.warn("JDBC: Document '" + oldId + "' excluded because of url - skipping");
            errorCode = activities.EXCLUDED_URL;
            errorDesc = "Excluded because of URL ('" + url + "')";
            activities.noDocument(oldId, version);
            continue;
          }
          // case where the id changes, so the previous row's data has to be saved
          // String idTmp = oldId;
          // oldId = id;
          // id = idTmp;
          // We will ingest something, so remove this id from the map in order that
          // we know what we still need to delete when all done.
          rd.setMimeType(contentType);
          applyAccessTokens(rd, documentAcls.get(oldId));
          // if (row != null) {
          applyMetadata(rd, oldRow);
          // }
          metadataMap.clear();
          map.remove(oldId);
          // contents = oldContents;
          version = "";
          url = oldId;
          // Logging.connectors.warn("**************** POST CHECK ******************");
          // Logging.connectors.warn("**************** POST oldId ******************: " + oldId);
          // Logging.connectors.warn("**************** POST id ******************: " + id);
          // Logging.connectors.warn("**************** POST version ******************: " + version);
          // Logging.connectors.warn("**************** POST contents ******************: " + contents);
          // Logging.connectors.warn("**************** POST url ******************: " + url);
          if (contents instanceof BinaryInput)
          {
            // Logging.connectors.warn("contents instanceof BinaryInput");
            BinaryInput bi = (BinaryInput) contents;
            long fileLength = bi.getLength();
            if (!activities.checkLengthIndexable(fileLength))
            {
              Logging.connectors.warn("JDBC: Document '" + oldId + "' excluded because of length - skipping");
              errorCode = activities.EXCLUDED_LENGTH;
              errorDesc = "Excluded because of length (" + fileLength + ")";
              activities.noDocument(oldId, version);
              continue;
            }
            try
            {
              // Read the stream
              InputStream is = bi.getStream();
              try
              {
                rd.setBinary(is, fileLength);
                activities.ingestDocumentWithException(oldId, version, url, rd);
                errorCode = "OK";
                fileLengthLong = new Long(fileLength);
              }
              finally
              {
                is.close();
              }
            }
            catch (IOException e)
            {
              errorCode = e.getClass().getSimpleName().toUpperCase(Locale.ROOT);
              errorDesc = e.getMessage();
              handleIOException(id, e);
            }
          }
          else if (contents instanceof CharacterInput)
          {
            // Logging.connectors.warn("contents instanceof CharacterInput");
            CharacterInput ci = (CharacterInput) contents;
            long fileLength = ci.getUtf8StreamLength();
            if (!activities.checkLengthIndexable(fileLength))
            {
              Logging.connectors.warn("JDBC: Document '" + oldId + "' excluded because of length - skipping");
              errorCode = activities.EXCLUDED_LENGTH;
              errorDesc = "Excluded because of length (" + fileLength + ")";
              activities.noDocument(oldId, version);
              continue;
            }
            try
            {
              // Read the stream
              InputStream is = ci.getUtf8Stream();
              try
              {
                rd.setBinary(is, fileLength);
                activities.ingestDocumentWithException(oldId, version, url, rd);
                errorCode = "OK";
                fileLengthLong = new Long(fileLength);
              }
              finally
              {
                is.close();
              }
            }
            catch (IOException e)
            {
              errorCode = e.getClass().getSimpleName().toUpperCase(Locale.ROOT);
              errorDesc = e.getMessage();
              handleIOException(id, e);
            }
          }
          else
          {
            // Logging.connectors.warn("contents instanceof nothing " + contents.toString());
            // Turn it into a string, and then into a stream
            String value = contents.toString();
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            long fileLength = bytes.length;
            if (!activities.checkLengthIndexable(fileLength))
            {
              Logging.connectors.warn("JDBC: Document '" + oldId + "' excluded because of length - skipping");
              errorCode = activities.EXCLUDED_LENGTH;
              errorDesc = "Excluded because of length (" + fileLength + ")";
              activities.noDocument(oldId, version);
              continue;
            }
            try
            {
              InputStream is = new ByteArrayInputStream(bytes);
              try
              {
                rd.setBinary(is, fileLength);
                activities.ingestDocumentWithException(oldId, version, url, rd);
                errorCode = "OK";
                fileLengthLong = new Long(fileLength);
              }
              finally
              {
                is.close();
              }
            }
            catch (IOException e)
            {
              errorCode = e.getClass().getSimpleName().toUpperCase(Locale.ROOT);
              errorDesc = e.getMessage();
              handleIOException(oldId, e);
            }
          }
        }
        catch (ManifoldCFException e)
        {
          if (e.getErrorCode() == ManifoldCFException.INTERRUPTED)
            errorCode = null;
          throw e;
        }
        finally
        {
          if (errorCode != null)
            activities.recordActivity(new Long(fetchStartTime), ACTIVITY_FETCH,
              fileLengthLong, id, errorCode, errorDesc, null);
        }
      }
      finally
      {
      }
    }
    if (row != null)
    {
      row.close();
    }
  }
  finally
  {
    result.close();
  }
  // Now, go through the original id's, and see which ones are still in the map.  These
  // did not appear in the result and are presumed to be gone from the database, and thus
  // must be deleted.
  for (String documentIdentifier : documentIdentifiers)
  {
    if (fetchDocuments.contains(documentIdentifier))
    {
      String documentVersion = map.get(documentIdentifier);
      if (documentVersion != null)
      {
        // This means we did not see it (or data for it) in the result set.  Delete it!
        activities.noDocument(documentIdentifier, documentVersion);
        activities.recordActivity(null, ACTIVITY_FETCH, null, documentIdentifier,
          "NOTFETCHED", "Document was not seen by processing query", null);
      }
    }
  }
}
// --- modified to support multi-value metadata stored as an array
/**
 * Apply metadata to a repository document.
 *
 * @param rd is the repository document to apply the metadata to.
 * @param row is the resultset row to use to get the metadata.  All non-special
 *   columns from this row will be considered to be metadata.
 */
protected void applyMetadata(RepositoryDocument rd, IResultRow row) throws ManifoldCFException
{
  // Cycle through the row's columns
  Iterator iter = row.getColumns();
  // int i = 0;
  while (iter.hasNext())
  {
    String columnName = (String) iter.next();
    // Logging.connectors.warn("JDBC: columnName for " + i + ": " + columnName);
    if (documentKnownColumns.get(columnName) == null)
    {
      // Logging.connectors.warn("Inside if (documentKnownColumns.get(columnName) == null) for columnName: " + columnName);
      // Consider this column to contain metadata.
      // We can only accept non-binary metadata at this time.
      Object metadata = row.getValue(columnName);
      // Logging.connectors.warn("JDBCConnection.readAsString(metadata): " + JDBCConnection.readAsString(metadata) + " for columnName: " + columnName);
      if (metadataMap.containsKey(columnName))
      {
        Logging.connectors.warn("is in metadataMap, columnName: " + columnName);
        String[] arrayOfMetadata = new String[metadataMap.get(columnName).size()];
        arrayOfMetadata = metadataMap.get(columnName).toArray(arrayOfMetadata);
        // for (int l = 0; l < arrayOfMetadata.length; l++) {
        //   Logging.connectors.warn("arrayOfMetadata: " + arrayOfMetadata[l] + ", for columnName: " + columnName);
        // }
        rd.addField(columnName, arrayOfMetadata);
        metadataMap.remove(columnName);
      }
      else
      {
        Logging.connectors.warn("not in metadataMap, columnName: " + columnName);
        rd.addField(columnName, JDBCConnection.readAsString(metadata));
      }
    }
    // i++;
  }
  // Add any accumulated multi-value columns that did not appear in this row
  for (String columnName : metadataMap.keySet())
  {
    String[] arrayOfMetadata = new String[metadataMap.get(columnName).size()];
    arrayOfMetadata = metadataMap.get(columnName).toArray(arrayOfMetadata);
    rd.addField(columnName, arrayOfMetadata);
  }
}
// --- method created to store multi-value data
/**
 * Store a result row's metadata values into the metadata map, accumulating
 * multi-value columns.
 *
 * @param row is the resultset row to use to get the metadata.  All non-special
 *   columns from this row will be considered to be metadata.
 */
protected void storeMetadataInArray(IResultRow row) throws ManifoldCFException
{
  // Cycle through the row's columns
  Iterator iter = row.getColumns();
  while (iter.hasNext())
  {
    String columnName = (String) iter.next();
    if (documentKnownColumns.get(columnName) == null)
    {
      Object metadata = row.getValue(columnName);
      // Logging.connectors.warn("extracting metadata for columnName: " + columnName);
      String metadataString;
      // --- _filedata is a keyword indicating a URL string that points to a file
      if (columnName.toLowerCase().contains("_filedata")) {
        metadataString = extractDocumentFromUrl(JDBCConnection.readAsString(metadata));
      } else if (columnName.toLowerCase().contains("_blobfile")) {
        // --- _blobfile is a keyword indicating a blob file
        if (metadata instanceof BinaryInput) {
          // Logging.connectors.warn("extracting metadata as BinaryInput for columnName: " + columnName);
          try {
            InputStream stream = ((BinaryInput) metadata).getStream();
            metadataString = tikaParse(stream);
            // Logging.connectors.warn("extracting metadata as BinaryInput for columnName: " + columnName + ", WENT WELL!");
          } catch (TikaTransformerException e) {
            e.printStackTrace();
            metadataString = "";
          } catch (Exception e) {
            e.printStackTrace();
            metadataString = "";
          }
        } else {
          metadataString = "";
        }
      } else {
        metadataString = JDBCConnection.readAsString(metadata);
      }
      if (metadataMap.get(columnName) != null)
      {
        // --- _multi is a keyword indicating that this field is a multi-value field
        if (columnName.toLowerCase().contains("_multi"))
        {
          // Logging.connectors.warn("adding multi value for columnName: " + columnName + " - with value: " + metadataString);
          metadataMap.get(columnName).add(metadataString);
        }
      }
      else
      {
        // Logging.connectors.warn("initializing map for columnName: " + columnName);
        // Logging.connectors.warn("adding value for columnName: " + columnName + " - with value: " + metadataString);
        ArrayList<String> newArraylist = new ArrayList<String>();
        newArraylist.add(metadataString);
        metadataMap.put(columnName, newArraylist);
      }
    }
  }
}
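// --- hypothetical example (not part of the attached code): the keywords above are
// matched as substrings of the column name, so they would normally be introduced
// through column aliases in the data query.  Table and column names here are
// invented; $(IDCOLUMN), $(URLCOLUMN), $(DATACOLUMN) and $(IDLIST) are the
// connector's usual substitution variables.
private static final String SAMPLE_DATA_QUERY =
  "SELECT p.id AS $(IDCOLUMN), p.url AS $(URLCOLUMN), p.body AS $(DATACOLUMN),"
  + " e.eye AS eyes_multi,"               // accumulated into a multi-value field
  + " p.doc_url AS attachment_filedata,"  // URL whose target is fetched and parsed with Tika
  + " p.scan AS photo_blobfile"           // BLOB column parsed with Tika
  + " FROM persons p JOIN eyes e ON e.person_id = p.id WHERE p.id IN $(IDLIST)";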
// --- utility to extract content from a file indicated by a DB field
// START OF METHODS THAT EXTRACT THE CONTENTS OF A FILE GIVEN A URL
private String extractDocumentFromUrl(String urlString) {
  // Logging.connectors.warn("url in extractDocument: " + urlString);
  URI host;
  try {
    URL url = new URL(urlString);
    host = new URI(url.getProtocol(), null, url.getHost(), url.getPort(), url.getPath(), url.getQuery(), null);
    // host = new URI(url);
    // final URI fullpath = new URI(host.getScheme(), null, host.getHost(), host.getPort(), host.getPath(), null, null);
    final InputStream stream = host.toURL().openStream();
    String extractedValue = tikaParse(stream);
    // Logging.connectors.warn("extractedValue in extractDocument: " + extractedValue);
    return extractedValue;
  } catch (URISyntaxException e) {
    Logging.connectors.warn("URISyntaxException in extractDocument: " + e.getMessage());
    e.printStackTrace();
  } catch (IOException e) {
    Logging.connectors.warn("IOException in extractDocument: " + e.getMessage());
    e.printStackTrace();
  } catch (TikaTransformerException e) {
    Logging.connectors.warn("TikaTransformerException in extractDocument: " + e.getMessage());
    e.printStackTrace();
  }
  return "";
}
// --- method inserted to extract content from a file
private String tikaParse(final InputStream stream) throws TikaTransformerException {
  final StringWriter stringWriter = new StringWriter();
  final BufferedWriter buf = new BufferedWriter(stringWriter);
  final Parser parser = new AutoDetectParser();
  final WriteOutContentHandler handler = new WriteOutContentHandler(buf);
  try {
    parser.parse(stream, handler, new Metadata(), new ParseContext());
    buf.flush();
  }
  catch (IOException e) {
    throw new TikaTransformerException("I/O error", e);
  }
  catch (SAXException e) {
    throw new TikaTransformerException("SAX error", e);
  }
  catch (TikaException e) {
    throw new TikaTransformerException("Tika parsing error", e);
  }
  catch (IllegalStateException e) {
    throw new TikaTransformerException("Parsing error not handled by Tika", e);
  }
  catch (IllegalArgumentException e) {
    throw new TikaTransformerException("Parsing error not handled by Tika", e);
  }
  catch (Throwable e) {
    throw new TikaTransformerException("Error not handled by Tika", e);
  }
  finally {
    try {
      stream.close();
    } catch (IOException e) {
      throw new TikaTransformerException("Could not close the stream");
    }
  }
  final String res = stringWriter.toString();
  // if (!"".equals(res))
  //   ParsedLogger.incrementParsed();
  return res;
}