Re: Modify job to add excludes files and directory

Karl Wright Tue, 13 Mar 2018 13:57:12 -0700

I created a ticket (CONNECTORS-1499) and attached a patch that uses the
more detailed format in all situations where hash order could affect
things.  If you apply the patch, you should definitely see a difference in
the JSON output when you dump a job in JSON format.  You will still need to
learn to use the order-preserving format when generating your own JSON.
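
As a rough sketch of what generating that format from a script might look
like (Python, since that is what the script in this thread uses): the
`_children_` and `_type_` key names are taken from the comment in the
toJSON() code quoted below, and the attribute names from the documentation
example quoted further down.  The helper names `rule` and `startpoint` are
made up for illustration, and none of this has been checked against a live
ManifoldCF instance, so compare the result with the JSON you get by
exporting a hand-ordered job.

```python
import json


def rule(kind, target, pattern):
    """Build one rule as a typed child node (kind is 'include' or 'exclude').

    Assumption: '_type_' carries the child type, as in the toJSON() comment.
    """
    return {"_type_": kind, "_attribute_type": target, "_attribute_match": pattern}


def startpoint(path, rules):
    """Build a start point whose rules sit in one '_children_' array.

    JSON arrays are ordered, so the rule order survives parsing even if the
    parser reorders object keys.
    """
    return {"_attribute_path": path, "_children_": list(rules)}


# Hypothetical specification: excludes first, then a catch-all include.
spec = {
    "startpoint": [
        startpoint(
            "c:\\path_to_files",
            [
                rule("exclude", "file", "*.mov"),
                rule("include", "file", "*"),
                rule("include", "directory", "*"),
            ],
        )
    ]
}

payload = json.dumps(spec)
print(payload)
```

Because every rule lives in a single array instead of under separate
"include" and "exclude" keys, a parser that iterates keys in hash order has
nothing to reorder.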


Thanks,
Karl


On Tue, Mar 13, 2018 at 4:33 PM, Karl Wright <daddy...@gmail.com> wrote:

> Right, so the new org.json.simple JSON parser uses hash order for keys.
> That scrambles their order on reading.  So unless you intermingle includes
> and excludes within a start point, you are currently at risk of getting the
> order switched on you.
>
> There's a clean-room implementation of the old JSON parser available now;
> I'll have to look into going back to it.  But for now I'm going to change
> how output is done so that it only uses arrays if there's a single child
> type possible.
>
> Karl
>
>
> On Tue, Mar 13, 2018 at 4:19 PM, Karl Wright <daddy...@gmail.com> wrote:
>
>> The code has two ways of representing the same thing in JSON.  One way
>> collapses similar child types into arrays.  The other way (which is used
>> when it's determined that the first way won't maintain order) is quite
>> different.  Please see the following code:
>>
>> >>>>>>
>>   /** Get as JSON.
>>   *@return the json corresponding to this Configuration.
>>   */
>>   public String toJSON()
>>     throws ManifoldCFException
>>   {
>>     JSONWriter writer = new JSONWriter();
>>     writer.startObject();
>>     // We do NOT use the root node label, unlike XML.
>>
>>     // Now, do children.  To get the arrays right, we need to glue together
>>     // all children with the same type, which requires us to do an
>>     // appropriate pass to gather that stuff together.
>>     // Since we also need to maintain order, it is essential that we detect
>>     // the out-of-order condition properly, and use an alternate
>>     // representation if we should find it.
>>     Map<String,List<ConfigurationNode>> childMap = new HashMap<String,List<ConfigurationNode>>();
>>     List<String> childList = new ArrayList<String>();
>>     String lastChildType = null;
>>     boolean needAlternate = false;
>>     int i = 0;
>>     while (i < getChildCount())
>>     {
>>       ConfigurationNode child = findChild(i++);
>>       String key = child.getType();
>>       List<ConfigurationNode> list = childMap.get(key);
>>       if (list == null)
>>       {
>>         list = new ArrayList<ConfigurationNode>();
>>         childMap.put(key,list);
>>         childList.add(key);
>>       }
>>       else
>>       {
>>         if (!lastChildType.equals(key))
>>         {
>>           needAlternate = true;
>>           break;
>>         }
>>       }
>>       list.add(child);
>>       lastChildType = key;
>>     }
>>
>>     if (needAlternate)
>>     {
>>       // Can't use the array representation.  We'll need to start a
>>       // _children_ object, and enumerate each child.  So, the JSON will look like:
>>       // <key>:{_attribute_<attr>:xxx,_children_:[{_type_:<child_key>, ...},{_type_:<child_key_2>, ...}, ...]}
>>       writer.key(JSON_CHILDREN);
>>       writer.startArray();
>>       i = 0;
>>       while (i < getChildCount())
>>       {
>>         ConfigurationNode child = findChild(i++);
>>         writeNode(writer,child,false,true);
>>       }
>>       writer.endArray();
>>     }
>>     else
>>     {
>>       // We can collapse child nodes to arrays and still maintain order.
>>       // The JSON will look like this:
>>       // <key>:{_attribute_<attr>:xxx,<child_key>:[stuff],<child_key_2>:[more_stuff] ...}
>>       int q = 0;
>>       while (q < childList.size())
>>       {
>>         String key = childList.get(q++);
>>         List<ConfigurationNode> list = childMap.get(key);
>>         if (list.size() > 1)
>>         {
>>           // Write it as an array
>>           writer.key(key);
>>           writer.startArray();
>>           i = 0;
>>           while (i < list.size())
>>           {
>>             ConfigurationNode child = list.get(i++);
>>             writeNode(writer,child,false,false);
>>           }
>>           writer.endArray();
>>         }
>>         else
>>         {
>>           // Write it as a singleton
>>           writeNode(writer,list.get(0),true,false);
>>         }
>>       }
>>     }
>>     writer.endObject();
>>
>>     // Convert to a string.
>>     return writer.toString();
>>   }
>> <<<<<<
>>
>> *IF* the specification from your UI-ordered rules cannot be output as the
>> array-style JSON, *THEN* the alternate representation will be used.  That
>> is why I suggested that you hand-order your example job and then output the
>> JSON, because you will see the format that will definitely preserve the
>> order.  I strongly suggest using that format to guarantee the order.
>>
>> There is a possibility that we have a bug where the ordering within types
>> is preserved, but the ordering between types is not properly preserved.
>> This is what I suspect is happening.  If true, it is because we migrated to
>> a different JSON implementation as a result of legal issues a year or two
>> back.  That's what I'm going to look at next.  But in any case you should
>> be able to use the order-guaranteed JSON format to get past your problems.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Mar 13, 2018 at 4:02 PM, Karl Wright <daddy...@gmail.com> wrote:
>>
>>> The issue is due to the mapping from XML to JSON.  Order is preserved,
>>> but only within each level.  So the includes are all in order but all
>>> includes go before all excludes, etc.  I'll have to consider how best to
>>> resolve that.
>>>
>>> Karl
>>>
>>> On Tue, Mar 13, 2018 at 3:50 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Hi Maxence,
>>>>
>>>> If you EXPORT a job that works in JSON, and then IMPORT the exported
>>>> JSON into a new job, is that job broken?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Mar 13, 2018 at 1:50 PM, msaunier <msaun...@citya.com> wrote:
>>>>
>>>>> Hello Karl,
>>>>>
>>>>>
>>>>>
>>>>> I have created 3 test cases:
>>>>>
>>>>>
>>>>>
>>>>> 1.      Job created manually (1_job_manually.json | 1_job_manually.png)
>>>>>
>>>>> 2.      Job created with the script, then reordered manually
>>>>> (2_job_mixte.json | 2_job_mixte.png)
>>>>>
>>>>> 3.      Job created with the script (3_job_script.json | 3_job_script.png)
>>>>>
>>>>>
>>>>>
>>>>> I do not see the difference.
>>>>>
>>>>>
>>>>>
>>>>> So: 1 and 2 work correctly, with the right order, but 3 has the included
>>>>> files and directories first.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Maxence
>>>>>
>>>>>
>>>>>
>>>>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>>>>> *Sent:* Monday, March 12, 2018 21:29
>>>>> *To:* user@manifoldcf.apache.org
>>>>> *Cc:* Fabien Harrang <fharr...@citya.com>; REUILLON Dominique <
>>>>> dreuil...@citya.com>
>>>>>
>>>>> *Subject:* Re: Modify job to add excludes files and directory
>>>>>
>>>>>
>>>>>
>>>>> Here is an idea.  Define your job in the UI and use the API to fetch
>>>>> the JSON for it.
>>>>>
>>>>>
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 12, 2018, 12:51 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>>
>>>>> I will need to look at this later tonight before I can respond in
>>>>> detail.
>>>>>
>>>>> The document specification part of the API uses EXACTLY the same data
>>>>> as is stored for the job.  The only difference is that the job
>>>>> specification is stored in XML, not JSON.  The converters between the two
>>>>> do preserve ordering, however.
>>>>>
>>>>>
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 12, 2018 at 12:38 PM, msaunier <msaun...@citya.com> wrote:
>>>>>
>>>>> *1 :*
>>>>>
>>>>> I have found a problem in the *file system connector* part of this
>>>>> page (I think):
>>>>> https://manifoldcf.apache.org/release/release-2.9.1/en_US/programmatic-operation.html
>>>>>
>>>>>
>>>>>
>>>>> The page shows this JSON:
>>>>>
>>>>>
>>>>>
>>>>> {"startpoint":[{"_attribute_path":"c:\path_to_files","include":[{"_attribute_type":"file","_attribute_match":"*.txt"},{"_attribute_type":"file","_attribute_match":"*.doc"\,"_attribute_type":"directory","_attribute_match":"*"],"exclude":["*.mov"]]}
>>>>>
>>>>>
>>>>>
>>>>> I think the JSON syntax is wrong. I think the correct JSON is:
>>>>>
>>>>>
>>>>>
>>>>> {"startpoint":[{"_attribute_path":"c:\\path_to_files","include":[{"_attribute_type":"file","_attribute_match":"*.txt"},{"_attribute_type":"file","_attribute_match":"*.doc","_attribute_type":"directory","_attribute_match":"*"}],"exclude":["*.mov"]}]}
>>>>>
>>>>>
>>>>>
>>>>> Corrections list :
>>>>>
>>>>> {"startpoint":[{"_attribute_path":"c:\*\*path_to_files","include":[{"_attribute_type":"file","_attribute_match":"*.txt"},{"_attribute_type":"file","_attribute_match":"*.doc"*\*,"_attribute_type":"directory","_attribute_match":"*"*}*],"exclude":["*.mov"]*}*]}
>>>>>
>>>>>
>>>>>
>>>>> However, this configuration does not work with the *Windows Share*
>>>>> connector: it gives a syntax error on the exclude.
>>>>>
>>>>>
>>>>>
>>>>> *2 :*
>>>>>
>>>>> For my problem, the JSON format is not the issue; it works. I have
>>>>> attached the JSON, generated with my Python script and my database.
>>>>> *(srvics33.json)*
>>>>>
>>>>>
>>>>>
>>>>> If I open the interface after I PUT the configuration, the included
>>>>> files come first and the excluded ones second. *(image1.png)* In my
>>>>> JSON, I put the excludes first, but they end up second.
>>>>>
>>>>> I have to go into the interface and change the order manually to
>>>>> obtain the right result. *(image2.png)*
>>>>>
>>>>>
>>>>>
>>>>> Can I set an order parameter [1-*] to place the excluded files and
>>>>> directories first?
>>>>>
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>> Maxence
>>>>>
>>>>>
>>>>>
>>>>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>>>>> *Sent:* Monday, March 12, 2018 14:38
>>>>> *To:* user@manifoldcf.apache.org
>>>>> *Cc:* Fabien Harrang <fharr...@citya.com>; REUILLON Dominique <
>>>>> dreuil...@citya.com>
>>>>> *Subject:* Re: Modify job to add excludes files and directory
>>>>>
>>>>>
>>>>>
>>>>> Hi Maxence,
>>>>>
>>>>>
>>>>>
>>>>> You can have as many clauses in your JSON rule list as you like.  You
>>>>> do not need to have both include and exclude rules in each clause.  So you
>>>>> can do in the JSON precisely what you do in the UI.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 12, 2018 at 9:07 AM, msaunier <msaun...@citya.com> wrote:
>>>>>
>>>>> OK. I have read this in the documentation:
>>>>>
>>>>>
>>>>>
>>>>>  Rules are evaluated from top to bottom, and the first rule that
>>>>> matches the file name is the one that is chosen.
>>>>>
>>>>>
>>>>>
>>>>> But with the API, if I PUT a new job definition with the right order,
>>>>> ManifoldCF always puts the included documents first. If I need the
>>>>> excludes first, I cannot do that with the API definition. I have
>>>>> attached the JSON to this email.
>>>>>
>>>>>
>>>>>
>>>>> Does the API have an order parameter for the start point and the
>>>>> included and excluded files/directories?
>>>>>
>>>>>
>>>>>
>>>>> (PS: I prefer to put the excludes first and then include *, to have
>>>>> full control over the GED (document management system) and keep an eye
>>>>> on the documents.)
>>>>>
>>>>> (PS2: I generate this JSON and send it with a Python script, and that
>>>>> works well.)
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>>>>> *Sent:* Friday, March 9, 2018 12:53
>>>>> *To:* user@manifoldcf.apache.org
>>>>> *Cc:* Fabien Harrang <fharr...@citya.com>; REUILLON Dominique <
>>>>> dreuil...@citya.com>
>>>>> *Subject:* Re: Modify job to add excludes files and directory
>>>>>
>>>>>
>>>>>
>>>>> Hi Maxence,
>>>>>
>>>>>
>>>>>
>>>>> In the middle of a job run, if you change the specification of what
>>>>> documents are included and excluded, the implementation of the connector
>>>>> determines how it will behave.  There is no guarantee that documents that
>>>>> are excluded will be removed, for example if the connector filters
>>>>> documents only when they are queued.  You may need to run the job a second
>>>>> time to be sure everything is removed.
>>>>>
>>>>> So the official answer is that "it depends".
>>>>>
>>>>>
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 9, 2018 at 5:38 AM, msaunier <msaun...@citya.com> wrote:
>>>>>
>>>>> Hello Karl,
>>>>>
>>>>>
>>>>>
>>>>> If I add new files and directories to exclude to a live job, does
>>>>> ManifoldCF delete the previously indexed files that match these
>>>>> exclusions? Or do I need to reseed all of my documents?
>>>>>
>>>>>
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>>
>>>>> Maxence SAUNIER
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
