Re: Modify job to add excludes files and directory

Karl Wright Tue, 13 Mar 2018 13:34:03 -0700

Right, so the new json-simple (org.json.simple) JSON parser uses hash order
for keys, which scrambles their order on reading.  So unless you intermingle
includes and excludes within a start point (which forces the alternate,
order-preserving representation), you are currently at risk of getting the
order switched on you.
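
To illustrate when you are exposed: the grouped, array-style output is
chosen whenever the rule types of a start point are not intermingled, and
that is the form whose keys can come back reordered.  A minimal sketch of
that check, modeled on the toJSON() code quoted below but reduced to a plain
list of type names (the class and method names are made up for illustration
and are not part of ManifoldCF):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ChildOrderCheck {
  /**
   * Returns true when a child type reappears after a different type has
   * been seen, i.e. the types are intermingled (include, exclude, include).
   * In that case the grouped array form cannot preserve the order.
   */
  static boolean needsAlternate(List<String> childTypes) {
    Set<String> seen = new HashSet<>();
    String last = null;
    for (String type : childTypes) {
      // Already seen, but not as the immediately preceding child
      if (!seen.add(type) && !type.equals(last)) {
        return true;
      }
      last = type;
    }
    return false;
  }

  public static void main(String[] args) {
    // Excludes first, then includes: types are not intermingled
    System.out.println(needsAlternate(Arrays.asList("exclude", "exclude", "include")));
    // Include, exclude, include: types are intermingled
    System.out.println(needsAlternate(Arrays.asList("include", "exclude", "include")));
  }
}
```

The first list yields false, so it would be written in the grouped form and
is the case at risk; the second yields true and would get the
order-preserving form.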


There's a clean-room implementation of the old JSON parser available now;
I'll have to look into going back to it.  But for now I'm going to change
how output is done so that it only uses arrays if there's a single child
type possible.
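
In the meantime, the order-preserving form described in the code comment
quoted below keeps all rules of a start point in one _children_ array, with
each rule carrying its own _type_.  A rough sketch of what such a start
point could look like, built by hand; the helper names are hypothetical and
the exact attribute layout is an assumption based on that comment, so
compare it with the JSON exported from a hand-ordered job before relying on
it:

```java
import java.util.Arrays;
import java.util.List;

final class OrderedSpecSketch {
  /** Minimal JSON string escaping: backslash and double quote only. */
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  /**
   * Renders one start point with its rules listed in a single _children_
   * array, so rule order is carried by array position, not by key order.
   * Each rule is {ruleKind, targetType, matchPattern}.
   */
  static String startpoint(String path, List<String[]> rules) {
    StringBuilder sb = new StringBuilder();
    sb.append("{").append(quote("_attribute_path")).append(":").append(quote(path));
    sb.append(",").append(quote("_children_")).append(":[");
    for (int i = 0; i < rules.size(); i++) {
      String[] r = rules.get(i);
      if (i > 0) {
        sb.append(",");
      }
      sb.append("{").append(quote("_type_")).append(":").append(quote(r[0]));
      sb.append(",").append(quote("_attribute_type")).append(":").append(quote(r[1]));
      sb.append(",").append(quote("_attribute_match")).append(":").append(quote(r[2]));
      sb.append("}");
    }
    return sb.append("]}").toString();
  }

  public static void main(String[] args) {
    // Exclude rule listed first, followed by catch-all includes
    List<String[]> rules = Arrays.asList(
        new String[] {"exclude", "file", "*.mov"},
        new String[] {"include", "file", "*"},
        new String[] {"include", "directory", "*"});
    System.out.println(startpoint("c:\\path_to_files", rules));
  }
}
```

Because the order is carried by array position, a parser that reorders
object keys cannot move the exclude rule after the includes.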

Karl


On Tue, Mar 13, 2018 at 4:19 PM, Karl Wright <[email protected]> wrote:

> The code has two ways of representing the same thing in JSON.  One way
> collapses similar child types into arrays.  The other way (which is used
> when it's determined that the first way won't maintain order) is quite
> different.  Please see the following code:
>
> >>>>>>
>   /** Get as JSON.
>   *@return the json corresponding to this Configuration.
>   */
>   public String toJSON()
>     throws ManifoldCFException
>   {
>     JSONWriter writer = new JSONWriter();
>     writer.startObject();
>     // We do NOT use the root node label, unlike XML.
>
>     // Now, do children.  To get the arrays right, we need to glue
>     // together all children with the same type, which requires us to do
>     // an appropriate pass to gather that stuff together.
>     // Since we also need to maintain order, it is essential that we
>     // detect the out-of-order condition properly, and use an alternate
>     // representation if we should find it.
>     Map<String,List<ConfigurationNode>> childMap =
>       new HashMap<String,List<ConfigurationNode>>();
>     List<String> childList = new ArrayList<String>();
>     String lastChildType = null;
>     boolean needAlternate = false;
>     int i = 0;
>     while (i < getChildCount())
>     {
>       ConfigurationNode child = findChild(i++);
>       String key = child.getType();
>       List<ConfigurationNode> list = childMap.get(key);
>       if (list == null)
>       {
>         list = new ArrayList<ConfigurationNode>();
>         childMap.put(key,list);
>         childList.add(key);
>       }
>       else
>       {
>         if (!lastChildType.equals(key))
>         {
>           needAlternate = true;
>           break;
>         }
>       }
>       list.add(child);
>       lastChildType = key;
>     }
>
>     if (needAlternate)
>     {
>       // Can't use the array representation.  We'll need to start a
>       // _children_ object, and enumerate each child.  So, the JSON will
>       // look like:
>       // <key>:{_attribute_<attr>:xxx,_children_:[{_type_:<child_key>, ...},
>       //   {_type_:<child_key_2>, ...}, ...]}
>       writer.key(JSON_CHILDREN);
>       writer.startArray();
>       i = 0;
>       while (i < getChildCount())
>       {
>         ConfigurationNode child = findChild(i++);
>         writeNode(writer,child,false,true);
>       }
>       writer.endArray();
>     }
>     else
>     {
>       // We can collapse child nodes to arrays and still maintain order.
>       // The JSON will look like this:
>       // <key>:{_attribute_<attr>:xxx,<child_key>:[stuff],
>       //   <child_key_2>:[more_stuff] ...}
>       int q = 0;
>       while (q < childList.size())
>       {
>         String key = childList.get(q++);
>         List<ConfigurationNode> list = childMap.get(key);
>         if (list.size() > 1)
>         {
>           // Write it as an array
>           writer.key(key);
>           writer.startArray();
>           i = 0;
>           while (i < list.size())
>           {
>             ConfigurationNode child = list.get(i++);
>             writeNode(writer,child,false,false);
>           }
>           writer.endArray();
>         }
>         else
>         {
>           // Write it as a singleton
>           writeNode(writer,list.get(0),true,false);
>         }
>       }
>     }
>     writer.endObject();
>
>     // Convert to a string.
>     return writer.toString();
>   }
> <<<<<<
>
> *IF* the specification from your UI-ordered rules cannot be output as the
> array-style JSON, *THEN* the alternate representation will be used.  That
> is why I suggested that you hand-order your example job and then output the
> JSON, because you will see the format that will definitely preserve the
> order.  I strongly suggest using that format to guarantee the order.
>
> There is a possibility that we have a bug where the ordering within types
> is preserved, but the ordering between types is not properly preserved.
> This is what I suspect is happening.  If true, it is because we migrated to
> a different JSON implementation as a result of legal issues a year or two
> back.  That's what I'm going to look at next.  But in any case you should
> be able to use the order-guaranteed JSON format to get past your problems.
>
> Thanks,
> Karl
>
>
> On Tue, Mar 13, 2018 at 4:02 PM, Karl Wright <[email protected]> wrote:
>
>> The issue is due to the mapping from XML to JSON.  Order is preserved,
>> but only within each level.  So the includes are all in order but all
>> includes go before all excludes, etc.  I'll have to consider how best to
>> resolve that.
>>
>> Karl
>>
>> On Tue, Mar 13, 2018 at 3:50 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Maxence,
>>>
>>> If you EXPORT a job that works in JSON, and then IMPORT the exported
>>> JSON into a new job, is that job broken?
>>>
>>> Karl
>>>
>>>
>>> On Tue, Mar 13, 2018 at 1:50 PM, msaunier <[email protected]> wrote:
>>>
>>>> Hello Karl,
>>>>
>>>> I have created three test cases:
>>>>
>>>> 1. Job created manually (1_job_manually.json | 1_job_manually.png)
>>>> 2. Job created with the script, then reordered manually
>>>>    (2_job_mixte.json | 2_job_mixte.png)
>>>> 3. Job created with the script only (3_job_script.json |
>>>>    3_job_script.png)
>>>>
>>>> I do not see the difference between them.
>>>>
>>>> Result: 1 and 2 work correctly, with the right order, but 3 puts the
>>>> included files and directories first.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Maxence
>>>>
>>>>
>>>>
>>>> *From:* Karl Wright [mailto:[email protected]]
>>>> *Sent:* Monday, 12 March 2018 21:29
>>>> *To:* [email protected]
>>>> *Cc:* Fabien Harrang <[email protected]>; REUILLON Dominique <
>>>> [email protected]>
>>>>
>>>> *Subject:* Re: Modify job to add excludes files and directory
>>>>
>>>>
>>>>
>>>> Here is an idea.  Define your job in the UI and use the API to fetch
>>>> the JSON for it.
>>>>
>>>>
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Mon, Mar 12, 2018, 12:51 PM Karl Wright <[email protected]> wrote:
>>>>
>>>> I will need to look at this later tonight before I can respond in
>>>> detail.
>>>>
>>>> The document specification part of the API uses EXACTLY the same data
>>>> as is stored for the job.  The only difference is that the job
>>>> specification is stored in XML, not JSON.  The converters between the two
>>>> do preserve ordering, however.
>>>>
>>>>
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Mar 12, 2018 at 12:38 PM, msaunier <[email protected]> wrote:
>>>>
>>>> *1:*
>>>>
>>>> I think I have found a problem in the *file system connector* section
>>>> of this page:
>>>> https://manifoldcf.apache.org/release/release-2.9.1/en_US/programmatic-operation.html
>>>>
>>>> The page shows this JSON:
>>>>
>>>> {"startpoint":[{"_attribute_path":"c:\path_to_files",
>>>> "include":[{"_attribute_type":"file","_attribute_match":"*.txt"},
>>>> {"_attribute_type":"file","_attribute_match":"*.doc"\,
>>>> "_attribute_type":"directory","_attribute_match":"*"],
>>>> "exclude":["*.mov"]]}
>>>>
>>>> I think the JSON syntax is wrong. I think the correct JSON is:
>>>>
>>>> {"startpoint":[{"_attribute_path":"c:\\path_to_files",
>>>> "include":[{"_attribute_type":"file","_attribute_match":"*.txt"},
>>>> {"_attribute_type":"file","_attribute_match":"*.doc",
>>>> "_attribute_type":"directory","_attribute_match":"*"}],
>>>> "exclude":["*.mov"]}]}
>>>>
>>>> List of corrections:
>>>>
>>>> {"startpoint":[{"_attribute_path":"c:\*\*path_to_files",
>>>> "include":[{"_attribute_type":"file","_attribute_match":"*.txt"},
>>>> {"_attribute_type":"file","_attribute_match":"*.doc"*\*,
>>>> "_attribute_type":"directory","_attribute_match":"*"*}*],
>>>> "exclude":["*.mov"]*}*]}
>>>>
>>>> However, this configuration does not work with the *Windows Share*
>>>> connector. There is a syntax error on the exclude.
>>>>
>>>>
>>>>
>>>> *2:*
>>>>
>>>> For my own problem, the JSON format is not the issue; it works. I have
>>>> attached the JSON generated by my Python script from my database.
>>>> *(srvics33.json)*
>>>>
>>>> When I open the interface after the PUT of the configuration, the
>>>> included files come first and the excluded ones second. *(image1.png)*
>>>> In my JSON I put the excludes first, but they end up second.
>>>>
>>>> I am forced to go into the interface and reorder them manually to
>>>> obtain the correct result. *(image2.png)*
>>>>
>>>> Can I set an order parameter [1-*] to place the excluded files and
>>>> directories first?
>>>>
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>> Maxence
>>>>
>>>>
>>>>
>>>> *From:* Karl Wright [mailto:[email protected]]
>>>> *Sent:* Monday, 12 March 2018 14:38
>>>> *To:* [email protected]
>>>> *Cc:* Fabien Harrang <[email protected]>; REUILLON Dominique <
>>>> [email protected]>
>>>> *Subject:* Re: Modify job to add excludes files and directory
>>>>
>>>>
>>>>
>>>> Hi Maxence,
>>>>
>>>>
>>>>
>>>> You can have as many clauses in your JSON rule list as you like.  You
>>>> do not need to have both include and exclude rules in each clause.  So you
>>>> can precisely do in the JSON what you do in the UI.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Mar 12, 2018 at 9:07 AM, msaunier <[email protected]> wrote:
>>>>
>>>> OK. I have read this in the documentation:
>>>>
>>>>  Rules are evaluated from top to bottom, and the first rule that
>>>> matches the file name is the one that is chosen.
>>>>
>>>> But with the API, if I PUT a new job definition with the right order,
>>>> ManifoldCF always puts the included documents first. If I need the
>>>> excludes first, I cannot do that through the API definition. I have
>>>> attached the JSON to this email.
>>>>
>>>> Does the API have an order parameter for the start point and the
>>>> included and excluded files/directories?
>>>>
>>>> (PS: I prefer to exclude first and then include *, to keep full control
>>>> over the document management system (GED) and keep an eye on its
>>>> documents.)
>>>>
>>>> (PS2: I generate this JSON and send it with a Python script, and that
>>>> works fine.)
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>> *From:* Karl Wright [mailto:[email protected]]
>>>> *Sent:* Friday, 9 March 2018 12:53
>>>> *To:* [email protected]
>>>> *Cc:* Fabien Harrang <[email protected]>; REUILLON Dominique <
>>>> [email protected]>
>>>> *Subject:* Re: Modify job to add excludes files and directory
>>>>
>>>>
>>>>
>>>> Hi Maxence,
>>>>
>>>>
>>>>
>>>> In the middle of a job run, if you change the specification of what
>>>> documents are included and excluded, the implementation of the connector
>>>> determines how it will behave.  There is no guarantee that documents that
>>>> are excluded will be removed, for example if the connector filters
>>>> documents only when they are queued.  You may need to run the job a second
>>>> time to be sure everything is removed.
>>>>
>>>> So the official answer is that "it depends".
>>>>
>>>>
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Mar 9, 2018 at 5:38 AM, msaunier <[email protected]> wrote:
>>>>
>>>> Hello Karl,
>>>>
>>>>
>>>>
>>>> If I add new files and directories to exclude on a live job, will
>>>> ManifoldCF delete the previously indexed files that match these
>>>> exclusions? Or do I need to reseed all of my documents?
>>>>
>>>> Thank you.
>>>>
>>>>
>>>>
>>>> Maxence SAUNIER
>>>
>>
>
