Re: [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-07 Thread Jeff Hostetler



On 11/2/2017 3:32 PM, Jonathan Tan wrote:

On Thu,  2 Nov 2017 17:50:11 +
Jeff Hostetler  wrote:


+   if (skip_prefix(v0, "oid=", )) {
+   filter_options->choice = LOFC_SPARSE_OID;
+   if (!get_oid_with_context(v1, GET_OID_BLOB,
+ _oid, )) {
+   /*
+* We successfully converted the 
+* into an actual OID.  Rewrite the raw_value
+* in canonoical form with just the OID.
+* (If we send this request to the server, we
+* want an absolute expression rather than a
+* local-ref-relative expression.)
+*/


I think this would lead to confusing behavior - for example, a fetch
with "--filter=oid=mybranch:sparseconfig" would have different results
depending on whether "mybranch" refers to a valid object locally.

The way I see it, this should either (i) only accept full 40-character
OIDs or (ii) retain the raw string to be interpreted only when the
filtering is done. (i) is simpler and safer, but is not so useful. In
both cases, if the user really wants client-side interpretation, they
can still use "$(git rev-parse foo)" to make it explicit.


Good point. I'll change it to always pass the original expression
so that it is evaluated wherever the filtering is actually performed.





+   free((char *)filter_options->raw_value);
+   filter_options->raw_value =
+   xstrfmt("sparse:oid=%s",
+   oid_to_hex(_oid));
+   filter_options->sparse_oid_value =
+   oiddup(_oid);
+   } else {
+   /*
+* We could not turn the  into an
+* OID.  Leave the raw_value as is in case
+* the server can parse it.  (It may refer to
+* a branch, commit, or blob we don't have.)
+*/
+   }
+   return 0;
+   }
+
+   if (skip_prefix(v0, "path=", )) {
+   filter_options->choice = LOFC_SPARSE_PATH;
+   filter_options->sparse_path_value = strdup(v1);
+   return 0;
+   }
+   }
+
+   die(_("invalid filter expression '%s'"), arg);
+   return 0;
+}


[snip]


+void arg_format_list_objects_filter(
+   struct argv_array *argv_array,
+   const struct list_objects_filter_options *filter_options)


Is this function used anywhere (in this patch or subsequent patches)?


It is used in upload-pack.c in part 3.  I'll remove it from part 1
and revisit in part 3.
 




diff --git a/list-objects-filter.c b/list-objects-filter.c
+/* See object.h and revision.h */
+#define FILTER_REVISIT (1<<25)


Looking later in the code, this flag indicates that a tree has been
SHOWN, so it might be better to just call this FILTER_SHOWN.


I'll amend this. There are already several SHOWN bits that behave
slightly differently.  I'll update and document this better.  Thanks.




[snip]


+struct frame {
+   int defval;


Document this variable?


+   int child_prov_omit : 1;


I think it's clearer if we use "unsigned" here. Also, document this
(e.g. "1 if any descendant of this tree object was provisionally
omitted").


got it. thanks.



+enum list_objects_filter_type {
+   LOFT_BEGIN_TREE,
+   LOFT_END_TREE,
+   LOFT_BLOB
+};


Optional: probably a better name would be list_objects_filter_situation.


got it. thanks.

 

+void traverse_commit_list_filtered(
+   struct list_objects_filter_options *filter_options,
+   struct rev_info *revs,
+   show_commit_fn show_commit,
+   show_object_fn show_object,
+   void *show_data,
+   struct oidset *omitted)
+{
+   filter_object_fn filter_fn = NULL;
+   filter_free_fn filter_free_fn = NULL;
+   void *filter_data = NULL;
+
+   filter_data = list_objects_filter__init(omitted, filter_options,
+   _fn, _free_fn);
+   do_traverse(revs, show_commit, show_object, show_data,
+   filter_fn, filter_data);
+   if (filter_data && filter_free_fn)
+   filter_free_fn(filter_data);
+}


This function traverse_commit_list_filtered() is in list-objects.c but
in list-objects-filter.h, if I'm reading the diff correctly?


oops.  thanks.




Overall, this looks like a good change. Object traversal was upgraded
with the behaviors of MARK_SEEN and SHOW independently controllable and
with the ability 

Re: [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-06 Thread Jonathan Tan
On Mon, 6 Nov 2017 12:51:52 -0500
Jeff Hostetler  wrote:

> Jonathan and I were talking off-list about the performance
> effects of inspecting the pathnames to identify the ".git*"
> special files. I added it in my first draft back in the spring,
> thinking that even if you set the blob-limit to a small
> number (or zero), you'd probably still always want the
> .gitattribute and .gitignore files.  But now with the addition
> of the sparse filter and functional dynamic object fetching,
> I'm not sure I see the need for this.
> 
> Also, if the primary use of the blob-limit is to filter out
> giant binary assets, it is unlikely anyone is going to have
> a 1MB+ .git* file, so it is unlikely that the is_special_file
> would include anything that wouldn't already be included by
> the size criteria.
> 
> So, if there's no objections, I think I'll remove this and
> simplify the blob-limit filter function.  (That would let me
> get rid of the provisional omit code here.)

This sounds like a good idea to me. (For the record, one of the
performance impacts of checking the filename is that bitmaps can't be
used to obtain a whitelist of what is to be packed - instead, a regular
object walk must be used.)


Re: [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-06 Thread Jeff Hostetler



On 11/2/2017 1:50 PM, Jeff Hostetler wrote:

From: Jeff Hostetler 

Create traverse_commit_list_filtered() and add filtering
interface to allow certain objects to be omitted from the
traversal.
...
diff --git a/list-objects-filter.c b/list-objects-filter.c
new file mode 100644
index 000..7f28425
--- /dev/null
+++ b/list-objects-filter.c
...
+/*
+ * A filter for list-objects to omit large blobs,
+ * but always include ".git*" special files.
+ * And to OPTIONALLY collect a list of the omitted OIDs.
+ */


Jonathan and I were talking off-list about the performance
effects of inspecting the pathnames to identify the ".git*"
special files. I added it in my first draft back in the spring,
thinking that even if you set the blob-limit to a small
number (or zero), you'd probably still always want the
.gitattribute and .gitignore files.  But now with the addition
of the sparse filter and functional dynamic object fetching,
I'm not sure I see the need for this.

Also, if the primary use of the blob-limit is to filter out
giant binary assets, it is unlikely anyone is going to have
a 1MB+ .git* file, so it is unlikely that the is_special_file
would include anything that wouldn't already be included by
the size criteria.

So, if there's no objections, I think I'll remove this and
simplify the blob-limit filter function.  (That would let me
get rid of the provisional omit code here.)

Jeff


Re: [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-03 Thread Jeff Hostetler



On 11/3/2017 7:54 AM, Johannes Schindelin wrote:

Hi Jonathan,

On Thu, 2 Nov 2017, Jonathan Tan wrote:


On Thu,  2 Nov 2017 17:50:11 +
Jeff Hostetler  wrote:


+int parse_list_objects_filter(struct list_objects_filter_options 
*filter_options,
+ const char *arg)


Returning void is fine, I think. It seems that all your code paths
either return 0 or die.


Can we please start to encourage libified code, rather than discourage it?



I did that so that I could call it from the opt_parse_... version below
it that is used by the OPT_ macros.

And Johannes is right, it bothers me that there doesn't seem to be a hard
line where one should or should not call die() vs returning an error code.

Jeff



Re: [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-03 Thread Johannes Schindelin
Hi Jonathan,

On Thu, 2 Nov 2017, Jonathan Tan wrote:

> On Thu,  2 Nov 2017 17:50:11 +
> Jeff Hostetler  wrote:
> 
> > +int parse_list_objects_filter(struct list_objects_filter_options 
> > *filter_options,
> > + const char *arg)
> 
> Returning void is fine, I think. It seems that all your code paths
> either return 0 or die.

Can we please start to encourage libified code, rather than discourage it?

Thank you,
Dscho


Re: [PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-02 Thread Jonathan Tan
On Thu,  2 Nov 2017 17:50:11 +
Jeff Hostetler  wrote:

> +int parse_list_objects_filter(struct list_objects_filter_options 
> *filter_options,
> +   const char *arg)

Returning void is fine, I think. It seems that all your code paths
either return 0 or die.

> +{
> + struct object_context oc;
> + struct object_id sparse_oid;
> + const char *v0;
> + const char *v1;
> +
> + if (filter_options->choice)
> + die(_("multiple object filter types cannot be combined"));
> +
> + /*
> +  * TODO consider rejecting 'arg' if it contains any
> +  * TODO injection characters (since we might send this
> +  * TODO to a sub-command or to the server and we don't
> +  * TODO want to deal with legacy quoting/escaping for
> +  * TODO a new feature).
> +  */
> +
> + filter_options->raw_value = strdup(arg);
> +
> + if (skip_prefix(arg, "blob:", ) || skip_prefix(arg, "blobs:", )) {
> + if (!strcmp(v0, "none")) {
> + filter_options->choice = LOFC_BLOB_NONE;
> + return 0;
> + }
> +
> + if (skip_prefix(v0, "limit=", ) &&
> + git_parse_ulong(v1, _options->blob_limit_value)) {
> + filter_options->choice = LOFC_BLOB_LIMIT;
> + return 0;
> + }
> + }
> + else if (skip_prefix(arg, "sparse:", )) {

Style: join the 2 lines above.

> + if (skip_prefix(v0, "oid=", )) {
> + filter_options->choice = LOFC_SPARSE_OID;
> + if (!get_oid_with_context(v1, GET_OID_BLOB,
> +   _oid, )) {
> + /*
> +  * We successfully converted the 
> +  * into an actual OID.  Rewrite the raw_value
> +  * in canonoical form with just the OID.
> +  * (If we send this request to the server, we
> +  * want an absolute expression rather than a
> +  * local-ref-relative expression.)
> +  */

I think this would lead to confusing behavior - for example, a fetch
with "--filter=oid=mybranch:sparseconfig" would have different results
depending on whether "mybranch" refers to a valid object locally.

The way I see it, this should either (i) only accept full 40-character
OIDs or (ii) retain the raw string to be interpreted only when the
filtering is done. (i) is simpler and safer, but is not so useful. In
both cases, if the user really wants client-side interpretation, they
can still use "$(git rev-parse foo)" to make it explicit.

> + free((char *)filter_options->raw_value);
> + filter_options->raw_value =
> + xstrfmt("sparse:oid=%s",
> + oid_to_hex(_oid));
> + filter_options->sparse_oid_value =
> + oiddup(_oid);
> + } else {
> + /*
> +  * We could not turn the  into an
> +  * OID.  Leave the raw_value as is in case
> +  * the server can parse it.  (It may refer to
> +  * a branch, commit, or blob we don't have.)
> +  */
> + }
> + return 0;
> + }
> +
> + if (skip_prefix(v0, "path=", )) {
> + filter_options->choice = LOFC_SPARSE_PATH;
> + filter_options->sparse_path_value = strdup(v1);
> + return 0;
> + }
> + }
> +
> + die(_("invalid filter expression '%s'"), arg);
> + return 0;
> +}

[snip]

> +void arg_format_list_objects_filter(
> + struct argv_array *argv_array,
> + const struct list_objects_filter_options *filter_options)

Is this function used anywhere (in this patch or subsequent patches)?

> diff --git a/list-objects-filter.c b/list-objects-filter.c
> +/* See object.h and revision.h */
> +#define FILTER_REVISIT (1<<25)

Looking later in the code, this flag indicates that a tree has been
SHOWN, so it might be better to just call this FILTER_SHOWN.

Also document this flag in object.h in the table above FLAG_BITS.

> +static enum list_objects_filter_result filter_blobs_limit(
> + enum list_objects_filter_type filter_type,
> + struct object *obj,
> + const char *pathname,
> + const char *filename,
> + void *filter_data_)
> +{
> + struct filter_blobs_limit_data *filter_data = filter_data_;
> + unsigned long object_length;
> + enum object_type t;
> + int is_special_filename;
> +
> + switch (filter_type) {
> + default:
> +  

[PATCH v2 4/6] list-objects: filter objects in traverse_commit_list

2017-11-02 Thread Jeff Hostetler
From: Jeff Hostetler 

Create traverse_commit_list_filtered() and add filtering
interface to allow certain objects to be omitted from the
traversal.

Update traverse_commit_list() to be a wrapper for the above
with a null filter to minimize the number of callers that
needed to be changed.

Object filtering will be used in a future commit by rev-list
and pack-objects for partial clone and fetch to omit unwanted
objects from the result.

traverse_bitmap_commit_list() does not work with filtering.

If a packfile bitmap is present, it will not be used.

Signed-off-by: Jeff Hostetler 
---
 Makefile  |   2 +
 list-objects-filter-options.c | 119 
 list-objects-filter-options.h |  55 ++
 list-objects-filter.c | 408 ++
 list-objects-filter.h |  84 +
 list-objects.c|  95 --
 list-objects.h|   2 +-
 7 files changed, 748 insertions(+), 17 deletions(-)
 create mode 100644 list-objects-filter-options.c
 create mode 100644 list-objects-filter-options.h
 create mode 100644 list-objects-filter.c
 create mode 100644 list-objects-filter.h

diff --git a/Makefile b/Makefile
index cd75985..ca378a4 100644
--- a/Makefile
+++ b/Makefile
@@ -807,6 +807,8 @@ LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
 LIB_OBJS += list-objects.o
+LIB_OBJS += list-objects-filter.o
+LIB_OBJS += list-objects-filter-options.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
 LIB_OBJS += log-tree.o
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
new file mode 100644
index 000..31255e7
--- /dev/null
+++ b/list-objects-filter-options.c
@@ -0,0 +1,119 @@
+#include "cache.h"
+#include "commit.h"
+#include "config.h"
+#include "revision.h"
+#include "argv-array.h"
+#include "list-objects.h"
+#include "list-objects-filter.h"
+#include "list-objects-filter-options.h"
+
+/*
+ * Parse value of the argument to the "filter" keword.
+ * On the command line this looks like:
+ *   --filter=
+ * and in the pack protocol as:
+ *   "filter" SP 
+ *
+ *  ::= blob:none
+ *   blob:limit=[kmg]
+ *   sparse:oid=
+ *   sparse:path=
+ */
+int parse_list_objects_filter(struct list_objects_filter_options 
*filter_options,
+ const char *arg)
+{
+   struct object_context oc;
+   struct object_id sparse_oid;
+   const char *v0;
+   const char *v1;
+
+   if (filter_options->choice)
+   die(_("multiple object filter types cannot be combined"));
+
+   /*
+* TODO consider rejecting 'arg' if it contains any
+* TODO injection characters (since we might send this
+* TODO to a sub-command or to the server and we don't
+* TODO want to deal with legacy quoting/escaping for
+* TODO a new feature).
+*/
+
+   filter_options->raw_value = strdup(arg);
+
+   if (skip_prefix(arg, "blob:", ) || skip_prefix(arg, "blobs:", )) {
+   if (!strcmp(v0, "none")) {
+   filter_options->choice = LOFC_BLOB_NONE;
+   return 0;
+   }
+
+   if (skip_prefix(v0, "limit=", ) &&
+   git_parse_ulong(v1, _options->blob_limit_value)) {
+   filter_options->choice = LOFC_BLOB_LIMIT;
+   return 0;
+   }
+   }
+   else if (skip_prefix(arg, "sparse:", )) {
+   if (skip_prefix(v0, "oid=", )) {
+   filter_options->choice = LOFC_SPARSE_OID;
+   if (!get_oid_with_context(v1, GET_OID_BLOB,
+ _oid, )) {
+   /*
+* We successfully converted the 
+* into an actual OID.  Rewrite the raw_value
+* in canonoical form with just the OID.
+* (If we send this request to the server, we
+* want an absolute expression rather than a
+* local-ref-relative expression.)
+*/
+   free((char *)filter_options->raw_value);
+   filter_options->raw_value =
+   xstrfmt("sparse:oid=%s",
+   oid_to_hex(_oid));
+   filter_options->sparse_oid_value =
+   oiddup(_oid);
+   } else {
+   /*
+* We could not turn the  into an
+* OID.  Leave the raw_value as is in case
+* the server can parse it.  (It may refer to
+