Re: [RFC] Tokenizer API

2013-12-10 Thread Amos Jeffries
On 10/12/2013 9:38 p.m., Robert Collins wrote:
 On 10 December 2013 19:13, Amos Jeffries squ...@treenet.co.nz wrote:
 
 The problem with comparing input strings to a SBuf of characters is that
 parsing an input of length N against a charset of size M takes O(N*M) time.
 
 Huh? There are linear time parsers with PEGs. Or maybe I don't
 understand one of your preconditions to come up with an N*M complexity
 here.

The brute-force way is to scan each of the N positions for the M delimiters
individually.
 -- SBuf uses while(0..N) {memchr(M)} ... O(N*M)


PS. I suspect the linear-time parsers you know of are converting the PEG
into a bool-array first before doing a linear-time scan with it, or using
multiple CPU threads to do a parallel linear scan on small M.
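To make the complexity claims above concrete, the two scans can be sketched as follows (illustrative C++ with std::string standing in for SBuf; function names are not Squid code). The per-position memchr() scan costs O(N*M), while precomputing a 256-entry bool-array reduces it to O(N+M):

```cpp
#include <cstring>
#include <string>

// O(N*M): for each of the N input positions, scan the whole M-char set.
size_t findFirstOfBrute(const std::string &in, const std::string &delims) {
    for (size_t i = 0; i < in.size(); ++i) {
        if (std::memchr(delims.data(), in[i], delims.size())) // O(M) per position
            return i;
    }
    return std::string::npos;
}

// O(N+M): build the bool-array once, then do a single linear scan.
size_t findFirstOfTable(const std::string &in, const std::string &delims) {
    bool member[256] = {false};
    for (size_t j = 0; j < delims.size(); ++j) // O(M) setup
        member[static_cast<unsigned char>(delims[j])] = true;
    for (size_t i = 0; i < in.size(); ++i)     // O(N) scan
        if (member[static_cast<unsigned char>(in[i])])
            return i;
    return std::string::npos;
}
```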

Amos



Re: [RFC] Tokenizer API

2013-12-10 Thread Kinkie
I'd need to check.
To align the others to our IRC discussion: I'm fine with the design
Alex and you are suggesting.
My only suggestions are to have remaining() return SBuf instead of
const SBuf &, and to have a few predefined CharacterSets.


On Tue, Dec 10, 2013 at 10:52 AM, Amos Jeffries squ...@treenet.co.nz wrote:
 On 10/12/2013 9:38 p.m., Robert Collins wrote:
 On 10 December 2013 19:13, Amos Jeffries squ...@treenet.co.nz wrote:

 The problem with comparing input strings to a SBuf of characters is that
 parsing an input of length N against a charset of size M takes O(N*M) time.

 Huh? There are linear time parsers with PEGs. Or maybe I don't
 understand one of your preconditions to come up with an N*M complexity
 here.

 The brute-force way is to scan each of the N positions for the M delimiters
 individually.
  -- SBuf uses while(0..N) {memchr(M)} ... O(N*M)


 PS. I suspect the linear-time parsers you know of are converting the PEG
 into a bool-array first before doing a linear-time scan with it, or using
 multiple CPU threads to do a parallel linear scan on small M.

 Amos




-- 
/kinkie


Re: [RFC] Tokenizer API

2013-12-10 Thread Robert Collins
On 10 December 2013 19:13, Amos Jeffries squ...@treenet.co.nz wrote:

 The problem with comparing input strings to a SBuf of characters is that
 parsing an input of length N against a charset of size M takes O(N*M) time.

Huh? There are linear time parsers with PEGs. Or maybe I don't
understand one of your preconditions to come up with an N*M complexity
here.

-Rob



-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud


Re: [RFC] Tokenizer API

2013-12-10 Thread Alex Rousskov
On 12/09/2013 10:46 PM, Francesco Chemolli wrote:

 My suggestion is to have CharacterSet be a SBuf and
 rely on them, at least for now. In any case having them be a SBuf
 promotes better interface decoupling and abstraction.

CharacterSet is a set of characters, which is semantically very
different from a string. It would be overall wrong to abuse SBuf for
that purpose. It should either start simple (as I suggested) or be
immediately optimized into a dedicated array-based class (as Amos proposed).

 Oh, one more argument for having the low-level matching primitives in
 SBuf: it's a pet peeve of mine to use some form of compact tries
 and/or FSM to do single-pass low-level string matching in SBuf

The purpose of Tokenizer is to help transform input into tokens using
simple rules. There is absolutely nothing in that design that prevents
someone from adding more matching methods to SBuf or optimizing existing
ones. Tokenizer can call those new/optimized methods as needed. That
does not lead to code duplication or even performance overheads.

On the other hand, it would be very wrong for the caller to use raw SBuf
methods to tokenize input. Tokenization is the Tokenizer's job, not SBuf's.
Please let me know if I should give more arguments explaining the
difference and supporting this design.
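As an illustration of that division of labor, here is a minimal standalone sketch (std::string stands in for SBuf, and the method names are placeholders, not the actual API under review): CharacterSet knows membership, Tokenizer owns the parsing loop, and the caller never pokes at the buffer's low-level matching directly.

```cpp
#include <string>

// CharacterSet: a 256-entry membership table built from a C-string.
class CharacterSet {
public:
    CharacterSet(const char *label, const char *chars) : name(label) {
        for (const char *p = chars; *p; ++p)
            chars_[static_cast<unsigned char>(*p)] = true;
    }
    bool operator[](char c) const { return chars_[static_cast<unsigned char>(c)]; }
    const char *name; // a label for debugging
private:
    bool chars_[256] = {false};
};

// Tokenizer: transforms input into tokens using simple rules.
class Tokenizer {
public:
    explicit Tokenizer(const std::string &input) : buf_(input) {}

    // Extract the longest prefix made only of characters in `allowed`.
    bool prefix(std::string &out, const CharacterSet &allowed) {
        size_t n = 0;
        while (n < buf_.size() && allowed[buf_[n]])
            ++n;
        if (n == 0)
            return false;
        out = buf_.substr(0, n);
        buf_.erase(0, n);
        return true;
    }

    const std::string &remaining() const { return buf_; }

private:
    std::string buf_;
};
```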


 SBuf was not really designed to be passed by nonconst reference. 

I do not think that is a valid claim. Nearly any C++ object can be
passed by reference. There is nothing wrong with that as such, it is
often a good idea, and no general-purpose design can (or should)
prohibit that. The const part of the API (or lack thereof) is just
an indication that the method argument cannot (or can) be modified.


 const SBuf &remaining() const

 I'd change the signature to
 SBuf remaining() const

 copying a SBuf is easy, returning one puts a lower requirement on the
 caller and is less constrained

Copying SBuf is easy but relatively expensive compared to not copying. I
am not sure GCC would optimize that copy away when it can be. Here, I am
talking about calling various refcounting methods, not copying of data
buffers, of course.

The reference-based method would not allow the caller to modify the
return value directly, but can you think of a common use pattern or two
where such on-the-fly modification would actually be needed or desired?

Please note that we can change this aspect of the interface later if
needed without any need to change any of the callers (AFAICT).
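The cost difference being discussed can be sketched with a copy-counting stand-in for SBuf (illustrative only; SBuf's real per-copy cost is its refcounting bookkeeping, not data copying):

```cpp
#include <string>

// Stand-in for SBuf that counts copy-constructions, to show what a
// by-value remaining() pays that a by-const-reference one does not.
struct CountingBuf {
    static int copies;
    std::string data;
    explicit CountingBuf(const std::string &s) : data(s) {}
    CountingBuf(const CountingBuf &other) : data(other.data) { ++copies; }
};
int CountingBuf::copies = 0;

class Tok {
public:
    Tok() : buf_("rest-of-input") {}
    const CountingBuf &remainingByRef() const { return buf_; } // no copy at all
    CountingBuf remainingByValue() const { return buf_; }      // copy-constructs
private:
    CountingBuf buf_;
};
```

Note that returning a member by value cannot be elided: the copy constructor (and hence, for SBuf, the refcounting work) always runs at least once per call.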


 I'd also add to the interface a few constants to describe common
 character sets such as ALPHA, ALNUM, LOWERALPHA, UPPERALPHA etc. (I'd
 use the predefined character classes from grep(1) as a reference for
 common patterns).

Yes, the proposed design will work well only if we predefine virtually
all character sets. There is no need for a new interface for that
though. Just declare global constant CharacterSet objects (or functions
creating/returning them).

However, I recommend against adding a lot of such constants until they
are needed by a specific caller. Why waste 256 bytes? It is easy to add
more as needed.

I would recommend adding a macro to create a CharacterSet from one of
the ISALPHA(3) isfoo() functions/macros, but I am worried that those are
locale-dependent while almost everything in Squid should not be. We may
be better off hard-coding any protocol-defined character sets instead.
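The locale concern can be sketched as follows (names are illustrative, not Squid's): a set built from an isfoo() predicate inherits whatever the current locale says, while a hard-coded set is fixed by the protocol text.

```cpp
#include <cctype>

// 256-entry membership table; `chars` mirrors the proposed bool-array member.
struct CharSet256 {
    bool chars[256] = {false};
};

// Locale-dependent: the resulting set changes if setlocale() changes,
// which is exactly the worry with wrapping ISALPHA(3)-style isfoo() macros.
CharSet256 fromPredicate(int (*isfoo)(int)) {
    CharSet256 cs;
    for (int c = 0; c < 256; ++c)
        if (isfoo(c))
            cs.chars[c] = true;
    return cs;
}

// Locale-independent: the set is fixed by the protocol specification text.
CharSet256 fromChars(const char *chars) {
    CharSet256 cs;
    for (; *chars; ++chars)
        cs.chars[static_cast<unsigned char>(*chars)] = true;
    return cs;
}
```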


Thank you,

Alex.



Re: [PATCH] annotation support for external ACL helpers

2013-12-10 Thread Eliezer Croitoru

Now I just had a small hole in time to look.
The mentioned logic seems pretty reasonable to me.

Eliezer

On 23/11/13 13:02, Amos Jeffries wrote:

  entryData.tag = label;
@@ -1603,6 +1605,18 @@
  {
  ACLFilledChecklist *checklist = Filled(static_cast<ACLChecklist*>(data));
  checklist->extacl_entry = cbdataReference((external_acl_entry *)result);
+
+// attach the helper kv-pair to the transaction
+if (HttpRequest * req = checklist->request) {
+// XXX: we have no access to the transaction / AccessLogEntry so can't SyncNotes().
+// workaround by using anything already set in HttpRequest
+// OR use new and rely on a later Sync copying these to AccessLogEntry
+if (!req->notes)
+req->notes = new NotePairs;
+
+req->notes->appendNewOnly(checklist->extacl_entry->notes);
+}
+
  checklist->resumeNonBlockingCheck(ExternalACLLookup::Instance());
  }






Re: [RFC] Tokenizer API

2013-12-10 Thread Alex Rousskov
On 12/09/2013 04:13 PM, Amos Jeffries wrote:

 Two requests for additional scope:

 * can we place this in a separate src/parse/ library please?
  - we have other generic parse code that deserves to all be bundled up
 together instead of spread out. Might as well start that collection
 process now.

No objections.


 * Let's do the charset boolean array earlier rather than later.

No objections.


I hope I will not be the one implementing the proposed API or the above
additions though. :-)


   CharacterSet(const char * const c, size_t len)

You do not want the len argument. These character sets will be nearly
always initialized once, from constant hard-coded c-strings.

For esoteric cases, I would add an add(const char c) method to add a
single character to the set (and use the merge operation to produce more
elaborate sets as needed, see below for a sketch).


It is a good idea to add a public label/name argument/member though, for
debugging/triaging:

  CharacterSet(const char *label, const char *characters);
  ...
  const char *name; /// a label for debugging


   /// add all characters from the given CharacterSet to this one
   void merge(const CharacterSet &src) {

Please provide a '+=' operator [that will use merge()]. Here is how I
envision some sets might be created:

const CharacterSet &MySpaces() {
    static CharacterSet cs;
    if (!cs.name) {
        cs = CharacterSet("MySpaces", "\n\t "); // RFC 12345
        cs += OtherProtocolSpaces();
        if (Config.weNeedToBeSpecial)
            cs.add('\r');
    }
    return cs;
}


 private:
   bool match_[256];

I suggest using std::vector<bool> to avoid sprinkling the code with 256
constants or adding another member to store the hard-coded length.

I would also call this member chars_ instead of match_ because some sets
will be used in negative sense and match will be a little confusing in
that negative context.


 NP: most of the time we will be wanting to define these CharacterSets as
 global once-off objects. So I'm not sure if the merge() method is
 useful, but it is shown here for completeness in case we want it for
 generating composite character sets.

Agreed on all counts.


Cheers,

Alex.



Re: [RFC] Tokenizer API

2013-12-10 Thread Kinkie
On Tue, Dec 10, 2013 at 5:06 PM, Alex Rousskov
rouss...@measurement-factory.com wrote:
 On 12/09/2013 04:13 PM, Amos Jeffries wrote:

 Two requests for additional scope:

 * can we place this in a separate src/parse/ library please?
  - we have other generic parse code that deserves to all be bundled up
 together instead of spread out. Might as well start that collection
 process now.

 No objections.


 * Let's do the charset boolean array earlier rather than later.

 No objections.


 I hope I will not be the one implementing the proposed API or the above
 additions though. :-)

I can do it.
Worst-case there will be more audit rounds than strictly needed.


-- 
/kinkie


[PATCH] sslcrtvalidator_children concurrency option default value

2013-12-10 Thread Tsantilas Christos
Hi all,

currently we have the following situation for the sslcrtvalidator_children
configuration option, which may be confusing people:
 1) The testing sslcrtvalidator helper supports concurrency
 2) The default concurrency, if sslcrtvalidator_children is not set,
is concurrency=0
 3) The default sslcrtvalidator_children line includes concurrency=1
 4) The documentation says:
 Defaults to 0 which indicates the certficate validator
  is a old-style single threaded redirector.

This is confusing.

I am attaching a simple patch which sets the concurrency option for
sslcrtvalidator_children to 1 by default. I believe that this is the
best option because the testing helper we provide supports concurrency by default.

Other options are:
   a) set the sslcrtvalidator_children default line to use concurrency=0
   b) better document the current behaviour in cf.data.pre


Opinions?
=== modified file 'src/cf.data.pre'
--- src/cf.data.pre	2013-12-05 11:04:45 +
+++ src/cf.data.pre	2013-12-10 20:02:47 +
@@ -2655,42 +2655,42 @@
 	
 		startup=N
 	
 	Sets the minimum number of processes to spawn when Squid
 	starts or reconfigures. When set to zero the first request will
 	cause spawning of the first child process to handle it.
 	
 	Starting too few children temporary slows Squid under load while it
 	tries to spawn enough additional processes to cope with traffic.
 	
 		idle=N
 	
 	Sets a minimum of how many processes Squid is to try and keep available
 	at all times. When traffic begins to rise above what the existing
 	processes can handle this many more will be spawned up to the maximum
 	configured. A minimum setting of 1 is required.
 
 		concurrency=
 	
 	The number of requests each certificate validator helper can handle in
-	parallel. Defaults to 0 which indicates the certficate validator
-	is a old-style single threaded redirector.
+	parallel. A value of 0 indicates the certificate validator is an
+	old-style single-threaded redirector. Defaults to 1.
 	
 	When this directive is set to a value >= 1 then the protocol
 	used to communicate with the helper is modified to include
 	a request ID in front of the request/response. The request
 	ID from the request must be echoed back with the response
 	to that request.
 	
 	You must have at least one ssl_crt_validator process.
 DOC_END
 
 COMMENT_START
  OPTIONS WHICH AFFECT THE NEIGHBOR SELECTION ALGORITHM
  -
 COMMENT_END
 
 NAME: cache_peer
 TYPE: peer
 DEFAULT: none
 LOC: Config.peers
 DOC_START

=== modified file 'src/ssl/Config.cc'
--- src/ssl/Config.cc	2013-02-03 10:45:53 +
+++ src/ssl/Config.cc	2013-12-10 19:59:30 +
@@ -1,12 +1,21 @@
 #include "squid.h"
 #include "ssl/Config.h"
 
 Ssl::Config Ssl::TheConfig;
 
+Ssl::Config::Config():
+#if USE_SSL_CRTD
+ssl_crtd(NULL),
+#endif
+ssl_crt_validator(NULL)
+{ 
+ssl_crt_validator_Children.concurrency = 1;
+}
+
 Ssl::Config::~Config()
 {
 #if USE_SSL_CRTD
 xfree(ssl_crtd);
 #endif
 xfree(ssl_crt_validator);
 }

=== modified file 'src/ssl/Config.h'
--- src/ssl/Config.h	2013-02-03 10:45:53 +
+++ src/ssl/Config.h	2013-12-10 19:58:13 +
@@ -1,34 +1,29 @@
 #ifndef SQUID_SSL_CONFIG_H
 #define SQUID_SSL_CONFIG_H
 
 #include "HelperChildConfig.h"
 
 namespace Ssl
 {
 
 class Config
 {
 public:
 #if USE_SSL_CRTD
 char *ssl_crtd; /// Name of external ssl_crtd application.
 /// The number of processes spawn for ssl_crtd.
 HelperChildConfig ssl_crtdChildren;
 #endif
 char *ssl_crt_validator;
 HelperChildConfig ssl_crt_validator_Children;
-Config():
-#if USE_SSL_CRTD
-ssl_crtd(NULL),
-#endif
-ssl_crt_validator(NULL) {}
-
+Config();
 ~Config();
 private:
 Config(const Config &); // not implemented
 Config &operator =(const Config &); // not implemented
 };
 
 extern Config TheConfig;
 
 } // namespace Ssl
 #endif



Re: [PATCH] sslcrtvalidator_children concurrency option default value

2013-12-10 Thread Amos Jeffries

On 2013-12-11 10:46, Tsantilas Christos wrote:

Hi all,

currently we have the following situation for the sslcrtvalidator_children
configuration option, which may be confusing people:
 1) The testing sslcrtvalidator helper supports concurrency
 2) The default concurrency, if sslcrtvalidator_children is not set,
is concurrency=0
 3) The default sslcrtvalidator_children line includes concurrency=1
 4) The documentation says:
 Defaults to 0 which indicates the certficate validator
  is a old-style single threaded redirector.

This is confusing.

I am attaching a simple patch which sets the concurrency option for
sslcrtvalidator_children to 1 by default. I believe that this is the
best option because the testing helper we provide supports concurrency by
default.


Other options are:
   a) set the sslcrtvalidator_children default line to use concurrency=0
   b) better document the current behaviour in cf.data.pre



Opinions?


+1. Looks good. But can you change the new documentation line to avoid
calling it a redirector and single-threaded? "A value of 0 indicates
the certificate validator does not support concurrency." should be fine.


Amos


[PATCH] Destroy ACLs properly, take2

2013-12-10 Thread Alex Rousskov
On 12/01/2013 05:43 PM, Alex Rousskov wrote:

 The attached patch destroys ACLs in the reverse order of creation to
 avoid destruction segfaults during reconfiguration.

 Done as trunk r13165.


Sorry, that was not enough. I somehow missed an obvious use case that
the committed fix does not cover, despite specifically trying to imagine
it during testing: It is actually possible for a group ACL G to use an
ACL A1 that was created before G and an ACL A2 that was created after G:

  acl A1 src 127.0.0.1
  acl G all-of A1
  acl A2 src 127.0.0.2
  acl G all-of A2

Such order of ACL creation makes any creation-based destruction order
invalid: Whether we use FIFO (as before) or LIFO (as after r13165), G's
destructor may dump core when deleting either A1 or A2 because one of
them will be already deleted earlier via Config.aclList.

The attached patch fixes the problem differently. Instead of using
Config.aclList to register and destroy explicit ACLs and then relying on
InnerNode destructor to destroy its sometimes explicit children, the
take2 patch dedicates a new container to register and destroy both
implicit and explicit ACLs in one sweep.

The InnerNode destructor is gone now, with aclDestroyAcls() replacing
its functionality.

The Config.aclList global is still used for searching explicit ACLs by
name -- no changes there.

Refcounting remains the preferred long-term solution, of course. I have
once again reviewed the changes required to make that happen, and can
confirm that they are too big for me to volunteer for that project right
now. Volunteers welcome.


Thank you,

Alex.
P.S. I left the registration order as in r13165 because it is much
faster (eliminates one linear search through all ACLs when adding a new
ACL). That can be reverted if we find a reason to search starting with
older ACLs. However, if that search order is important, we should
replace Config.aclList linked list with an ordered-by-ACL-name
container, to eliminate another case of a linear search.

Centrally destroy all explicit and implicit ACLs to avoid destruction segfaults
during reconfiguration.

Group ACLs created later may use other ACLs created earlier and vice versa: a
group ACL created earlier may use other ACLs created later. The latter is
possible when an ACL (e.g., A2 below) is declared when the group already
exists:

  acl A1 src 127.0.0.1
  acl Group all-of A1
  acl A2 src 127.0.0.2
  acl Group all-of A2

Thus, the group (i.e., InnerNode) ACL destructor may access already deleted
children regardless of the global ACL deletion order (FIFO or LIFO with
respect to ACL creation). Instead of relying on the deletion order to protect
InnerNode, we remove the InnerNode ACL destructor completely and rely on a
global set of registered ACLs to destroy all ACLs.

The old code was destroying all explicit ACLs in the same centralized fashion.
We now add implicit ACLs (commonly used by InnerNodes) to the centralized
destruction sequence. We added a new destruction-dedicated container to avoid
messing with the by-name ACL search that Config.aclList global is used for.
This new container will become unnecessary once we start refcounting ACLs.
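The registration idea described above can be sketched like this (illustrative names, not Squid's actual classes): construction registers, destruction deregisters, and one central sweep deletes everything still registered, so no InnerNode ever deletes its children itself, regardless of creation order.

```cpp
#include <set>
#include <vector>

class Acl;
// Stand-in for the new destruction-dedicated container.
static std::set<Acl*> RegisteredAcls;

class Acl {
public:
    Acl() { RegisteredAcls.insert(this); }
    virtual ~Acl() { RegisteredAcls.erase(this); }
};

// A group ACL remembers children for matching but does not own them:
// there is deliberately no destructor deleting children_.
class InnerNode : public Acl {
public:
    void add(Acl *child) { children_.push_back(child); }
private:
    std::vector<Acl*> children_;
};

// Stand-in for aclDestroyAcls(): one central sweep destroys every
// registered ACL, explicit or implicit, in one pass.
void destroyAllAcls() {
    while (!RegisteredAcls.empty())
        delete *RegisteredAcls.begin();
}
```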

=== modified file 'src/acl/Acl.cc'
--- src/acl/Acl.cc	2013-12-02 00:35:50 +
+++ src/acl/Acl.cc	2013-12-10 23:48:56 +
@@ -14,40 +14,41 @@
  *  sources; see the CREDITS file for full details.
  *
  *  This program is free software; you can redistribute it and/or modify
  *  it under the terms of the GNU General Public License as published by
  *  the Free Software Foundation; either version 2 of the License, or
  *  (at your option) any later version.
  *
  *  This program is distributed in the hope that it will be useful,
  *  but WITHOUT ANY WARRANTY; without even the implied warranty of
  *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  *  GNU General Public License for more details.
  *
  *  You should have received a copy of the GNU General Public License
  *  along with this program; if not, write to the Free Software
  *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111, USA.
  *
  */
 #include "squid.h"
 #include "acl/Acl.h"
 #include "acl/Checklist.h"
+#include "acl/Gadgets.h"
 #include "anyp/PortCfg.h"
 #include "cache_cf.h"
 #include "ConfigParser.h"
 #include "Debug.h"
 #include "dlink.h"
 #include "globals.h"
 #include "profiler/Profiler.h"
 #include "SquidConfig.h"
 
 const ACLFlag ACLFlags::NoFlags[1] = {ACL_F_END};
 
 const char *AclMatchedName = NULL;
 
 bool ACLFlags::supported(const ACLFlag f) const
 {
 if (f == ACL_F_REGEX_CASE)
 return true;
 return (supported_.find(f) != std::string::npos);
 }
 
@@ -278,46 +279,47 @@ ACL::ParseAclLine(ConfigParser parser, 
 /*split the function here */
 A->parse();
 
 /*
  * Clear AclMatchedName from our temporary hack
  */
 AclMatchedName = NULL;	/* ugly */
 
 if (!new_acl)
 return;
 
 if (A->empty()) {
 debugs(28, DBG_CRITICAL, "Warning: empty ACL: " << A->cfgline);
 }