[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179026313
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
--- End diff --

Remove commented out line


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179027582
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -154,6 +192,7 @@ start_bit ( )
   nohup nice -n $DRILL_NICENESS "$DRILL_HOME/bin/runbit" exec ${args[@]} 
>> "$logout" 2>&1 &
   echo $! > $pid
   sleep 1
+  check_after_start
--- End diff --

To make things easier:

```
procId=$!
echo $procId > $pid # Yeah, $pid is a file, $procId is the pid...
sleep 1
check_after_start $procId
```

Also, we now have to naming conventions: `waitForProcessEnd` and 
`check_after_start`. Doesn't match which we choose, but we should stick with 
it. Where is check style for scripts?


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179025617
  
--- Diff: distribution/src/resources/drill-env.sh ---
@@ -86,6 +86,12 @@
 
 #export DRILL_PID_DIR=${DRILL_PID_DIR:-$DRILL_HOME}
 
+# CGroup to which the Drillbit belong when running as a daemon using
--- End diff --

Also, maybe provide a bit more of a description. My guess is that the 
CGroup must already be set up? That we only enforce CPU? Or, is what we enforce 
determined entirely by what the user has configured? If so, wouldn't "drill" be 
a better CGroup name?


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179028127
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
+  echo $dbitPid > /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs
+  # Verify Enforcement
+  cgroupStatus=`grep -w $pid 
/cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs`
+  if [ -z "$cgroupStatus" ]; then
+#Ref: 
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
+let cpu_quota=`cat /cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_quota_us`
+let cpu_period=`cat 
/cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_period_us`
+if [ $cpu_period -gt 0 ] && [ $cpu_quota -gt 0 ]; then
+  coresAllowed="(up to "`echo $(( 100 * $cpu_quota / $cpu_period 
)) | sed 's/..$/.&/'`" cores allowed)"
+fi
+echo "WARN: Drillbit's CPU resource usage will be managed under 
the CGroup : $DRILLBIT_CGROUP "$coresAllowed
+  else
+echo "ERROR: Failed to add Drillbit to CGroup ( $DRILLBIT_CGROUP ) 
for resource usage management.  Ensure that the cgroup manages CPU"
+exit 1
+  fi
+else
+  echo "ERROR: cgroup $DRILLBIT_CGROUP does not found. Ensure that 
daemon is running and cgroup exists"
+  exit 1
--- End diff --

Not sure we want to fail the script in this case. Failure means that the 
Drillbit did not start, but it did.


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179028625
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
+  echo $dbitPid > /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs
+  # Verify Enforcement
+  cgroupStatus=`grep -w $pid 
/cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs`
+  if [ -z "$cgroupStatus" ]; then
+#Ref: 
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
+let cpu_quota=`cat /cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_quota_us`
--- End diff --

`let` is used only for math. Just say `cpu_quota=cat...`


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179027761
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
+  echo $dbitPid > /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs
+  # Verify Enforcement
+  cgroupStatus=`grep -w $pid 
/cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs`
+  if [ -z "$cgroupStatus" ]; then
+#Ref: 
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
+let cpu_quota=`cat /cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_quota_us`
+let cpu_period=`cat 
/cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_period_us`
+if [ $cpu_period -gt 0 ] && [ $cpu_quota -gt 0 ]; then
+  coresAllowed="(up to "`echo $(( 100 * $cpu_quota / $cpu_period 
)) | sed 's/..$/.&/'`" cores allowed)"
+fi
+echo "WARN: Drillbit's CPU resource usage will be managed under 
the CGroup : $DRILLBIT_CGROUP "$coresAllowed
--- End diff --

Why is this a WARN? Isn't this exactly what I requested?

Maybe emit this as a message (no WARN) if a -v (verbose) flag is set.

Also, put the variable inside the quotes; bash is handy that way...

Message will be "...CGroup : drillcpu 5". Maybe, "CGroup drillcpu will 
limit Drill to 5 cpu(s)".


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179028078
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
+  echo $dbitPid > /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs
+  # Verify Enforcement
+  cgroupStatus=`grep -w $pid 
/cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs`
+  if [ -z "$cgroupStatus" ]; then
+#Ref: 
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
+let cpu_quota=`cat /cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_quota_us`
+let cpu_period=`cat 
/cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_period_us`
+if [ $cpu_period -gt 0 ] && [ $cpu_quota -gt 0 ]; then
+  coresAllowed="(up to "`echo $(( 100 * $cpu_quota / $cpu_period 
)) | sed 's/..$/.&/'`" cores allowed)"
+fi
+echo "WARN: Drillbit's CPU resource usage will be managed under 
the CGroup : $DRILLBIT_CGROUP "$coresAllowed
+  else
+echo "ERROR: Failed to add Drillbit to CGroup ( $DRILLBIT_CGROUP ) 
for resource usage management.  Ensure that the cgroup manages CPU"
+exit 1
+  fi
+else
+  echo "ERROR: cgroup $DRILLBIT_CGROUP does not found. Ensure that 
daemon is running and cgroup exists"
--- End diff --

cgroup --> CGroup (or at least make the format consistent across messages)

does not found --> not found

We can check if the daemon is running using `kill -0 $pid`. If so, then we 
just tell the user something line "Check CGroup configuration."

Also, errors should go to stdout: `echo "msg" >&2`


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179026462
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
--- End diff --

I believe cgroup setup is OS-specific. There are, AFAIK, multiple group 
versions.

For this reason, I wonder if this needs to have checks. Are there scripts 
we can crib from somewhere that do those checks?


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179025777
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
--- End diff --

If you get this far, the pid must exist or the scripts are broken.

The only question is whether the process is still running. The chance that 
it is not is very low. Also, there is a race condition: the process may exit 
just after we check it. So, not sure it is even worth doing the check.

Finally, the standard way to check for the process is `kill -0 $pid`". See 
`waitForProcessEnd()`.


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179028377
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
+  echo $dbitPid > /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs
+  # Verify Enforcement
+  cgroupStatus=`grep -w $pid 
/cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs`
+  if [ -z "$cgroupStatus" ]; then
+#Ref: 
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
+let cpu_quota=`cat /cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_quota_us`
+let cpu_period=`cat 
/cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_period_us`
+if [ $cpu_period -gt 0 ] && [ $cpu_quota -gt 0 ]; then
+  coresAllowed="(up to "`echo $(( 100 * $cpu_quota / $cpu_period 
)) | sed 's/..$/.&/'`" cores allowed)"
+fi
+echo "WARN: Drillbit's CPU resource usage will be managed under 
the CGroup : $DRILLBIT_CGROUP "$coresAllowed
+  else
+echo "ERROR: Failed to add Drillbit to CGroup ( $DRILLBIT_CGROUP ) 
for resource usage management.  Ensure that the cgroup manages CPU"
+exit 1
--- End diff --

See below about `exit 1`


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179025400
  
--- Diff: distribution/src/resources/drill-env.sh ---
@@ -86,6 +86,12 @@
 
 #export DRILL_PID_DIR=${DRILL_PID_DIR:-$DRILL_HOME}
 
+# CGroup to which the Drillbit belong when running as a daemon using
--- End diff --

belong --> belongs


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179027910
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
+  echo $dbitPid > /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs
+  # Verify Enforcement
+  cgroupStatus=`grep -w $pid 
/cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs`
+  if [ -z "$cgroupStatus" ]; then
+#Ref: 
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
+let cpu_quota=`cat /cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_quota_us`
+let cpu_period=`cat 
/cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_period_us`
+if [ $cpu_period -gt 0 ] && [ $cpu_quota -gt 0 ]; then
+  coresAllowed="(up to "`echo $(( 100 * $cpu_quota / $cpu_period 
)) | sed 's/..$/.&/'`" cores allowed)"
+fi
+echo "WARN: Drillbit's CPU resource usage will be managed under 
the CGroup : $DRILLBIT_CGROUP "$coresAllowed
+  else
+echo "ERROR: Failed to add Drillbit to CGroup ( $DRILLBIT_CGROUP ) 
for resource usage management.  Ensure that the cgroup manages CPU"
--- End diff --

Inconsistent format: parens here, colon above. Actually, neither are needed.

Also, "for resource usage management" --> "for cpu"


---


Re: "Death of Schema-on-Read"

2018-04-03 Thread Ted Dunning
Well, the restart strategy still works for your examples. And you only pay
once. From them you look at the cached type information and used an upper
bound data type as you read the data. Since it works to read the values in
the right order, it is obviously possible to push down typing information
even into the json reader.



On Tue, Apr 3, 2018, 21:42 Paul Rogers  wrote:

> Subtle point. I can provide schema with Parquet, as you note. (Actually,
> for Parquet, Drill is schema-required: I can't not provide a schema due to
> the nature of Parquet...)
>
> But, I can't provide a schema for JSON, CSV, etc. The point is, Drill
> forbids the user from providing a schema; only the file format itself can
> provide the schema (or not, in the case of JSON). This is the very heart of
> the problem.
>
> The root cause of our schema change exception is that vectors are, indeed,
> strongly typed. But, file columns are not. Here is my favorite:
>
> {x: 10} {x: 10.1}
>
> Blam! Query fails because the vector is chosen as BigInt, then we discover
> it really should have been Float8. (If the answer is: go back and rebuild
> the vector with the new type, consider the case that 100K records separate
> the two above so that the first batch is long gone by the time we see the
> offending record. If only I could tell Drill to use Float8 (or Decimal) up
> front...
>
> Views won't help here because the failure occurs before a view can kick
> in. However, presumably, I could write a view to handle a different classic
> case:
>
> myDir /
> |- File 1: {a: 10, b: "foo"}
> |- File 2: {a: 20}
>
> With query: SELECT a, b FROM myDir
>
> For File 2, Drill will guess that b is a Nullable Int, but it is really
> VarChar. I think I could write clever SQL that says:
>
> If b is of type Nullable Int, return NULL cast to nullable VarChar, else
> return b
>
> The irony is that I must to write procedural code to declare a static
> attribute of the data. Yet SQL is otherwise declarative: I state what I
> want, not how to implement it.
>
> Life would be so much easier if I could just say, "trust me, when you read
> column b, it is a VarChar."
>
> Thanks,
> - Paul
>
>
>
> On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  I don't see why you say that Drill is schema-forbidden.
>
> The Parquet reader, for instance, makes strong use of the implied schema to
> facilitate reading of typed data.
>
> Likewise, the vectorized internal format is strongly typed and, as such,
> uses schema information.
>
> Views are another way to communicate schema information.
>
> It is true that you can't, say, view comments on fields from the command
> line. But I don't understand saying "schema-forbidden".
>
>
> On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers 
> wrote:
>
> > Here is another way to think about it. Today, Drill is
> "schema-forbidden":
> > even if I know the schema, I can't communicate that to Drill; Drill must
> > figure it out on its own, making the same mistakes every time on
> ambiguous
> > schemas.
> >
>


Re: "Death of Schema-on-Read"

2018-04-03 Thread Paul Rogers
Subtle point. I can provide schema with Parquet, as you note. (Actually, for 
Parquet, Drill is schema-required: I can't not provide a schema due to the 
nature of Parquet...)

But, I can't provide a schema for JSON, CSV, etc. The point is, Drill forbids 
the user from providing a schema; only the file format itself can provide the 
schema (or not, in the case of JSON). This is the very heart of the problem.

The root cause of our schema change exception is that vectors are, indeed, 
strongly typed. But, file columns are not. Here is my favorite:

{x: 10} {x: 10.1}

Blam! Query fails because the vector is chosen as BigInt, then we discover it 
really should have been Float8. (If the answer is: go back and rebuild the 
vector with the new type, consider the case that 100K records separate the two 
above so that the first batch is long gone by the time we see the offending 
record. If only I could tell Drill to use Float8 (or Decimal) up front...

Views won't help here because the failure occurs before a view can kick in. 
However, presumably, I could write a view to handle a different classic case:

myDir /
|- File 1: {a: 10, b: "foo"}
|- File 2: {a: 20}

With query: SELECT a, b FROM myDir

For File 2, Drill will guess that b is a Nullable Int, but it is really 
VarChar. I think I could write clever SQL that says:

If b is of type Nullable Int, return NULL cast to nullable VarChar, else return 
b

The irony is that I must to write procedural code to declare a static attribute 
of the data. Yet SQL is otherwise declarative: I state what I want, not how to 
implement it.

Life would be so much easier if I could just say, "trust me, when you read 
column b, it is a VarChar."

Thanks,
- Paul

 

On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning 
 wrote:  
 
 I don't see why you say that Drill is schema-forbidden.

The Parquet reader, for instance, makes strong use of the implied schema to
facilitate reading of typed data.

Likewise, the vectorized internal format is strongly typed and, as such,
uses schema information.

Views are another way to communicate schema information.

It is true that you can't, say, view comments on fields from the command
line. But I don't understand saying "schema-forbidden".


On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers 
wrote:

> Here is another way to think about it. Today, Drill is "schema-forbidden":
> even if I know the schema, I can't communicate that to Drill; Drill must
> figure it out on its own, making the same mistakes every time on ambiguous
> schemas.
>
  

[GitHub] drill pull request #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsExc...

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1144#discussion_r179023466
  
--- Diff: src/main/resources/checkstyle-config.xml ---
@@ -30,10 +30,15 @@
 
 
 
+
--- End diff --

Do we want to do this? Why are all these forbidden? `Throwable` was needed 
by @Ben-Zvi in his very first Drill bug fix. `RuntimeException` is the root of 
the unchecked exceptions beloved by Drill code and is used as a general 
"something went wrong" placeholder.

These are all commented out, but still...

Finally, `IndexOutOfBoundsException` is a perfectly fine exception. Adding 
this line is probably handy to catch all reference as part of this fix, but 
probably not helpful as a general policy.


---


[GitHub] drill pull request #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsExc...

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1144#discussion_r179022090
  
--- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---
@@ -777,23 +778,20 @@ public int getActualMemoryConsumed() {
* @return A hex dump in a String.
*/
   public String toHexString(final int start, final int length) {
-final int roundedStart = (start / LOG_BYTES_PER_ROW) * 
LOG_BYTES_PER_ROW;
-
-final StringBuilder sb = new StringBuilder("buffer byte dump\n");
-int index = roundedStart;
-for (int nLogged = 0; nLogged < length; nLogged += LOG_BYTES_PER_ROW) {
-  sb.append(String.format(" [%05d-%05d]", index, index + 
LOG_BYTES_PER_ROW - 1));
-  for (int i = 0; i < LOG_BYTES_PER_ROW; ++i) {
-try {
-  final byte b = getByte(index++);
-  sb.append(String.format(" 0x%02x", b));
-} catch (IndexOutOfBoundsException ioob) {
-  sb.append(" ");
-}
+Preconditions.checkArgument(start >= 0);
+final StringBuilder sb = new StringBuilder("buffer byte dump");
+final int end = Math.min(length, this.length - start);
+for (int i = 0; i < end; i++) {
+  if (i % LOG_BYTES_PER_ROW == 0) {
+sb.append(String.format("%n [%05d-%05d]", i + start, Math.min(i + 
LOG_BYTES_PER_ROW - 1, end - 1) + start));
   }
-  sb.append('\n');
+  byte b = _getByte(i + start);
+  sb.append(" 0x").append(HEX_CHAR[b >> 4]).append(HEX_CHAR[b & 0x0F]);
--- End diff --

Is this complexity needed relative to the original `format(" 0x02x", b)` 
implementation? Is this code performance sensitive enough to justify the extra 
complexity?


---


[GitHub] drill pull request #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsExc...

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1144#discussion_r179022533
  
--- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java 
---
@@ -534,15 +534,11 @@ public void setSafe(int index, byte[] bytes) {
   assert index >= 0;
 
   final int currentOffset = offsetVector.getAccessor().get(index);
-  offsetVector.getMutator().setSafe(index + 1, currentOffset + 
bytes.length);
-  try {
-data.setBytes(currentOffset, bytes, 0, bytes.length);
-  } catch (IndexOutOfBoundsException e) {
-while (data.capacity() < currentOffset + bytes.length) {
-  reAlloc();
-}
-data.setBytes(currentOffset, bytes, 0, bytes.length);
+  while (data.capacity() < currentOffset + bytes.length) {
--- End diff --

One trick used in the row set reader code is to avoid redundant re-allocs 
by computing the new size and rounding to a power of two. This is handy at the 
start with the vector doubles from 256 to 512 to 1K and the data size is, say, 
600 bytes.

However, it is not clear that such an optimization is as necessary as it 
once was: the team's been doing a good job at adding up-front vector allocation 
where it was missing, reducing the gratuitous reallocs.


---


[GitHub] drill pull request #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsExc...

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1144#discussion_r179023079
  
--- Diff: 
exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/BaseScalarWriter.java
 ---
@@ -211,7 +211,7 @@ protected boolean canExpand(int delta) {
 
   protected void overflowed() {
 if (listener == null) {
-  throw new IndexOutOfBoundsException("Overflow not supported");
+  throw new UnsupportedOperationException("Overflow not supported");
--- End diff --

Maybe change to `OversizedAllocationException`. See 
`FixedValueVectors.setInitialCapacity()`.


---


[GitHub] drill pull request #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsExc...

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1144#discussion_r179023675
  
--- Diff: src/main/resources/checkstyle-suppressions.xml ---
@@ -16,4 +16,13 @@
 
 
   
+  

[GitHub] drill pull request #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsExc...

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1144#discussion_r179022338
  
--- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java 
---
@@ -534,15 +534,11 @@ public void setSafe(int index, byte[] bytes) {
   assert index >= 0;
 
   final int currentOffset = offsetVector.getAccessor().get(index);
-  offsetVector.getMutator().setSafe(index + 1, currentOffset + 
bytes.length);
-  try {
-data.setBytes(currentOffset, bytes, 0, bytes.length);
-  } catch (IndexOutOfBoundsException e) {
-while (data.capacity() < currentOffset + bytes.length) {
-  reAlloc();
-}
-data.setBytes(currentOffset, bytes, 0, bytes.length);
+  while (data.capacity() < currentOffset + bytes.length) {
--- End diff --

If we compare this implementation to the one from, say, a year or 18 months 
ago, I think we're back where we started.

This would actually be a good use case for a "checkedSetBytes" method that 
does the bounds checks. Still, whether done here or in the underlying method, 
the `if` statement is needed, so no savings in using a "checked" method,


---


[GitHub] drill issue #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsException ...

2018-04-03 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1144
  
My two cents... DrillBuf is the only memory-level abstraction that (low 
level) Drill code should reference. The UDLE and other bits should be fully 
encapsulated. This guideline lets us evolve the representation if we ever need 
to do so.

The original design appeared to be that value vectors would be the primary 
interface to memory. But, a great many issues made that difficult, not least of 
which is that vector access methods are heavily typed, resulting in far too 
much casting. Also, the mutator methods try to do the full operation, leading 
to inefficiency (especially around VarChars).

A more general rule is that application code should work with vectors until 
they can migrate to working with the result set loader or reader. (We should 
probably call these the row set emitter and collector to be more 
Hadoop-like...) The higher-level abstractions handle the grunt work currently 
spread throughout operators.

(And, to answer a prior question: we want to use the row set abstractions 
so we have a uniform way to write to vectors, to control batch size, to handle 
schema issues and so on on write. And, to have a standard way to handle 
indirection vectors and vector navigation on read.)

Ideally only, the vector mutator or row set loader implementation works 
with DrillBuf to do actual data reads and writes. In an early version, the row 
set loader code used `PlatformDependent` to avoid bounds checks. But, with 
@vrozov's improvements, doing so became unnecessary -- a nice improvement.

Still, bounds checks should be done during tests: it is handy to work with 
a safety net.

Since bounds checks are optional (turned off in production), then the 
changes here make good sense: no code should count on bounds checks from the 
"unchecked" methods for the simple reason that the checks are normally off.

That said, if there is a reason to have "checked" access, we could provide 
such methods. Those methods would throw the `IndexOutOfBoundsException`.  That 
is, the checked methods would recreate the original "get/set" methods prior to 
@vrozov's improvements. I can't think of a reason to do that off the top of my 
head, but someone might present a valid use case.


---


[GitHub] drill issue #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread Ben-Zvi
Github user Ben-Zvi commented on the issue:

https://github.com/apache/drill/pull/1200
  
Could there be users with an older Linux (pre 4.5, circa March 2016) which 
does not support cgroups V2 ?



---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread Ben-Zvi
Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179005732
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
--- End diff --

Why is the path hardcoded ? Is it a standard ?  I checked on my cluster, 
where /sys/fs/cgroup/. is used instead.



---


[jira] [Created] (DRILL-6307) Handle empty batches in record batch sizer correctly

2018-04-03 Thread Padma Penumarthy (JIRA)
Padma Penumarthy created DRILL-6307:
---

 Summary: Handle empty batches in record batch sizer correctly
 Key: DRILL-6307
 URL: https://issues.apache.org/jira/browse/DRILL-6307
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Flow
Affects Versions: 1.13.0
Reporter: Padma Penumarthy
Assignee: Padma Penumarthy
 Fix For: 1.14.0


when we get empty batch, record batch sizer calculates row width as zero. In 
that case, we do not do accounting and memory allocation correctly for outgoing 
batches. 

For example, in merge join, for outer left join, if right side batch is empty, 
we still have to include the right side columns as null in outgoing batch. We 
have to allocate memory for those vectors correctly. We also have to include 
row width of those columns in the outgoing row width and number of rows 
calculation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] drill issue #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsException ...

2018-04-03 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1144
  
IMO, this PR or DRILL-6202 is not the best place to discuss boundary 
checking as the PR and JIRA deals with `IndexOutOfBoundsException` but does not 
change how DrillBuf, Vector or Operators ensure that reads and writes are done 
within proper boundaries. I'll bring that to dev@drill. Please repeat your 
arguments on the mailing list. 


---


[GitHub] drill issue #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsException ...

2018-04-03 Thread sachouche
Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1144
  
I have added a comment within this PR associated JIRA: 
[DRILL-6202](https://issues.apache.org/jira/browse/DRILL-6202)


---


[GitHub] drill pull request #1195: DRILL-6273: Removed dependency licensed under Cate...

2018-04-03 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1195#discussion_r178953868
  
--- Diff: tools/fmpp/src/main/java/bsh/package-info.java ---
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+ /**
+ * Generate-fmpp has a dependency on beanshell EvalError. Beanshell 
doesn't have a valid
+ * Apache License, So beanshell is excluded and EvalError class is added 
to handle the dependency.
+ */
+
+package bsh;
--- End diff --

Please add LF and squash commits.


---


Re: "Death of Schema-on-Read"

2018-04-03 Thread Ted Dunning
I don't see why you say that Drill is schema-forbidden.

The Parquet reader, for instance, makes strong use of the implied schema to
facilitate reading of typed data.

Likewise, the vectorized internal format is strongly typed and, as such,
uses schema information.

Views are another way to communicate schema information.

It is true that you can't, say, view comments on fields from the command
line. But I don't understand saying "schema-forbidden".


On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers 
wrote:

> Here is another way to think about it. Today, Drill is "schema-forbidden":
> even if I know the schema, I can't communicate that to Drill; Drill must
> figure it out on its own, making the same mistakes every time on ambiguous
> schemas.
>


Re: "Death of Schema-on-Read"

2018-04-03 Thread Paul Rogers
Here is another way to think about it. Today, Drill is "schema-forbidden": even 
if I know the schema, I can't communicate that to Drill; Drill must figure it 
out on its own, making the same mistakes every time on ambiguous schemas.

Contrast this with Hive, which is "schema-required": I must tell Hive the 
schema even in cases where Hive could easily figure it out on its own. 

Perhaps Drill can occupy a middle ground: "schema-optional": Drill will figure 
out the schema as best it can, but will accept suggestions (hints) which the 
user can provide when it is the most efficient path to get work done.

Once hints are supported, then a system such as Ted's can be built on top: 
retry the query to use a bit of machine learning to infer the schema. Or, get 
the schema from Hive. Or from Comcast's Avro files. Or whatever.

The point is, if the user knows the schema, and is willing to resolve the 
ambiguities for us, what value do we provide by refusing to accept those hints?

On the other hand, since schema is optional, then Drill can continue to be used 
for Parth's schema exploration use case. 

Still, after doing a bit of exploration; the needs to move into getting work 
done based on that exploration. This seems to be the case at Comcast: they've 
move past exploration into production. But, Drill has limited means to use the 
result of exploration to resolve schema ambiguities on future queries. (Views 
are a partial answer, but have gaps.)

Ted makes a good point: Drill works most of the time already. The suggestion is 
that users might prefer that Drill works not just most of the time, but rather 
all of the time so users can reliably get their work done with no surprises, 
even with less-than-perfect schemas. If providing a few schema hints is the 
cost to pay to get that reliability, shouldn't the user in a position to choose 
to make that tradeoff?

Thanks,
- Paul

 

On Tuesday, April 3, 2018, 2:32:05 AM PDT, Parth Chandra 
 wrote:  
 
 This, of course, begs the question [1], doesn't it?

If you have the schema, then you have either a) spent time designing and
documenting your data (both the schema and dictionary containing the
semantics) or b) spent time "finding, interpreting, and cleaning data" to
discover the data schema and dictionary.

Data that has "no documentation beyond attribute names, which may be
inscrutable, vacuous, or even misleading" will continue to be so even after
you specify the schema.

Asking users to design their schemas when they have already accumulated
data that is unclean and undocumented is asking them to do the work that
they use your software for in the first place.

The goal of schema on read is to facilitate the task of interpreting the
data that already exists, is mutating, and is undocumented (or documented
badly).


[1] https://en.wikipedia.org/wiki/Begging_the_question


On Mon, Apr 2, 2018 at 11:16 AM, Paul Rogers 
wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-
> death-of-schema-on-read?imm_mid=0fc3c6=em-data-na-na-newsltr_20180328
>
>
>
>
>
  

Re: Drill Hangout today at 10 am PST

2018-04-03 Thread Timothy Farkas
Hi All,

I'll be giving a presentation about unit testing in Drill.

Thanks,
Tim


From: Vitalii Diravka 
Sent: Tuesday, April 3, 2018 7:53:49 AM
To: dev@drill.apache.org
Subject: Re: Drill Hangout today at 10 am PST

I have a small topic/question regarding "Jenkins builds" for drill-scm
project [1]
It started to fail due to the old Java version.
I'm interested in the possibility of updating the JDK version there.

1. 
https://urldefense.proofpoint.com/v2/url?u=https-3A__builds.apache.org_blue_organizations_jenkins_drill-2Dscm_activity=DwIBaQ=cskdkSMqhcnjZxdQVpwTXg=4eQVr8zB8ZBff-yxTimdOQ=E5wHe7hUBxUtMNWgSGNovsXsqnzW_B0VpaeTV46JRbc=6oExtsK1UUNIOcLmk5QpNf4SZZOHKpnwgpp8jNEmbEw=


Kind regards
Vitalii

On Tue, Apr 3, 2018 at 2:29 PM, Pritesh Maker  wrote:

> We will have our bi-weekly hangout today Apr 3rd at 10 am PST.
>
> Please reply to this post with proposed topics to discuss.
>
>
> Hangout link:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__plus.
> google.com_hangouts_-5F_event_ci4rdiju8bv04a64efj5fedd0lc=DwIBaQ=
> cskdkSMqhcnjZxdQVpwTXg=zySISmkmM4WNViCKijENtQ=TLTWkiVI6X1oT5VN_HKSm5S-
> Jx1XaSbkgb7NaQVFX6A=eZDHSD7ZkSGgMrxqiXHHDG1norNoCE_Gd8NDImWyla4=
>
>
>


[GitHub] drill issue #1201: DRILL-4091: Support for additional gis operations in gis ...

2018-04-03 Thread kkhatua
Github user kkhatua commented on the issue:

https://github.com/apache/drill/pull/1201
  
@cgivre could you review this?


---


[GitHub] drill issue #1198: DRILL-6294: Changes to support Calcite 1.16.0

2018-04-03 Thread vvysotskyi
Github user vvysotskyi commented on the issue:

https://github.com/apache/drill/pull/1198
  
@chunhui-shi, I have addressed all CR comments, could you please take a 
look again?


---


Re: Drill Hangout today at 10 am PST

2018-04-03 Thread Vitalii Diravka
I have a small topic/question regarding "Jenkins builds" for drill-scm
project [1]
It started to fail due to the old Java version.
I'm interested in the possibility of updating the JDK version there.

1. https://builds.apache.org/blue/organizations/jenkins/drill-scm/activity


Kind regards
Vitalii

On Tue, Apr 3, 2018 at 2:29 PM, Pritesh Maker  wrote:

> We will have our bi-weekly hangout today Apr 3rd at 10 am PST.
>
> Please reply to this post with proposed topics to discuss.
>
>
> Hangout link:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__plus.
> google.com_hangouts_-5F_event_ci4rdiju8bv04a64efj5fedd0lc=DwIBaQ=
> cskdkSMqhcnjZxdQVpwTXg=zySISmkmM4WNViCKijENtQ=TLTWkiVI6X1oT5VN_HKSm5S-
> Jx1XaSbkgb7NaQVFX6A=eZDHSD7ZkSGgMrxqiXHHDG1norNoCE_Gd8NDImWyla4=
>
>
>


[GitHub] drill pull request #1201: DRILL-4091: Support for additional gis operations ...

2018-04-03 Thread ChrisSandison
GitHub user ChrisSandison opened a pull request:

https://github.com/apache/drill/pull/1201

DRILL-4091: Support for additional gis operations in gis contrib module

This based off of PR https://github.com/apache/drill/pull/258. I have 
attempted to address the comments from the PR left by @cgivre 

Some additional notes:

- in `STUnionAggregate.java`, there was a comment mentioning null handling, 
but I'm unable to find any null handling support when defining interfaces for 
aggregate functions.
- I have removed one of the failing tests from 
`TestGeometryFunctions.java`. The failing tests were attempting to test 
non-point geometry on functions that require it. While returning `Double.NaN` 
seems appropriate for the situation, I found two issues with testing this:

1. `DefaultSequelHandler.transform()` casts the results of the 
`Float8Holder` to `BigInteger` which does not have a representation for `NaN`. 
As a result, this is returning a `UserException` giving a system error, which 
would be the expected behaviour. However, I can't seem to test for this 
exception given the test builder that is used.
2. No other tests in the file are testing for invalid geometry. This is 
perhaps a larger issue.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ChrisSandison/drill DRILL-4091

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1201.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1201


commit cb819d706f085cab761f4047682ff8e1086fdc57
Author: potocki 
Date:   2015-11-16T13:05:18Z

Support for additional gis operations (relate, contains, touches, union, 
get x y of a point and more)

commit a2674b3e99e79e3e32ffcb82a39b3d1956f43b13
Author: potocki 
Date:   2015-12-03T07:51:31Z

cleaning

commit 31656350bbdc446d50fe215074c28e57afe7783d
Author: potocki 
Date:   2016-01-18T13:27:41Z

geometry coordinates transformation using proj4j

commit 3ef4c082def3b1f088ed0afccb2cb2ffa6c44f20
Author: potocki 
Date:   2016-01-18T13:50:03Z

ST_Distance function

commit 0e2e2d8b241f8e9d16b7d4d5d9562f58450a1670
Author: potocki 
Date:   2016-01-19T07:39:02Z

test for srid transformation query

commit 053678b104d3427116ed8ffd3ac84ee12d28b04b
Author: potocki 
Date:   2016-01-19T09:03:27Z

fixed style and missing file

commit 18c993096b523f8aece9cd5fac7d0c3110f3bbac
Author: potocki 
Date:   2016-02-11T12:50:46Z

aggregate version of ST_Union function

commit 9bf8ba66570cca078e0ff6aee27c02908801a6b6
Author: potocki 
Date:   2017-10-26T21:16:34Z

rename aggregate version of st_union to st_unionaggregate. minor 
reformatting

commit 9d0ebffb3b1d00d97c225130568f381ddc826c0d
Author: potocki 
Date:   2017-10-26T21:16:51Z

ST_Envelope, ST_X[Min|Max], ST_Y[Min|Max] geobounds functions

commit 95c0e5e01a413562b7ca4c4cfb4b4d657203d5ec
Author: potocki 
Date:   2017-11-08T19:36:15Z

fixes in union aggregate + tests

commit fa6239c681514e8d57437c1e09cedd4715344fbc
Author: chris 
Date:   2018-04-03T14:23:56Z

DRILL-4091: Addressed comments and fixed tests




---


Drill Hangout today at 10 am PST

2018-04-03 Thread Pritesh Maker
We will have our bi-weekly hangout today Apr 3rd at 10 am PST.

Please reply to this post with proposed topics to discuss.


Hangout link:
https://urldefense.proofpoint.com/v2/url?u=https-3A__plus.google.com_hangouts_-5F_event_ci4rdiju8bv04a64efj5fedd0lc=DwIBaQ=cskdkSMqhcnjZxdQVpwTXg=zySISmkmM4WNViCKijENtQ=TLTWkiVI6X1oT5VN_HKSm5S-Jx1XaSbkgb7NaQVFX6A=eZDHSD7ZkSGgMrxqiXHHDG1norNoCE_Gd8NDImWyla4=




[GitHub] drill issue #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsException ...

2018-04-03 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1144
  
I think we need to include a few other folks into this. @paul-rogers, 
@sachouche, have also looked into the issue of excessive bounds checking and 
ways to write to direct memory with minimum overhead.

Both Salim and Paul have done work which has tried to eliminate the 
excessive checking and use `PlatformDependent` directly, so it might be the 
right time to agree on the right approach here. At a high level, I believe 
there is agreement that we need to 1) reduce bounds checking to (preferably) 
once per write, and 2) to minimise the number of function calls before memory 
is actually written to.

We have three layers where we could potentially check bounds - i) the 
operators, ii) the vectors, iii) DrillBuf. Right now, we do so at each level, 
at multiple times at that. Paul's work on batch sizing provides us with a layer 
that gives us the bounds check guarantees at the operator level. This means we 
could potentially use value-vectors' set methods (as opposed to the setSafe 
methods) and DrillBuf can use PlatformDependent directly. 

Some caveats - 
UDLE's check for and enforce little-endianness. Checking for endianness is 
important for value vectors because they assume little endian,  but the 
enforcement is sometimes not so desirable. Drill's Java client uses the same 
DrillBuf backed by a UDLE and that means that client applications can no longer 
run on big endian machines (and yes, I have heard this request from end users). 
However, the fact is that UDLE's are an intrinsic part of the drill-memory 
design [1] [2]. Eliminating UDLE's can lead to re-doing large parts of very 
well tested code.

The caveat to using the vector set methods is that the setSafe methods 
provide resizing capability that operators have come to rely upon. Switching 
from setSafe to set breaks practically every operator. 


[1] 
https://github.com/jacques-n/drill/blob/DRILL-4144/exec/memory/base/src/main/java/org/apache/drill/exec/memory/README.md
[2] 
https://docs.google.com/nonceSigner?nonce=nj279efks0ro0=https://doc-0o-as-docs.googleusercontent.com/docs/securesc/gipu3hlcf22l6svruqr71h7qe2k3djum/5v7eb2cm4bghq76nj658ai5hkk9h52ur/152274960/11021365158327727934/11021365158327727934/0B6CmYjIAywyCalVwcURkaFlkc1U?e%3Ddownload=41l8jspccbj1pp63750c5von8ol4ijtl

 






---


Re: "Death of Schema-on-Read"

2018-04-03 Thread Parth Chandra
This, of course, begs the question [1], doesn't it?

If you have the schema, then you have either a) spent time designing and
documenting your data (both the schema and dictionary containing the
semantics) or b) spent time "finding, interpreting, and cleaning data" to
discover the data schema and dictionary.

Data that has "no documentation beyond attribute names, which may be
inscrutable, vacuous, or even misleading" will continue to be so even after
you specify the schema.

Asking users to design their schemas when they have already accumulated
data that is unclean and undocumented is asking them to do the work that
they use your software for in the first place.

The goal of schema on read is to facilitate the task of interpreting the
data that already exists, is mutating, and is undocumented (or documented
badly).


[1] https://en.wikipedia.org/wiki/Begging_the_question


On Mon, Apr 2, 2018 at 11:16 AM, Paul Rogers 
wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-
> death-of-schema-on-read?imm_mid=0fc3c6=em-data-na-na-newsltr_20180328
>
>
>
>
>


[GitHub] drill pull request #814: Add clickable localhost URL

2018-04-03 Thread ebuildy
Github user ebuildy closed the pull request at:

https://github.com/apache/drill/pull/814


---


[GitHub] drill issue #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread kkhatua
Github user kkhatua commented on the issue:

https://github.com/apache/drill/pull/1200
  
@paul-rogers / @Ben-Zvi  , could you review this?


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-03 Thread kkhatua
GitHub user kkhatua opened a pull request:

https://github.com/apache/drill/pull/1200

DRILL-143: Support CGROUPs resource management

Introduces the `DRILLBIT_CGROUP` option in defined in `drill-env.sh`
The startup script checks if the specified CGroup (ver 2) is available and 
tries to apply it to the launched Drillbit JVM.
This would benefit not just Drill-on-YARN usecases, but  any setup that 
would like CGroups for enforcement of (cpu) resources management.

e.g when Drillbit is configured to use `drillcpu` cgroup
```
[root@maprlabs ~]# 
/opt/mapr/drill/apache-drill-1.14.0-SNAPSHOT/bin/drillbit.sh restart
Stopping drillbit
..
Starting drillbit, logging to /var/log/drill/drillbit.out
WARN: Drillbit's CPU resource usage will be managed under the CGroup : 
drillcpu (up to 4.00 cores allowed)
```

e.g. Non-existent CGroup `droolcpu` is used
```
[root@maprlabs ~]# 
/opt/mapr/drill/apache-drill-1.14.0-SNAPSHOT/bin/drillbit.sh restart
Stopping drillbit
..
Starting drillbit, logging to /var/log/drill/drillbit.out
ERROR: cgroup droolcpu does not found. Ensure that daemon is running and 
cgroup exists
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kkhatua/drill DRILL-143

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1200.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1200


commit e9a2551b315b4395c5227dad017f2b4340f41108
Author: Kunal Khatua 
Date:   2018-04-03T05:35:53Z

DRILL-143: Support CGROUPs resource management

Introduces the DRILLBIT_CGROUP option in drill-env.sh.
The startup script checks if the specified CGroup (ver 2) is available and 
tries to apply it to the launched Drillbit JVM.
This would benefit not just Drill-on-YARN usecases, but  any setup that 
would like CGroups for enforcement of (cpu) resources management.

e.g when Drillbit is configured to use `drillcpu` cgroup
```
[root@maprlabs ~]# 
/opt/mapr/drill/apache-drill-1.14.0-SNAPSHOT/bin/drillbit.sh restart
Stopping drillbit
..
Starting drillbit, logging to /var/log/drill/drillbit.out
WARN: Drillbit's CPU resource usage will be managed under the CGroup : 
drillcpu (up to 4.00 cores allowed)
```

e.g. Non-existent CGroup `droolcpu` is used
```
[root@kk127 ~]# 
/opt/mapr/drill/apache-drill-1.14.0-SNAPSHOT/bin/drillbit.sh restart
Stopping drillbit
..
Starting drillbit, logging to /var/log/drill/drillbit.out
ERROR: cgroup droolcpu does not found. Ensure that daemon is running and 
cgroup exists
```




---