[jira] [Resolved] (DRILL-4909) Refinements to Drill web UI - Query profile page

2018-04-04 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua resolved DRILL-4909.
-
Resolution: Fixed

> Refinements to Drill web UI - Query profile page
> 
>
> Key: DRILL-4909
> URL: https://issues.apache.org/jira/browse/DRILL-4909
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Web Server
>Affects Versions: 1.8.0
>Reporter: Paul Rogers
>Assignee: Kunal Khatua
>Priority: Minor
> Fix For: Future
>
>
> The plan I'm looking at has hundreds of nodes. It takes a long time to scroll 
> around the pages to get to the top. Fix the tab bar at the top of the page to 
> simplify navigation. "The one with Physical Plan, Visualization, etc."
> On the Physical Plan page: The top of the page displays histogram of minor 
> fragment execution. However, it is hard to infer what it displays.
> * Label the x-axis. The units seem to be seconds, but a legend of: "Runtime 
> (sec.)" would help. ( DRILL-5801 )
> * Label the y-axis. Seems to be colored by major fragment, lines by minor 
> fragment. But, took some sleuthing to figure this out.
> * Tooltip on each color band to identify the major fragment. (Probably too 
> fiddly to label minor fragment lines.) ( DRILL-5801 )
> * Choose a wider palette of colors. On my chart, the top two groups are 
> shades of organge, the third is blue. Seems we could rotate among the 
> standard set of colors for better contrast. ( _Need to identify a good small 
> palette_ )
> In the tables:
> * -For each operator, list the number of rows processed. (Available in the 
> details already.)- ( DRILL-5195 )
> * In the table that sumarizes major fragments, have as a tool-tip the names 
> of the minor fragments to give the numbers some meaning. That is, hovering 
> over 00-xx-xx should say "Project, Merging Receiver".
> * In the table that shows minor fragments for major fragments, either add a 
> list of minor fragment names to the title, or as a pop-up. That is, in the 
> heading that says, "Major Fragment: 02-xx-xx", add "(PARQUET_ROW_GROUP_SCAN, 
> PROJECT, ...)
> * -For each minor fragment, label the host on which it runs- ( DRILL-5803 )
> * For larger queries, shows groups by host, expanded to show fragments. (It 
> is hard to read, say, 400 minor fragments in one big table. Showing 20 nodes 
> (with summaries) is easier, each expanding to show 20 minor fragments.
> In the Operator Profiles overview, add a tool-tip with details about each 
> operator such as:
> * Number of vector allocations
> * Number of vector extensions (increasing the size of vectors)
> * Average vector utilization (ratio of selected to unselected rows)
> * Average batch size: number of rows, bytes per row, bytes per batch
> For scanners:
> * Number of files scanned
> * Number of schemas found
> * Number of bytes read (or file length if a table scan)
> * Name of the file scanned (or first several if a group)
> For filters:
> * Rows in, rows out and selectivity (as a ratio)
> In the operator detail table:
> * -Add a line for totals (records, batches)- ( DRILL-5195 )
> * Add a line for averages (most fields)
> Under the "Full JSON Profile", part of the JSON is formatted, but the plan 
> part is not. Display the plan in a formatted version (with proper 
> indentation). It is not very useful in the current, streamed, non-indented 
> form.
> Better, move the JSON Profile to a new tab since it causes the Physical Plan 
> page to get too large for large queries.
> On the Visualized Plan page,
> * The coloring of the fragments does not match the coloring used in the chart 
> on the Physical Plan page. Please use the same to make them easier to 
> correlate.
> * Perhaps enclose each fragment in a box (with the border passing through the 
> middle of each eachange operator.
> * For each aspect of the plan, provide basic stats such as number of minor 
> fragments, number of records, average time.
> * For each node, provide a link to the Physical Plan page to see more detail.
> * The visualized plan page shows the same info as the Physical Plan page. 
> Better, keep each page focused and make it easy to navigate between them.
> * Naming is inconsistent between this page and the Physical Plan page. 
> "HASH_AGGREGATE" on the Physical Plan page, "HashAgg" on the visualization 
> page.
> On the Edit Query page, the actual field to edit the query is two lines long 
> (but my query is over a dozen lines.) Then, the rest of the page repeats the 
> chart, details, etc. Seems that, if I want to edit the query, I should have a 
> field large enough to do so. And, I don't need the other info (I'll go to the 
> corresponding tab it I want to see it.)
> Then, regarding the Query page and Edit query, can these be combined? On the 
> Query page, I see the 

[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-04-04 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r179344792
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/MemoryUtils.java
 ---
@@ -0,0 +1,199 @@
+/**
--- End diff --

I refer to `/**`, please use `/*`


---


[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-04-04 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r179321187
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VLAbstractEntryReader.java
 ---
@@ -0,0 +1,214 @@

+/***
--- End diff --

Please use '/*' and '*/`


---


[GitHub] drill issue #1200: DRILL-143: Support CGROUPs resource management

2018-04-04 Thread kkhatua
Github user kkhatua commented on the issue:

https://github.com/apache/drill/pull/1200
  
@Ben-Zvi / @paul-rogers 
I discovered that the path `/cgroup` is unique only to my installation of 
the package on CentOS. The standard path is `/sys/fs/cgroup`.
So, I've made changes that allow for users to specify the CGroups location. 
Without this, the latest commit complains as expected:
```
[root@kk127 ~]#  
/opt/mapr/drill/apache-drill-1.14.0-SNAPSHOT/bin/drillbit.sh restart
Stopping drillbit
..
Starting drillbit, logging to /var/log/drill/drillbit.out
ERROR: CGroup drillcpu not found. Ensure that daemon is running, 
SYS_CGROUP_DIR is correctly set (currently, /sys/fs/cgroup ), and that the 
CGroup exists
```
The `drill-env.sh` also specifies that the enforcement is only for CPU. 
I'll put this in the documentation of the feature as well.


---


Re: Drill Java doc

2018-04-04 Thread Kunal Khatua
Agreed. We'll do it as part of the next release and try to get one for 
Drill-1.13.0 in the meanwhile.
On 4/4/2018 2:33:47 PM, Vlad Rozov  wrote:
I'd suggest to make complete Java doc available as part of the release
process. For example Hadoop API is available
https://hadoop.apache.org/docs/stable/api/index.html

Thank you,

Vlad

On 4/4/18 07:28, Kunal Khatua wrote:
> Looks like it was done only for JDBC with release 1.2 .
>
> https://drill.apache.org/api/1.2/jdbc/
>
>
> On 4/4/2018 7:23:08 AM, Vlad Rozov wrote:
> Is Drill Java doc hosted on http://drill.apache.org or any other site?
>
> Thank you,
>
> Vlad
>



[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-04-04 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r179296715
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java
 ---
@@ -161,30 +162,25 @@ public ParquetRecordReader(
   ParquetMetadata footer,
   List columns,
   ParquetReaderUtility.DateCorruptionStatus dateCorruptionStatus) 
throws ExecutionSetupException {
-this.name = path;
-this.hadoopPath = new Path(path);
-this.fileSystem = fs;
-this.codecFactory = codecFactory;
-this.rowGroupIndex = rowGroupIndex;
-this.batchSize = batchSize;
-this.footer = footer;
+
--- End diff --

Please avoid changes that affect code style only.


---


[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-04-04 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r179296101
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/MemoryUtils.java
 ---
@@ -0,0 +1,199 @@
+/**
--- End diff --

Vlad, can you please explain what you meant? are you only referring for 
specific members or is it a general comment?



---


[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-04-04 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r179295602
  
--- Diff: exec/java-exec/pom.xml ---
@@ -836,6 +836,14 @@
 org.apache.maven.plugins
 maven-surefire-plugin
   
+  
+org.apache.maven.plugins
+maven-compiler-plugin
+2.3.2
+
+-XDignore.symbol.file
--- End diff --

Will do!


---


[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-04-04 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r179295162
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/MemoryUtils.java
 ---
@@ -0,0 +1,199 @@
+/**
--- End diff --

Please do not use javadoc comment.


---


[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2018-04-04 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r179293968
  
--- Diff: exec/java-exec/pom.xml ---
@@ -836,6 +836,14 @@
 org.apache.maven.plugins
 maven-surefire-plugin
   
+  
+org.apache.maven.plugins
+maven-compiler-plugin
+2.3.2
+
+-XDignore.symbol.file
--- End diff --

Consider using `io.netty.util.internal.PlatformDependent` instead of 
`sun.misc.Unsafe`. 


---


[GitHub] drill issue #1200: DRILL-143: Support CGROUPs resource management

2018-04-04 Thread kkhatua
Github user kkhatua commented on the issue:

https://github.com/apache/drill/pull/1200
  
@paul-rogers very good point about the issue of the heartbeat thread 
competing with other Drillbit process threads for CPU. When writing up the 
documentation for this, we'll have to ensure that users also configure the 
parallelization to a reasonable level to ensure that CPU thrashing is minimized 
with CGroups enforced. 


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-04 Thread kkhatua
Github user kkhatua commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179292376
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
+  echo $dbitPid > /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs
+  # Verify Enforcement
+  cgroupStatus=`grep -w $pid 
/cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs`
+  if [ -z "$cgroupStatus" ]; then
+#Ref: 
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
+let cpu_quota=`cat /cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_quota_us`
+let cpu_period=`cat 
/cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_period_us`
+if [ $cpu_period -gt 0 ] && [ $cpu_quota -gt 0 ]; then
+  coresAllowed="(up to "`echo $(( 100 * $cpu_quota / $cpu_period 
)) | sed 's/..$/.&/'`" cores allowed)"
+fi
+echo "WARN: Drillbit's CPU resource usage will be managed under 
the CGroup : $DRILLBIT_CGROUP "$coresAllowed
+  else
+echo "ERROR: Failed to add Drillbit to CGroup ( $DRILLBIT_CGROUP ) 
for resource usage management.  Ensure that the cgroup manages CPU"
+exit 1
+  fi
+else
+  echo "ERROR: cgroup $DRILLBIT_CGROUP does not found. Ensure that 
daemon is running and cgroup exists"
+  exit 1
--- End diff --

Good point. Ideally, we should prevent the Drillbit from starting up (or in 
this case, it should shut down), if CGroups couldn't be applied. My concern is 
that if CGroups is not being enforced, we're running a process that can 
(potentially) consume excess CPU resources. 
Should I shut down the Drillbit in such a scenario, or move on with just a 
WARN message?


---


Re: Drill Java doc

2018-04-04 Thread Vlad Rozov
I'd suggest to make complete Java doc available as part of the release 
process. For example Hadoop API is available 
https://hadoop.apache.org/docs/stable/api/index.html


Thank you,

Vlad

On 4/4/18 07:28, Kunal Khatua wrote:

Looks like it was done only for JDBC with release 1.2 .

https://drill.apache.org/api/1.2/jdbc/


On 4/4/2018 7:23:08 AM, Vlad Rozov  wrote:
Is Drill Java doc hosted on http://drill.apache.org or any other site?

Thank you,

Vlad





[jira] [Created] (DRILL-6308) Exception handling - OOM

2018-04-04 Thread Khurram Faraaz (JIRA)
Khurram Faraaz created DRILL-6308:
-

 Summary: Exception handling - OOM
 Key: DRILL-6308
 URL: https://issues.apache.org/jira/browse/DRILL-6308
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Flow
Affects Versions: 1.14.0
Reporter: Khurram Faraaz
Assignee: Sorabh Hamirwasia


For RpcHandlers in Drill, when there is an OutOfMemory condition ,then all the 
handlers call OutOfMemoryHandler with the caught exception. Every channel, 
except for DataServer the default instance of MemoryHandler is used which 
throws UnsupportedOperationException. This is misleading since actual exception 
is for OutOfMemory not UnsupportedOperationException.

 

Here is an example where it is not handled the right way.
{noformat}
Caused by: java.lang.UnsupportedOperationException: null
at 
org.apache.drill.exec.rpc.OutOfMemoryHandler$1.handle(OutOfMemoryHandler.java:25)
 ~[drill-rpc-1.13.0-mapr.jar:1.13.0-mapr]
at 
org.apache.drill.exec.rpc.SaslEncryptionHandler.encode(SaslEncryptionHandler.java:170)
 ~[drill-rpc-1.13.0-mapr.jar:1.13.0-mapr]
at 
org.apache.drill.exec.rpc.SaslEncryptionHandler.encode(SaslEncryptionHandler.java:44)
 ~[drill-rpc-1.13.0-mapr.jar:1.13.0-mapr]
at 
io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
 [netty-codec-4.0.48.Final.jar:4.0.48.Final]
... 25 common frames omitted{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-04 Thread kkhatua
Github user kkhatua commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179287166
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
+  echo $dbitPid > /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs
+  # Verify Enforcement
+  cgroupStatus=`grep -w $pid 
/cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs`
+  if [ -z "$cgroupStatus" ]; then
+#Ref: 
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
+let cpu_quota=`cat /cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_quota_us`
+let cpu_period=`cat 
/cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_period_us`
+if [ $cpu_period -gt 0 ] && [ $cpu_quota -gt 0 ]; then
+  coresAllowed="(up to "`echo $(( 100 * $cpu_quota / $cpu_period 
)) | sed 's/..$/.&/'`" cores allowed)"
+fi
+echo "WARN: Drillbit's CPU resource usage will be managed under 
the CGroup : $DRILLBIT_CGROUP "$coresAllowed
--- End diff --

Thought that since the Drill process is being brought up in a restricted 
mode, it should appear as a WARN.
I can switch it to an INFO because there is nothing compelling to use WARN.
Will make the change on the variable within quotes.


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-04 Thread kkhatua
Github user kkhatua commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179286598
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
+  echo $dbitPid > /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs
+  # Verify Enforcement
+  cgroupStatus=`grep -w $pid 
/cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs`
+  if [ -z "$cgroupStatus" ]; then
+#Ref: 
https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
+let cpu_quota=`cat /cgroup/cpu/${DRILLBIT_CGROUP}/cpu.cfs_quota_us`
--- End diff --

Ok. Will fix this 


---


[GitHub] drill issue #1199: DRILL-6303: Provide a button to copy the Drillbit's JStac...

2018-04-04 Thread priteshm
Github user priteshm commented on the issue:

https://github.com/apache/drill/pull/1199
  
Nice work, @kkhatua! This seems very useful!


---


Re: "Death of Schema-on-Read"

2018-04-04 Thread Jinfeng Ni
I feel it's probably premature to cal it "death of schema-on-read" just
based on one application case. For one product I have been working on
recently, one use case is for IOT related application where data is sent
from a variety of small devices (sensors, camera, etc). It would be a hard
requirement to pre-define schema upfront for each device, before write data
into the system. Further, the value of data is likely to decrease
significantly over time; data within hours/days is way more important than
that of weeks/months ago. It's unimaginable to wait for weeks to run data
clean/preparation job, before user could query such data. In other words,
for application with requirements of  flexibility and time-sensitivity,
'schema-on-read' provides a huge benefit, compared with traditional
ETL-then-query approach.

Drill's schema-on-read is actually trying to solve a rather hard problem,
in that we deal with not only relational type, but also nested type. In
that sense, Drill is walking in an uncharted territory where not many
others are doing similar things.  Dealing with undocumented/unstructured
data is a big challenge. Although Drill's solution is not perfect, IMHO,
it's still a big step towards such a problem.

With that said, I agreed with points people raised earlier. In addition to
"schema-on-read", Drill has to do a better to handle the traditional cases
where schema is known beforehand, by introducing a meta-store /catalog, or
by allowing users to declare schema upfront ( I probably will not call
Drill "schema-forbidden"). The restart strategy seems to be also
interesting to handle failure caused by missing schema / schema change.




On Tue, Apr 3, 2018 at 10:01 PM, Ted Dunning  wrote:

> Well, the restart strategy still works for your examples. And you only pay
> once. From them you look at the cached type information and used an upper
> bound data type as you read the data. Since it works to read the values in
> the right order, it is obviously possible to push down typing information
> even into the json reader.
>
>
>
> On Tue, Apr 3, 2018, 21:42 Paul Rogers  wrote:
>
> > Subtle point. I can provide schema with Parquet, as you note. (Actually,
> > for Parquet, Drill is schema-required: I can't not provide a schema due
> to
> > the nature of Parquet...)
> >
> > But, I can't provide a schema for JSON, CSV, etc. The point is, Drill
> > forbids the user from providing a schema; only the file format itself can
> > provide the schema (or not, in the case of JSON). This is the very heart
> of
> > the problem.
> >
> > The root cause of our schema change exception is that vectors are,
> indeed,
> > strongly typed. But, file columns are not. Here is my favorite:
> >
> > {x: 10} {x: 10.1}
> >
> > Blam! Query fails because the vector is chosen as BigInt, then we
> discover
> > it really should have been Float8. (If the answer is: go back and rebuild
> > the vector with the new type, consider the case that 100K records
> separate
> > the two above so that the first batch is long gone by the time we see the
> > offending record. If only I could tell Drill to use Float8 (or Decimal)
> up
> > front...
> >
> > Views won't help here because the failure occurs before a view can kick
> > in. However, presumably, I could write a view to handle a different
> classic
> > case:
> >
> > myDir /
> > |- File 1: {a: 10, b: "foo"}
> > |- File 2: {a: 20}
> >
> > With query: SELECT a, b FROM myDir
> >
> > For File 2, Drill will guess that b is a Nullable Int, but it is really
> > VarChar. I think I could write clever SQL that says:
> >
> > If b is of type Nullable Int, return NULL cast to nullable VarChar, else
> > return b
> >
> > The irony is that I must to write procedural code to declare a static
> > attribute of the data. Yet SQL is otherwise declarative: I state what I
> > want, not how to implement it.
> >
> > Life would be so much easier if I could just say, "trust me, when you
> read
> > column b, it is a VarChar."
> >
> > Thanks,
> > - Paul
> >
> >
> >
> > On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning <
> > ted.dunn...@gmail.com> wrote:
> >
> >  I don't see why you say that Drill is schema-forbidden.
> >
> > The Parquet reader, for instance, makes strong use of the implied schema
> to
> > facilitate reading of typed data.
> >
> > Likewise, the vectorized internal format is strongly typed and, as such,
> > uses schema information.
> >
> > Views are another way to communicate schema information.
> >
> > It is true that you can't, say, view comments on fields from the command
> > line. But I don't understand saying "schema-forbidden".
> >
> >
> > On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers 
> > wrote:
> >
> > > Here is another way to think about it. Today, Drill is
> > "schema-forbidden":
> > > even if I know the schema, I can't communicate that to Drill; Drill
> must
> > > figure it out on its own, making the same mistakes every 

[GitHub] drill issue #1199: DRILL-6303: Provide a button to copy the Drillbit's JStac...

2018-04-04 Thread kkhatua
Github user kkhatua commented on the issue:

https://github.com/apache/drill/pull/1199
  
@arina-ielchiieva 

I've added 2 snapshots and tested in Chrome.

**Mouse Over**, showing tooltip

![mouseover](https://user-images.githubusercontent.com/4335237/38327176-74ac0aac-37fc-11e8-9a0e-d5b5ad638460.png)

**On clicking**, the copy of the text to the clipboard, followed by the 
alert.

![mouseonclick](https://user-images.githubusercontent.com/4335237/38327185-79c0377a-37fc-11e8-955f-5dd9ad7bfaea.png)



---


Re: [DISCUSS] DrillBuf

2018-04-04 Thread Paul Rogers
Hi Vlad,

Would be great to get insight from the original authors. Here ismy two cents as 
a late comer who made extensive use of the classes in question.

Many of your questions are at the implementation level. It is worth looking at 
the question from two other perspectives: history and design.

Historically, Drill adopted Netty for networking, and wisely looked for ways of 
using the same buffers for both network transfer and internal operations to 
avoid copies. Some overview is in [1]. In this view, a Drill vector is a 
network buffer. Network buffers use the ByteBuffer protocol to serialize binary 
values. DrillBuf follows that model for the most part. Because a ByteBuffer is 
a low-level abstraction over a buffer, each operation must perform bounds 
checks to ensure safe operation.


DrillBuf provides the ability to present a "view" of a slice of a larger 
underlying buffer. For example, when reading data from a spill file, all data 
for all internal vectors is read into a single buffer. For a nullable VarChar, 
for example, the buffer contains the bit vectors, the offset vectors and the 
data vectors. The value vectors point to DrillBufs which point to a slice of 
the underlying buffers. It is this layout (there are at least three different 
layouts) that makes our "record batch sizer" so complex: the size of memory 
used is NOT the sum of the DrillBufs.

Drill is a columnar system. So, the team introduced a typed "vector" 
abstraction.Value vectors provide an abstraction that sweeps away the 
ByteBuffer heritage and replaces it with a strongly typed, accessor/mutator 
structure that works in terms of Drill data types and record counts. Vectors 
also understand the relationship between bit vectors and data vectors, between 
offset vectors and data vectors, and so on.

Your question implies a desire to think about the future direction. Two things 
to say. First, vectors themselves do not provide sufficient abstraction for the 
needs of operators. As a result, operators become very complex, we must 
generate large amounts of boiler-plate code, and we fix the same bugs over and 
over. These issues are discussed at length in [2]. This is the motivation for 
the result set reader and loader.

The row set abstractions encapsulate not just knowledge of a vector, but of the 
entire batch. As a result, these abstractions know the number of records, know 
the vector and batch size targets, and track vectors as they fill. One key 
result is that these abstractions ensure that data is read or written within 
the bounds of each buffer, eliminating the need for bounds checks on every 
access.


The other consideration is memory management. Drill has a very complex, but 
surprisingly robust, memory management system. However, it is based on a 
"malloc" model of memory with operators negotiating among themselves (via the 
OUT_OF_MEMORY iterator status) about who needs memory and who should release 
it. [2] discusses the limitations of this system. As a result, we've been 
moving to a budget-based system in which each fragment and operator is given a 
budget based on total available memory, and operators use spilling to stay 
within the budget.

Memory fragmentation is a classic problem in malloc-based systems which strive 
to operate at high memory utilization rates and which do not include memory 
compaction. Drill is such a system. So, if this issue ever prevents Drill from 
achieving maximum performance, we can consider the classic system used by 
databases to solve this problem: fixed-size memory blocks.

If we were to move to fixed-size buffers, we'd want the row set and vector 
abstractions to remain unchanged. We'd only want to replace DrillBuf with a new 
block-based abstraction, perhaps with chaining (a vector may consist of a chain 
of, say, 1 MB blocks.) The buffer slicing mechanism would become unnecessary, 
as would the existing malloc-based allocator. Instead, data would be read, 
written and held in buffers allocated from and returned to a buffer pool.


We may or may not ever make such a change. But, by considering this 
possibility, we readily see that DrillBuf should be an implementation detail of 
the higher-level abstractions and that operators should only use those 
higher-level abstractions because doing so isolates operators from the details 
of memory layout. This argument applies even more so to the abstractions below 
DrillBuf: UDLE, Netty ByteBuf, ledgers and so on.

Said another way, even with the current system, we should be free to improve 
DrillBuf on down with no impact to operator code because vectors and the row 
set abstractions should be the only clients of DrillBuf.


In short, by understanding the history of the code, and agreeing upon the right 
design abstractions, we can then make informed decisions about how best to 
improve our low-level abstractions, including DrillBuf.


Thanks,

- Paul

[1] http://drill.apache.org/docs/value-vectors/
[2] 

[GitHub] drill pull request #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsExc...

2018-04-04 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1144#discussion_r179231151
  
--- Diff: src/main/resources/checkstyle-config.xml ---
@@ -30,10 +30,15 @@
 
 
 
+
--- End diff --

I commented out the default rule value with the addition of 
IndexOutOfBoundsException. IndexOutOfBoundsException can be used, but it is not 
common to handle it.


---


[GitHub] drill pull request #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsExc...

2018-04-04 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1144#discussion_r179229891
  
--- Diff: 
exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/BaseScalarWriter.java
 ---
@@ -211,7 +211,7 @@ protected boolean canExpand(int delta) {
 
   protected void overflowed() {
 if (listener == null) {
-  throw new IndexOutOfBoundsException("Overflow not supported");
+  throw new UnsupportedOperationException("Overflow not supported");
--- End diff --

I also considered `OversizedAllocationException` but IMO 
`UnsupportedOperationException` is a better fit.


---


[GitHub] drill pull request #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsExc...

2018-04-04 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1144#discussion_r179226227
  
--- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---
@@ -777,23 +778,20 @@ public int getActualMemoryConsumed() {
* @return A hex dump in a String.
*/
   public String toHexString(final int start, final int length) {
-final int roundedStart = (start / LOG_BYTES_PER_ROW) * 
LOG_BYTES_PER_ROW;
-
-final StringBuilder sb = new StringBuilder("buffer byte dump\n");
-int index = roundedStart;
-for (int nLogged = 0; nLogged < length; nLogged += LOG_BYTES_PER_ROW) {
-  sb.append(String.format(" [%05d-%05d]", index, index + 
LOG_BYTES_PER_ROW - 1));
-  for (int i = 0; i < LOG_BYTES_PER_ROW; ++i) {
-try {
-  final byte b = getByte(index++);
-  sb.append(String.format(" 0x%02x", b));
-} catch (IndexOutOfBoundsException ioob) {
-  sb.append(" ");
-}
+Preconditions.checkArgument(start >= 0);
+final StringBuilder sb = new StringBuilder("buffer byte dump");
+final int end = Math.min(length, this.length - start);
+for (int i = 0; i < end; i++) {
+  if (i % LOG_BYTES_PER_ROW == 0) {
+sb.append(String.format("%n [%05d-%05d]", i + start, Math.min(i + 
LOG_BYTES_PER_ROW - 1, end - 1) + start));
   }
-  sb.append('\n');
+  byte b = _getByte(i + start);
+  sb.append(" 0x").append(HEX_CHAR[b >> 4]).append(HEX_CHAR[b & 0x0F]);
--- End diff --

The goal is to avoid new String allocation for every output byte inside 
DrillBuf.


---


[DISCUSS] DrillBuf

2018-04-04 Thread Vlad Rozov
I have several questions and concerns regarding DrillBuf usage, design 
and implementation. There is a limited documentation available for the 
subject (Java doc, 
https://github.com/apache/drill/blob/master/exec/memory/base/src/main/java/org/apache/drill/exec/memory/README.md 
and https://github.com/paul-rogers/drill/wiki/Memory-Management) and I 
hope that a few members of the community may have more information.


What are the design goals behind DrillBuf? It seems like it is supposed 
to be Drill access gate for direct byte buffers. How is it different 
(for that goal) from UnsafeDirectLittleEndian? Both use 
wrapper/delegation pattern, with DrillBuf delegating to 
UnsafeDirectLittleEndian (not always) and UnsafeDirectLittleEndian 
delegating to ByteBuf it wraps. Is it necessary to have both? Are there 
any out of the box netty classes that already provide required 
functionality? I guess that answer to the last question was "no" back 
when DrillBuf and UnsafeDirectLittleEndian were introduced into Drill. 
Is it still "no" for the latest netty release? What extra functionality 
DrillBuf (and UnsafeDirectLittleEndian) provides on top of existing 
netty classes?


As far as I can see from the source code, DrillBuf changes validation 
(boundary and reference count checks) mechanism, making it optional 
(compared to always enabled boundary checks inside netty) for get/set 
Byte/Char/Short/Long/Float/Double. Is this a proper place to make 
validation optional or the validation (or portion of the validation) 
must be always on or off (there are different opinions, see 
https://issues.apache.org/jira/browse/DRILL-6004, 
https://issues.apache.org/jira/browse/DRILL-6202, 
https://github.com/apache/drill/pull/1060 and 
https://github.com/apache/drill/pull/1144)? Are there any performance 
benchmark that justify or explain such behavior (if such benchmark does 
not exist, are there any volunteer to do the benchmark)? My experience 
is that the reference count check is significantly more expensive 
compared to boundary checking and boundary checking adds tens of percent 
to direct memory read when reading just a few bytes, so my vote is to 
keep validation as optional with the ability to enable it for debug 
purposes at run-time. What is the reason the same approach do not apply 
to get/set Bytes and those methods are delegated to 
UnsafeDirectLittleEndian that delegates it further?


Why DrillBuf reverses how AbstractByteBuf calls _get from get (and _set 
from set), making _get to call get (_set to call set)? Why not to follow 
a base class design patter?


Another question is usage of netty "io.netty.buffer" package for Drill 
classes. Is this absolutely necessary? I don't think that netty 
developers expect this and support semantic version compatibility for 
package private classes/members.


Thank you,

Vlad


Re: Drill Java doc

2018-04-04 Thread Kunal Khatua
Looks like it was done only for JDBC with release 1.2 .

https://drill.apache.org/api/1.2/jdbc/


On 4/4/2018 7:23:08 AM, Vlad Rozov  wrote:
Is Drill Java doc hosted on http://drill.apache.org or any other site?

Thank you,

Vlad


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-04 Thread kkhatua
Github user kkhatua commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179159323
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -154,6 +192,7 @@ start_bit ( )
   nohup nice -n $DRILL_NICENESS "$DRILL_HOME/bin/runbit" exec ${args[@]} 
>> "$logout" 2>&1 &
   echo $! > $pid
   sleep 1
+  check_after_start
--- End diff --

😄 
Will fix this, thanks!


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-04 Thread kkhatua
Github user kkhatua commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179158686
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
--- End diff --

Thanks for catching that!


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-04 Thread kkhatua
Github user kkhatua commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179158585
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
--- End diff --

Agreed. I'll incorporate this change.


---


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-04 Thread kkhatua
Github user kkhatua commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179158191
  
--- Diff: distribution/src/resources/drill-env.sh ---
@@ -86,6 +86,12 @@
 
 #export DRILL_PID_DIR=${DRILL_PID_DIR:-$DRILL_HOME}
 
+# CGroup to which the Drillbit belong when running as a daemon using
--- End diff --

:)
I did speculate about enforcement for other resources, but it seems that 
CPU is the primary resource that needs to be managed. Memory 'management' (by 
YARN, for e.g.) otherwise works without the need for CGroup.


---


Drill Java doc

2018-04-04 Thread Vlad Rozov

Is Drill Java doc hosted on http://drill.apache.org or any other site?

Thank you,

Vlad


[GitHub] drill pull request #1200: DRILL-143: Support CGROUPs resource management

2018-04-04 Thread kkhatua
Github user kkhatua commented on a diff in the pull request:

https://github.com/apache/drill/pull/1200#discussion_r179156839
  
--- Diff: distribution/src/resources/drillbit.sh ---
@@ -127,6 +127,44 @@ check_before_start()
   fi
 }
 
+check_after_start(){
+#check if the process is running
+if [ -f $pid ]; then
+  dbitProc=$(ps -ef | grep `cat $pid` | grep Drillbit)
+  if [ -n "$dbitProc" ]; then
+# Check and enforce for CGroup
+if [ -n "$DRILLBIT_CGROUP" ]; then 
+  check_and_enforce_cgroup `cat $pid`
+fi
+  fi
+fi
+}
+
+check_and_enforce_cgroup(){
+dbitPid=$1;
+#if [ $(`ps -o cgroup` | grep -c $DRILLBIT_CGROUP ) -eq 1 ]; then 
+if [ -f /cgroup/cpu/${DRILLBIT_CGROUP}/cgroup.procs ]; then 
--- End diff --

I've looked to check for older versions, but wasn't sure which one aligned 
to which version and OS. It's easy to check if CGroups is running, but I don't 
see a guaranteed way of confirming specifically about that CGroup. 
Let me do some more research on this front.


---