Re: Discuss about Drill's schedule policy

2017-08-23 Thread Paul Rogers
Hi Weijie,

Thanks for the link. I’d seen this project a bit earlier, along with Apollo 
[1]. Sparrow is quite interesting, but is designed to place tasks (processes) 
on available nodes. This is not quite what Drill does: Drill launches multiple 
waves of “fragments” to all nodes across the cluster.

Other systems take the approach of just-in-time scheduling in which a fragment 
starts only when its inputs are available, and terminates (and releases its 
resources) after it has processed its last row. While this may be a very good 
technique for longer-running tasks (something like map/reduce or Hive), it 
introduces too much latency for short-running, interactive queries.

One could argue that Drill needs two levels of scheduling:

1. Schedule queries as a whole.
2. Schedule tasks (“minor fragments”) within queries.

(There is, of course, a third level: scheduling the Drillbits themselves. Let’s 
leave that aside for now.)

The simplest place to start in Drill is to schedule entire queries, where each 
query gets a slice of cluster-wide resources (memory, CPU, etc.) Then, we can 
reuse Drill’s existing mechanism to schedule fragments on nodes.

The next level of refinement is to select the proper level of parallelization 
for a query: a balance between maximizing width, but not overwhelming the 
cluster with too many threads. For truly huge queries (dozens of nested 
subqueries), it might even make sense to introduce a way of sharing threads 
across fragments (something that Hanifi looked into a while back) or staging 
queries so that we don’t try to run all stages simultaneously. These are more 
advanced topics.

A good place to start would be a scheduler, with a model somewhat like YARN’s, 
that selects queries to run when Drill resources are available, and then ensures 
that queries run within those resources.
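
To make the query-level idea concrete, here is a minimal sketch of admission control: each query reserves a slice of a fixed memory budget, and queries that do not fit wait in a FIFO queue until a running query releases its slice. Everything here (the class name, the single memory dimension, the planner-supplied estimate) is an assumption for illustration; none of it is existing Drill code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical query-level admission control; not part of Drill. */
final class QueryAdmissionSketch {

  /** A query plus an (assumed) planner estimate of the memory it needs. */
  static final class Request {
    final String queryId;
    final long memoryMb;

    Request(String queryId, long memoryMb) {
      this.queryId = queryId;
      this.memoryMb = memoryMb;
    }
  }

  private final long totalMemoryMb;
  private long reservedMb;
  private final Queue<Request> waiting = new ArrayDeque<>();

  QueryAdmissionSketch(long totalMemoryMb) {
    this.totalMemoryMb = totalMemoryMb;
  }

  /** Returns true if the query is admitted now, false if it was queued. */
  synchronized boolean submit(Request r) {
    if (r.memoryMb > totalMemoryMb) {
      throw new IllegalArgumentException("query can never fit: " + r.queryId);
    }
    // Only bypass the queue when nobody is waiting, to keep FIFO order.
    if (waiting.isEmpty() && reservedMb + r.memoryMb <= totalMemoryMb) {
      reservedMb += r.memoryMb;
      return true;
    }
    waiting.add(r);
    return false;
  }

  /** Releases a finished query's slice; returns the ids admitted as a result. */
  synchronized List<String> complete(Request r) {
    reservedMb -= r.memoryMb;
    List<String> admitted = new ArrayList<>();
    while (!waiting.isEmpty()
        && reservedMb + waiting.peek().memoryMb <= totalMemoryMb) {
      Request next = waiting.poll();
      reservedMb += next.memoryMb;
      admitted.add(next.queryId);
    }
    return admitted;
  }

  synchronized long reservedMb() {
    return reservedMb;
  }

  public static void main(String[] args) {
    QueryAdmissionSketch scheduler = new QueryAdmissionSketch(100);
    Request a = new Request("a", 60);
    Request b = new Request("b", 50);
    System.out.println(scheduler.submit(a));   // true
    System.out.println(scheduler.submit(b));   // false: 60 + 50 exceeds 100
    System.out.println(scheduler.complete(a)); // [b]
  }
}
```

The sketch predicts load from reservations rather than measuring it; enforcing that a query actually stays within its slice is a separate problem.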

Anyone know of such a scheduler we could borrow to use with Drill? Or maybe we 
could adopt the core of Sparrow (or whatever) with the algorithm needed for 
Drill, to avoid the need to invent yet another new scheduler.

Thanks,

- Paul


[1] 
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-boutin_0.pdf

On Aug 23, 2017, at 7:41 AM, weijie tong wrote:

@paul have you noticed the Sparrow project
(https://github.com/radlab/sparrow) and the related paper mentioned on its
GitHub page? Sparrow is a decentralized, low-latency scheduler. This seems to
meet Drill's demands. I think we can first abstract a scheduler interface, like
what Spark does; then we can have different scheduler implementations
(centralized or decentralized, perhaps with a decentralized one like Sparrow as
the default).

On Mon, Aug 21, 2017 at 11:51 PM, weijie tong wrote:

Thanks for all your suggestions.

@paul your analysis is impressive. I agree with your opinion. The current
queue solution cannot solve this problem perfectly. Our system has a hard
time once the cluster is under high load. I will think about this more
deeply. More ideas or suggestions are welcome in this thread, even small
improvements.


On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers wrote:

Hi Weijie,

Great analysis. Let’s look at a few more data points.

Drill has no central scheduler (this is a feature: it makes the cluster
much easier to manage and has no single point of failure. It was probably
the easiest possible solution while Drill was being built.) Instead of
central control, Drill is based on the assumption of symmetry: all
Drillbits are identical. So, each Foreman, acting independently, should try
to schedule its load in a way that evenly distributes work across nodes in
the cluster. If all Drillbits do the same, then load should be balanced;
there should be no “hot spots.”

Note, for this to work, Drill should either own the cluster, or any other
workload on the cluster should also be evenly distributed.

Drill makes another simplification: that the cluster has infinite
resources (or, equivalently, that the admin sized the cluster for peak
load.) That is, as Sudheesh puts it, “Drill is optimistic.” Therefore, Drill
usually runs with no throttling mechanism to limit overall cluster load. In
real clusters, of course, resources are limited and either a large query
load, or a few large queries, can saturate some or all of the available
resources.

Drill has a feature, seldom used, to throttle queries based purely on
number. These ZK-based queues can allow, say, 5 queries to run (each of
which is assumed to be evenly distributed.) In actual fact, the ZK-based
queues recognize that typical workloads have many small, and a few large,
queries and so Drill offers the “small query” and “large query” queues.
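
As a rough illustration of count-based throttling with two queues, the sketch below routes each query to a “small” or “large” queue by comparing an estimated cost against a threshold, and admits it only if that queue has a free slot. It is an assumption-laden stand-in: a local `java.util.concurrent.Semaphore` takes the place of the ZK-based counter, and the class name, slot counts and threshold are made up for the example.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical two-queue throttle modeled on the description above; not Drill's code. */
final class TwoQueueThrottleSketch {
  private final Semaphore small;
  private final Semaphore large;
  private final double costThreshold;

  TwoQueueThrottleSketch(int smallSlots, int largeSlots, double costThreshold) {
    this.small = new Semaphore(smallSlots);
    this.large = new Semaphore(largeSlots);
    this.costThreshold = costThreshold;
  }

  /** Picks the queue from the (assumed) planner cost estimate. */
  private Semaphore queueFor(double estimatedCost) {
    return estimatedCost > costThreshold ? large : small;
  }

  /** Tries to start a query; false means it would have to wait. */
  boolean tryStart(double estimatedCost) {
    return queueFor(estimatedCost).tryAcquire();
  }

  /** Frees the slot held by a finished query of the given cost. */
  void finish(double estimatedCost) {
    queueFor(estimatedCost).release();
  }

  public static void main(String[] args) {
    // Two slots for small queries, one for large, threshold of 1,000,000 cost units.
    TwoQueueThrottleSketch throttle = new TwoQueueThrottleSketch(2, 1, 1_000_000);
    System.out.println(throttle.tryStart(10));        // true
    System.out.println(throttle.tryStart(20));        // true
    System.out.println(throttle.tryStart(30));        // false: small queue is full
    System.out.println(throttle.tryStart(5_000_000)); // true
    System.out.println(throttle.tryStart(6_000_000)); // false: large queue is full
    throttle.finish(5_000_000);
    System.out.println(throttle.tryStart(6_000_000)); // true
  }
}
```

Note that the throttle counts queries only; it knows nothing about how much memory or CPU each admitted query will actually use.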

OK, so that’s where we are today. I think I’m not stepping too far out of
line to observe that the above model is just a bit naive. It does not take
into consideration the available 

[jira] [Created] (DRILL-5739) Query reads all files after issuing REFRESH TABLE METADATA command

2017-08-23 Thread Divya (JIRA)
Divya created DRILL-5739:


 Summary: Query reads all files after issuing REFRESH TABLE 
METADATA command 
 Key: DRILL-5739
 URL: https://issues.apache.org/jira/browse/DRILL-5739
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Divya


Hi,
The query takes a lot of time after issuing the REFRESH TABLE METADATA command, 
as it reads all the files.

||Value||Before Refresh Metadata||After Refresh Metadata||
|Fragments|1|13|
|DURATION|01 min 0.233 sec|18 min 0.744 sec|
|PLANNING|59.818 sec|33.087 sec|
|QUEUED|Not Available|Not Available|
|EXECUTION|0.415 sec|17 min 27.657 sec|

I can't paste the whole physical plan for the query. 
Pasting the relevant part:
*Physical Plan Before Refresh Metadata*
numFiles=4, usedMetadataFile=false, columns=

rowcount = 12.0, cumulative cost = {12.0 rows, 780.0 cpu, 0.0 io, 0.0 network, 
0.0 memory}, id = 9395

*Physical Plan After Refresh Metadata*
numFiles=102290, usedMetadataFile=true, cacheFileRoot=

rowcount = 1182008.0, cumulative cost = {1182008.0 rows, 7.683052E7 cpu, 0.0 
io, 0.0 network, 0.0 memory}, id = 9685


*Additional Info :*
file format - Parquet
table - partitioned by year,month,day,hour 
query format - selecting all the columns, using the partition columns in the 
WHERE conditions

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DRILL-5738) Drill query takes 10+ minutes before start executing, excessive Hive metastore queries

2017-08-23 Thread kevin zou (JIRA)
kevin zou created DRILL-5738:


 Summary: Drill query takes 10+ minutes before start executing, 
excessive Hive metastore queries
 Key: DRILL-5738
 URL: https://issues.apache.org/jira/browse/DRILL-5738
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Affects Versions: 1.6.0
 Environment: mapr 5.2
Reporter: kevin zou
Priority: Critical


I have a Drill query on 14 tables in Hive. The query took a few seconds to 
execute. However, the query would stay in the "Starting" state for 10+ minutes 
before execution. 
 
I set the log level to DEBUG to figure out what Drill had been doing during 
the 10+ minutes, only to find that Drill generated an excessive number of 
metadata queries to the Hive metastore.  
 
Although each query took a few microseconds (metadata cached in memory), the 
number of queries was 3438793.
drillbit.log:2017-06-05 18:50:57,201 [26ca5bda-5e87-475a-cd93-17c6957cc3ee:foreman] DEBUG o.a.d.e.s.hive.HiveMetadataProvider - Took 4 µs to get stats from idm_intel_1x.lu_jde_emp_directory
drillbit.log:2017-06-05 18:50:57,201 [26ca5bda-5e87-475a-cd93-17c6957cc3ee:foreman] DEBUG o.a.drill.exec.store.hive.HiveScan - HiveStats: numRows: 15, sizeInBytes: 15






[GitHub] drill issue #920: DRILL-5737: Hash Agg uses more than the allocated memory u...

2017-08-23 Thread sohami
Github user sohami commented on the issue:

https://github.com/apache/drill/pull/920
  
@Ben-Zvi - Please help to review this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134778939
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/expr/TestSchemaPathMaterialization.java
 ---
@@ -93,4 +93,23 @@ public void testProjectionMultipleFiles() throws 
Exception {
   .go();
   }
 
+  @Test //DRILL-4264
+  public void testFieldNameWithDot() throws Exception {
--- End diff --

Added more tests




[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134788919
  
--- Diff: 
exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/TupleAccessor.java
 ---
@@ -48,9 +48,21 @@
 
 MaterializedField column(int index);
 
-MaterializedField column(String name);
+/**
+ * Returns {@code MaterializedField} instance from schema using the 
name path specified in param.
+ *
+ * @param name full name path of the column in the schema
+ * @return {@code MaterializedField} instance
+ */
+MaterializedField column(String[] name);
--- End diff --

Thanks, reverted my changes.




[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134539810
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java 
---
@@ -359,30 +361,109 @@ public Mutator(OperatorExecContext oContext, 
BufferAllocator allocator, VectorCo
 public  T addField(MaterializedField field,
   Class clazz) throws 
SchemaChangeException {
   // Check if the field exists.
-  ValueVector v = fieldVectorMap.get(field.getPath());
-  if (v == null || v.getClass() != clazz) {
+  ValueVector vector = fieldVectorMap.get(field.getName());
+  ValueVector childVector = vector;
+  // if a vector does not exist yet, creates value vector, or if it 
exists and has map type, omit this code
+  if (vector == null || (vector.getClass() != clazz
+&& (vector.getField().getType().getMinorType() != MinorType.MAP
+|| field.getType().getMinorType() != MinorType.MAP))) {
 // Field does not exist--add it to the map and the output 
container.
-v = TypeHelper.getNewVector(field, allocator, callBack);
-if (!clazz.isAssignableFrom(v.getClass())) {
+vector = TypeHelper.getNewVector(field, allocator, callBack);
+childVector = vector;
+// gets inner field if the map was created the first time
+if (field.getType().getMinorType() == MinorType.MAP) {
+  childVector = getChildVectorByField(vector, field);
+} else if (!clazz.isAssignableFrom(vector.getClass())) {
   throw new SchemaChangeException(
   String.format(
   "The class that was provided, %s, does not correspond to 
the "
   + "expected vector type of %s.",
-  clazz.getSimpleName(), v.getClass().getSimpleName()));
+  clazz.getSimpleName(), 
vector.getClass().getSimpleName()));
 }
 
-final ValueVector old = fieldVectorMap.put(field.getPath(), v);
+final ValueVector old = fieldVectorMap.put(field.getName(), 
vector);
 if (old != null) {
   old.clear();
   container.remove(old);
 }
 
-container.add(v);
+container.add(vector);
 // Added new vectors to the container--mark that the schema has 
changed.
 schemaChanged = true;
   }
+  // otherwise, checks that field and existing vector have a map type
--- End diff --

I was suggesting that the work here could be done on the nested fields 
through the map.
I agree with you that it would be correct to deal with the desired field. 
So thanks for pointing this out; I reverted the changes in this method.




[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134686740
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java 
---
@@ -359,30 +361,109 @@ public Mutator(OperatorExecContext oContext, 
BufferAllocator allocator, VectorCo
 public  T addField(MaterializedField field,
   Class clazz) throws 
SchemaChangeException {
   // Check if the field exists.
-  ValueVector v = fieldVectorMap.get(field.getPath());
-  if (v == null || v.getClass() != clazz) {
+  ValueVector vector = fieldVectorMap.get(field.getName());
+  ValueVector childVector = vector;
+  // if a vector does not exist yet, creates value vector, or if it 
exists and has map type, omit this code
+  if (vector == null || (vector.getClass() != clazz
+&& (vector.getField().getType().getMinorType() != MinorType.MAP
+|| field.getType().getMinorType() != MinorType.MAP))) {
 // Field does not exist--add it to the map and the output 
container.
-v = TypeHelper.getNewVector(field, allocator, callBack);
-if (!clazz.isAssignableFrom(v.getClass())) {
+vector = TypeHelper.getNewVector(field, allocator, callBack);
+childVector = vector;
+// gets inner field if the map was created the first time
+if (field.getType().getMinorType() == MinorType.MAP) {
+  childVector = getChildVectorByField(vector, field);
+} else if (!clazz.isAssignableFrom(vector.getClass())) {
   throw new SchemaChangeException(
   String.format(
   "The class that was provided, %s, does not correspond to 
the "
   + "expected vector type of %s.",
-  clazz.getSimpleName(), v.getClass().getSimpleName()));
+  clazz.getSimpleName(), 
vector.getClass().getSimpleName()));
 }
 
-final ValueVector old = fieldVectorMap.put(field.getPath(), v);
+final ValueVector old = fieldVectorMap.put(field.getName(), 
vector);
 if (old != null) {
   old.clear();
   container.remove(old);
 }
 
-container.add(v);
+container.add(vector);
 // Added new vectors to the container--mark that the schema has 
changed.
 schemaChanged = true;
   }
+  // otherwise, checks that field and existing vector have a map type
+  // and adds child fields from the field to the vector
+  else if (field.getType().getMinorType() == MinorType.MAP
+  && vector.getField().getType().getMinorType() == 
MinorType.MAP
+  && !field.getChildren().isEmpty()) {
+// an incoming field contains only single child since it determines
+// full name path of the field in the schema
+childVector = addNestedChildToMap((MapVector) vector, 
Iterables.getLast(field.getChildren()));
+schemaChanged = true;
+  }
 
-  return clazz.cast(v);
+  return clazz.cast(childVector);
+}
+
+/**
+ * Finds and returns value vector which path corresponds to the 
specified field.
+ * If required vector is nested in the map, gets and returns this 
vector from the map.
+ *
+ * @param valueVector vector that should be checked
+ * @param field   field that corresponds to required vector
+ * @return value vector whose path corresponds to the specified field
+ *
+ * @throws SchemaChangeException if the field does not correspond to 
the found vector
+ */
+private ValueVector getChildVectorByField(ValueVector valueVector,
+  MaterializedField field) 
throws SchemaChangeException {
+  if (field.getChildren().isEmpty()) {
+if (valueVector.getField().equals(field)) {
+  return valueVector;
+} else {
+  throw new SchemaChangeException(
+String.format(
+  "The field that was provided, %s, does not correspond to the 
"
++ "expected vector type of %s.",
+  field, valueVector.getClass().getSimpleName()));
+}
+  } else {
+// an incoming field contains only single child since it determines
+// full name path of the field in the schema
+MaterializedField childField = 
Iterables.getLast(field.getChildren());
+return getChildVectorByField(((MapVector) 
valueVector).getChild(childField.getName()), childField);
+  }
+}
+
+/**
+ * Adds new vector with the specified field to the 

[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134688368
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java
 ---
@@ -293,7 +294,7 @@ private StreamingAggregator createAggregatorInternal() 
throws SchemaChangeExcept
 continue;
   }
   keyExprs[i] = expr;
-  final MaterializedField outputField = 
MaterializedField.create(ne.getRef().getAsUnescapedPath(), expr.getMajorType());
+  final MaterializedField outputField = 
SchemaPathUtil.getMaterializedFieldFromSchemaPath(ne.getRef(), 
expr.getMajorType());
--- End diff --

Yes, it should. But the `MaterializedField` class is in the `vector` module, 
the `SchemaPath` class is in the `drill-logical` module, and the `vector` module 
does not depend on the `drill-logical` module.
I replaced this code with 
`MaterializedField.create(ne.getRef().getLastSegment().getNameSegment().getPath(),
 expr.getMajorType());` since a simple name path is used here.




[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134764401
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/SchemaPathUtil.java
 ---
@@ -0,0 +1,59 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to you under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+* http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+package org.apache.drill.exec.vector.complex;
+
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.record.MaterializedField;
+
+public class SchemaPathUtil {
--- End diff --

Removed this class: all the code that used it now uses a simple name path, 
so it is no longer needed.




[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134782398
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/record/TestMaterializedField.java
 ---
@@ -84,4 +89,22 @@ public void testClone() {
 
   }
 
+  @Test // DRILL-4264
+  public void testSchemaPathToMaterializedFieldConverting() {
--- End diff --

This test was designed to check the 
`SchemaPathUtil.getMaterializedFieldFromSchemaPath()` method. Since this method 
was removed, I removed this test.




[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134787569
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/rowSet/RowSetSchema.java ---
@@ -83,7 +85,13 @@ private void updateStructure(int index, PhysicalSchema 
children) {
 public boolean isMap() { return mapSchema != null; }
 public PhysicalSchema mapSchema() { return mapSchema; }
 public MaterializedField field() { return field; }
-public String fullName() { return fullName; }
+
+/**
+ * Returns full name path of the column.
--- End diff --

I reverted these changes. Also, I commented out the test where this code is 
used with the map fields. 




[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134540459
  
--- Diff: 
contrib/storage-hbase/src/main/java/org/apache/drill/exec/store/hbase/CompareFunctionsProcessor.java
 ---
@@ -147,10 +147,10 @@ public Boolean visitCastExpression(CastExpression e, 
LogicalExpression valueArg)
 
   @Override
   public Boolean visitConvertExpression(ConvertExpression e, 
LogicalExpression valueArg) throws RuntimeException {
-if (e.getConvertFunction() == ConvertExpression.CONVERT_FROM) {
+if (ConvertExpression.CONVERT_FROM.equals(e.getConvertFunction())) {
--- End diff --

Since these two classes are almost the same, I moved the common code to an 
abstract class.




[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134788014
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/rowSet/RowSetSchema.java ---
@@ -94,12 +102,20 @@ private void updateStructure(int index, PhysicalSchema 
children) {
*/
 
   public static class NameSpace {
-private final Map nameSpace = new HashMap<>();
+private final Map nameSpace = new HashMap<>();
 private final List columns = new ArrayList<>();
 
-public int add(String key, T value) {
+/**
+ * Adds column path with specified value to the columns list
+ * and returns the index of the column in the list.
+ *
+ * @param key   full name path of the column in the schema
+ * @param value value to be added to the list
+ * @return index of the column in the list
+ */
+public int add(String[] key, T value) {
   int index = columns.size();
-  nameSpace.put(key, index);
+  nameSpace.put(SchemaPath.getCompoundPath(key).toExpr(), index);
--- End diff --

Thanks for the explanation, reverted my changes.




[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134795741
  
--- Diff: 
logical/src/main/java/org/apache/drill/common/expression/SchemaPath.java ---
@@ -115,6 +112,33 @@ public static SchemaPath create(NamePart namePart) {
   }
 
   /**
+   * Parses input string and returns {@code SchemaPath} instance.
+   *
+   * @param expr input string to be parsed
+   * @return {@code SchemaPath} instance
+   */
+  public static SchemaPath parseFromString(String expr) {
--- End diff --

It parses a string using the same rules that are used for a field in the 
query. If the string contains a dot outside backticks, or there are no backticks 
in the string, a `SchemaPath` is created with a `NameSegment` that contains 
another `NameSegment`, and so on. If the string contains [], an `ArraySegment` will 
be created.
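
The dot rule can be illustrated with a small standalone splitter: a dot inside backticks stays part of the name, while a dot outside backticks starts a new nested segment. This is only a sketch of the rule as described in this comment; `DottedPathSketch` is a hypothetical helper, it is not the parser used by `SchemaPath.parseFromString`, and it ignores `[]` array segments entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative splitter only; Drill's real parser is not reproduced here. */
final class DottedPathSketch {

  /** Splits on dots outside backticks; backticks are dropped from the result. */
  static List<String> split(String expr) {
    List<String> parts = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean quoted = false;
    for (char c : expr.toCharArray()) {
      if (c == '`') {
        quoted = !quoted;              // toggle quoted mode, drop the backtick
      } else if (c == '.' && !quoted) {
        parts.add(current.toString()); // unquoted dot ends the current segment
        current.setLength(0);
      } else {
        current.append(c);             // quoted dots land here and are kept
      }
    }
    if (quoted) {
      throw new IllegalArgumentException("unbalanced backtick in: " + expr);
    }
    parts.add(current.toString());
    return parts;
  }

  public static void main(String[] args) {
    System.out.println(split("a.b"));     // [a, b]
    System.out.println(split("`a.b`"));   // [a.b]
    System.out.println(split("`a.b`.c")); // [a.b, c]
  }
}
```

Each element of the returned list corresponds to one name segment in the nesting described above.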




[GitHub] drill pull request #909: DRILL-4264: Allow field names to include dots

2017-08-23 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/909#discussion_r134689741
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java
 ---
@@ -362,16 +363,16 @@ protected boolean setupNewSchema() throws 
SchemaChangeException {
   final TransferPair tp = vvIn.makeTransferPair(vvOut);
   transfers.add(tp);
 }
-  } else if (value != null && value.intValue() > 1) { // 
subsequent wildcards should do a copy of incoming valuevectors
+  } else if (value != null && value > 1) { // subsequent wildcards 
should do a copy of incoming valuevectors
 int k = 0;
 for (final VectorWrapper wrapper : incoming) {
   final ValueVector vvIn = wrapper.getValueVector();
-  final SchemaPath originalPath = 
SchemaPath.getSimplePath(vvIn.getField().getPath());
-  if (k > result.outputNames.size()-1) {
+  final SchemaPath originalPath = 
SchemaPath.getSimplePath(vvIn.getField().getName());
+  if (k > result.outputNames.size() - 1) {
 assert false;
   }
   final String name = result.outputNames.get(k++);  // get the 
renamed column names
-  if (name == EMPTY_STRING) {
+  if (EMPTY_STRING.equals(name)) {
--- End diff --

Thanks, replaced by `name.isEmpty()`, but `EMPTY_STRING` is used in other 
places, so left it in the class.
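
For context on why the `name == EMPTY_STRING` check was worth changing: in Java, `==` on strings compares object references, so it is only true when both sides are the very same object. The standalone sketch below shows the three variants side by side; the class and its `EMPTY_STRING` constant are stand-ins written for this example, not the ones in `ProjectRecordBatch`.

```java
/** Standalone demonstration of reference vs. value comparison for strings. */
final class StringCompareSketch {
  static final String EMPTY_STRING = "";

  /** Reference comparison, as in the original `name == EMPTY_STRING`. */
  static boolean sameReference(String a, String b) {
    return a == b;
  }

  public static void main(String[] args) {
    // `new` always creates a distinct object, even for equal contents.
    String name = new String("");
    System.out.println(sameReference(name, EMPTY_STRING)); // false
    System.out.println(EMPTY_STRING.equals(name));         // true
    System.out.println(name.isEmpty());                    // true
  }
}
```

One difference between the two value-based forms: `EMPTY_STRING.equals(name)` returns false when `name` is null, while `name.isEmpty()` throws a `NullPointerException`.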




[GitHub] drill issue #910: DRILL-5726: Support Impersonation without authentication f...

2017-08-23 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/910
  
@sohami PR is updated, information about new approach is added in Jira.
Please review when possible.




Re: Discuss about Drill's schedule policy

2017-08-23 Thread weijie tong
@paul have you noticed the Sparrow project
(https://github.com/radlab/sparrow) and the related paper mentioned on its
GitHub page? Sparrow is a decentralized, low-latency scheduler. This seems to
meet Drill's demands. I think we can first abstract a scheduler interface, like
what Spark does; then we can have different scheduler implementations
(centralized or decentralized, perhaps with a decentralized one like Sparrow as
the default).

On Mon, Aug 21, 2017 at 11:51 PM, weijie tong 
wrote:

> Thanks for all your suggestions.
>
> @paul your analysis is impressive. I agree with your opinion. The current
> queue solution cannot solve this problem perfectly. Our system has a hard
> time once the cluster is under high load. I will think about this more
> deeply. More ideas or suggestions are welcome in this thread, even small
> improvements.
>
>
> On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers  wrote:
>
>> Hi Weijie,
>>
>> Great analysis. Let’s look at a few more data points.
>>
>> Drill has no central scheduler (this is a feature: it makes the cluster
>> much easier to manage and has no single point of failure. It was probably
>> the easiest possible solution while Drill was being built.) Instead of
>> central control, Drill is based on the assumption of symmetry: all
>> Drillbits are identical. So, each Foreman, acting independently, should try
>> to schedule its load in a way that evenly distributes work across nodes in
>> the cluster. If all Drillbits do the same, then load should be balanced;
>> there should be no “hot spots.”
>>
>> Note, for this to work, Drill should either own the cluster, or any other
>> workload on the cluster should also be evenly distributed.
>>
>> Drill makes another simplification: that the cluster has infinite
>> resources (or, equivalently, that the admin sized the cluster for peak
>> load.) That is, as Sudheesh puts it, “Drill is optimistic.” Therefore, Drill
>> usually runs with no throttling mechanism to limit overall cluster load. In
>> real clusters, of course, resources are limited and either a large query
>> load, or a few large queries, can saturate some or all of the available
>> resources.
>>
>> Drill has a feature, seldom used, to throttle queries based purely on
>> number. These ZK-based queues can allow, say, 5 queries to run (each of
>> which is assumed to be evenly distributed.) In actual fact, the ZK-based
>> queues recognize that typical workloads have many small, and a few large,
>> queries and so Drill offers the “small query” and “large query” queues.
>>
>> OK, so that’s where we are today. I think I’m not stepping too far out of
>> line to observe that the above model is just a bit naive. It does not take
>> into consideration the available cores, memory or disk I/Os. It does not
>> consider the fact that memory has a hard upper limit and must be managed.
>> Drill creates one thread for each minor fragment limited by the number of
>> cores. But, each query can contain dozens or more fragments, resulting in
>> far, far more threads per query than a node has cores. That is, Drill’s
>> current scheduling model does not consider that, above a certain level,
>> adding more threads makes the system slower because of thrashing.
>>
>> You propose a closed-loop, reactive control system (schedule load based
>> on observed load on each Drillbit.) However, control-system theory tells us
>> that such a system is subject to oscillation. All Foremen observe that a
>> node X is loaded so none send it work. Node X later finishes its work and
>> becomes underloaded. All Foremen now prefer node X and it swings back to
>> being overloaded. In fact, Impala tried a closed-loop design and there is
>> some evidence in their documentation that they hit these very problems.
>>
>> So, what else could we do? As we’ve wrestled with these issues, we’ve
>> come to the understanding that we need an open-loop, predictive solution.
>> That is a fancy name for what YARN or Mesos does: keep track of available
>> resources, reserve them for a task, and monitor the task so that it stays
>> within the resource allocation. Predict load via allocation rather than
>> reacting to actual load.
>>
>> In Drill, that might mean a scheduler which looks at all incoming queries
>> and assigns cluster resources to each; queueing the query if necessary
>> until resources become available. It also means that queries must live
>> within their resource allocation. (The planner can help by predicting the
>> likely needed resources. Then, at run time, spill-to-disk and other
>> mechanisms allow queries to honor the resource limits.)
>>
>> The scheduler-based design is nothing new: it seems to be what Impala
>> settled on, it is what YARN does for batch jobs, and it is a common pattern
>> in other query engines.
>>
>> Back to the RPC issue. With proper scheduling, we limit load on each
>> Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is,
>> rather than 

[GitHub] drill issue #889: DRILL-5691: enhance scalar sub queries checking for the ca...

2017-08-23 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/889
  
+1, LGTM.




[GitHub] drill issue #889: DRILL-5691: enhance scalar sub queries checking for the ca...

2017-08-23 Thread weijietong
Github user weijietong commented on the issue:

https://github.com/apache/drill/pull/889
  
@arina-ielchiieva thanks for the advice; I have corrected that.

