Re: [Discuss] Integrate Arrow gandiva into Drill

2019-04-18 Thread Paul Rogers
Hi Weijie, Thanks much for the update on your Gandiva work. It is great work. Can you say more about how you are doing the integration? As you mentioned the memory layout of Arrow's null vector differs from the "is set" vector in Drill. How did you work around that? The Project operator is

Re: Query Question

2019-04-12 Thread Paul Rogers
On Thu, Apr 11, 2019 at 6:37 AM Charles Givre wrote: > > > That’s a good idea.  I’ll work on a equivalent ZIP() function and submit > > as a separate PR. > > — C > > > > > On Apr 10, 2019, at 20:44, Paul Rogers > > wrote: > > > > > > Hi Charles,

Re: Query Question

2019-04-10 Thread Paul Rogers
Hi Charles, In Python [1], the "zip" function does this task: zip([1, 2, 3], [4, 5, 6]) --> [(1, 4), (2, 5), (3, 6)] When you gathered the list of functions for the Drill book, did you come across anything like this in Drill? I presume you didn't, hence the question. I did a quick
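As a quick check of the Python behavior cited above: in Python 3, zip() returns a lazy iterator rather than a list, so the call needs to be wrapped in list() to produce the output shown. A minimal sketch (the variable names are illustrative only, and nothing here reflects any Drill function):

```python
# Pair up the elements of two lists, position by position.
left = [1, 2, 3]
right = [4, 5, 6]

# In Python 3, zip() is lazy; list() materializes the pairs.
pairs = list(zip(left, right))
print(pairs)  # [(1, 4), (2, 5), (3, 6)]
```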

Re: [ANNOUNCE] New PMC member: Sorabh Hamirwasia

2019-04-05 Thread Paul Rogers
Congratulations Sorabh, well deserved! - Paul On Friday, April 5, 2019, 9:06:37 AM PDT, Arina Ielchiieva wrote: I am pleased to announce that Drill PMC invited Sorabh Hamirwasia to the PMC and he has accepted the invitation. Congratulations Sorabh and welcome! - Arina (on behalf

Re: [DISCUSS]: Hadoop 3

2019-04-03 Thread Paul Rogers
Hi All, Note that Hive 3 has introduced Hive ACID: an innovative way to handle transactional data on a traditional big data warehouse. Some distros appear to be talking about enabling ACID by default for all Hive-managed tables. In order for Drill to continue to work with such tables, Drill

Re: [DISCUSS]: Additional Formats for Drill

2019-04-02 Thread Paul Rogers
Hi All, Daffodil is an interesting project as is the DFDLSchemas project. Thanks for sharing! An interesting challenge is how these libraries load data: what is their internal format, or what API do they use for the application to consume data? Found this for Daffodil, it will "parse data

Re: [apache|drill] What is the Memory per Large Query?

2019-04-02 Thread Paul Rogers
Hi, The queue documentation can be a bit hard to find, but it is available at [1]. However, it appears that either a) this information is out of date, or b) the feature has changed. About 18 months ago we added additional options to make it easier to tune the queues, but that information is

[jira] [Created] (DRILL-7143) Enforce column-level constraints when using a schema

2019-03-30 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7143: -- Summary: Enforce column-level constraints when using a schema Key: DRILL-7143 URL: https://issues.apache.org/jira/browse/DRILL-7143 Project: Apache Drill Issue

[jira] [Created] (DRILL-7086) Enhance row-set scan framework for to use external schema

2019-03-09 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7086: -- Summary: Enhance row-set scan framework for to use external schema Key: DRILL-7086 URL: https://issues.apache.org/jira/browse/DRILL-7086 Project: Apache Drill

[jira] [Resolved] (DRILL-5954) ListVector shadows "offsets" from BaseRepeatedValueVector

2019-03-09 Thread Paul Rogers (JIRA)
[ https://issues.apache.org/jira/browse/DRILL-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers resolved DRILL-5954. Resolution: Fixed Fixed in a prior commit. > ListVector shadows "offse

[jira] [Created] (DRILL-7083) Wrong data type for explicit partition column beyond file depth

2019-03-06 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7083: -- Summary: Wrong data type for explicit partition column beyond file depth Key: DRILL-7083 URL: https://issues.apache.org/jira/browse/DRILL-7083 Project: Apache Drill

[jira] [Created] (DRILL-7082) Inconsistent results with implicit partition columns, multi scans

2019-03-06 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7082: -- Summary: Inconsistent results with implicit partition columns, multi scans Key: DRILL-7082 URL: https://issues.apache.org/jira/browse/DRILL-7082 Project: Apache Drill

[jira] [Created] (DRILL-7080) Inconsistent behavior with wildcard and partition columns

2019-03-06 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7080: -- Summary: Inconsistent behavior with wildcard and partition columns Key: DRILL-7080 URL: https://issues.apache.org/jira/browse/DRILL-7080 Project: Apache Drill

Re: [DISCUSS]: Git instructions

2019-03-03 Thread Paul Rogers
Hi Charles, As someone who struggled through learning these topics over the last few years, I'd point out that there is no right way to do this stuff. You can use the Git command line tools. You can use a UI. You can keep branches locally, or publish everything to GitHub. As Parth wisely noted

[jira] [Created] (DRILL-7074) Fixes and improvements to the scan framework for CSV

2019-03-03 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7074: -- Summary: Fixes and improvements to the scan framework for CSV Key: DRILL-7074 URL: https://issues.apache.org/jira/browse/DRILL-7074 Project: Apache Drill Issue

[jira] [Resolved] (DRILL-5265) External Sort consumes more memory than allocated

2019-02-25 Thread Paul Rogers (JIRA)
[ https://issues.apache.org/jira/browse/DRILL-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers resolved DRILL-5265. Resolution: Fixed > External Sort consumes more memory than alloca

[jira] [Resolved] (DRILL-5805) External Sort runs out of memory

2019-02-25 Thread Paul Rogers (JIRA)
[ https://issues.apache.org/jira/browse/DRILL-5805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers resolved DRILL-5805. Resolution: Fixed > External Sort runs out of mem

[jira] [Created] (DRILL-7055) Project operator cannot handle wildcard + implicit cols

2019-02-24 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7055: -- Summary: Project operator cannot handle wildcard + implicit cols Key: DRILL-7055 URL: https://issues.apache.org/jira/browse/DRILL-7055 Project: Apache Drill

[jira] [Created] (DRILL-7053) Benign, but unexpected, failure in CsvTest

2019-02-23 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7053: -- Summary: Benign, but unexpected, failure in CsvTest Key: DRILL-7053 URL: https://issues.apache.org/jira/browse/DRILL-7053 Project: Apache Drill Issue Type: Bug

Re: [DISCUSS] Format plugins in contrib module

2019-02-06 Thread Paul Rogers
+1 Moving forward, we'd like to evolve the format plugin API to use the new scan framework based on the result set loader. Doing so will abstract away all the vector-twiddling headaches that several people have had fun with over the last couple of years. The framework will enable integration

Re: Problem of using ListVector for representing Hive arrays

2019-02-06 Thread Paul Rogers
Hi Igor, Hive complex type integration will be a valuable addition to Drill. You mentioned running into issues with List vector. I believe you will find that you'll encounter four separate issues. First, the List vector is "experimental": the core functionality exists, but there are holes.

[jira] [Created] (DRILL-7024) Refactor ColumnWriter to simplify type-conversion shim

2019-02-01 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7024: -- Summary: Refactor ColumnWriter to simplify type-conversion shim Key: DRILL-7024 URL: https://issues.apache.org/jira/browse/DRILL-7024 Project: Apache Drill

Re: "Crude-but-effective" Arrow integration

2019-01-30 Thread Paul Rogers
this is well isolated and not hard if you take it step-by-step. That's why it seemed a good Summer of Code project for an enterprising student interested in networking and data munging. Thanks, - Paul [1] https://github.com/paul-rogers/drill-jig On Wednesday, January 30, 2019, 10:18:47 AM PST

Re: "Crude-but-effective" Arrow integration

2019-01-29 Thread Paul Rogers
/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn Aman On Tue, Jan 29, 2019 at 12:08 AM Paul Rogers wrote: > Hi Charles, > I didn't see anything on this on the public mailing list. Haven't seen any > commits related to it either. My guess is that this kind of interface is > not important for the kind of data warehou

Re: "Crude-but-effective" Arrow integration

2019-01-29 Thread Paul Rogers
018, at 13:51, Paul Rogers wrote: > > Hi Ted, > > We may be confusing two very different ideas. The one is a Drill-to-Arrow > adapter on Drill's periphery, this is the "crude-but-effective" integration > suggestion. On the periphery we are not changing existing code,

[jira] [Created] (DRILL-7007) Revise row-set based tests to use simplified verify method

2019-01-27 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7007: -- Summary: Revise row-set based tests to use simplified verify method Key: DRILL-7007 URL: https://issues.apache.org/jira/browse/DRILL-7007 Project: Apache Drill

[jira] [Created] (DRILL-7006) Support type conversion shims in RowSetWriter

2019-01-26 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7006: -- Summary: Support type conversion shims in RowSetWriter Key: DRILL-7006 URL: https://issues.apache.org/jira/browse/DRILL-7006 Project: Apache Drill Issue Type

Re: Regression? Drill Truncating Varchars

2019-01-26 Thread Paul Rogers
Hi Charles, A managed buffer is just a DrillBuf that the execution framework will free for you when the query fragment shuts down. However, nothing can determine when you write past the end of the buffer and automatically resize it. You still must do the reallocation yourself. You probably

Good DB theory references

2019-01-21 Thread Paul Rogers
Hi All, Wanted to pass along some good foundational material about databases. We find ourselves immersed day-to-day in the details of Drill's implementation. It is helpful to occasionally step back and look at the larger DB tradition in which Drill resides. This material is especially good for

Re: Beginner Jira Bugs

2019-01-18 Thread Paul Rogers
: https://github.com/apache/drill/tree/master/docs/dev https://github.com/paul-rogers/drill/wiki Kind regards Vitalii On Fri, Jan 18, 2019 at 12:31 PM srungarapu vamsi wrote: > Hi, > > I find Drill Apache project interesting and i want to contribute to the > project. I have cloned the

[jira] [Created] (DRILL-6953) Merge row set-based JSON reader

2019-01-07 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6953: -- Summary: Merge row set-based JSON reader Key: DRILL-6953 URL: https://issues.apache.org/jira/browse/DRILL-6953 Project: Apache Drill Issue Type: Improvement

[jira] [Created] (DRILL-6951) Merge row set based mock data source

2019-01-07 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6951: -- Summary: Merge row set based mock data source Key: DRILL-6951 URL: https://issues.apache.org/jira/browse/DRILL-6951 Project: Apache Drill Issue Type

[jira] [Created] (DRILL-6952) Merge row set based "compliant" text reader

2019-01-07 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6952: -- Summary: Merge row set based "compliant" text reader Key: DRILL-6952 URL: https://issues.apache.org/jira/browse/DRILL-6952 Project: Apache Drill

[jira] [Created] (DRILL-6950) Pull request for row set-based scan framework

2019-01-07 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6950: -- Summary: Pull request for row set-based scan framework Key: DRILL-6950 URL: https://issues.apache.org/jira/browse/DRILL-6950 Project: Apache Drill Issue Type

Re: Drill on YARN Questions

2018-12-17 Thread Paul Rogers
Hi Charles, I'm not quite sure what "dynamic queue allocation" means: all YARN containers are allocated dynamically through YARN via queues.  It may be helpful to review how Drill-on-YARN (DoY) works. DoY does NOT attempt to use YARN for each query. Impala tried that with Llama and discovered

[jira] [Created] (DRILL-6901) Move SchemaBuilder from test to main for use outside tests

2018-12-12 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6901: -- Summary: Move SchemaBuilder from test to main for use outside tests Key: DRILL-6901 URL: https://issues.apache.org/jira/browse/DRILL-6901 Project: Apache Drill

Re: Is there any instructions for new developers for drill

2018-12-07 Thread Paul Rogers
Thanks! Glad you found the book useful. - Paul On Friday, December 7, 2018, 8:00:29 PM PST, 王亮 wrote: Thanks, I have one "Learning Apache Drill:Query and Analyze Distributed Data Sources with SQL" , wonderful book: )

Re: [ANNOUNCE] New Committer: Karthikeyan Manivannan

2018-12-07 Thread Paul Rogers
Congrats Karthik! - Paul Sent from my iPhone > On Dec 7, 2018, at 11:12 AM, Abhishek Girish wrote: > > Congratulations Karthik! > >> On Fri, Dec 7, 2018 at 11:11 AM Arina Ielchiieva wrote: >> >> The Project Management Committee (PMC) for Apache Drill has invited >> Karthikeyan >>

Re: Unit Test Question

2018-11-11 Thread Paul Rogers
java:89) Results : Tests in error:   TestSyslogFormat>ClusterTest.shutdown:89 » Runtime Exception while closing > On Nov 9, 2018, at 16:09, Paul Rogers wrote: > > Hi Charles, > > Thanks for the PR. Two suggestions for your test. First, use TupleSchema: > > TupleSchema sch

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-10 Thread Paul Rogers
Hi Gautam, You touched on the key issue: storage. You mention that the Drill stats implementation learned from Oracle. Very wise: Oracle is the clear expert in this space. There is a very important difference, however, between Drill and Oracle. Oracle is a complete database including both

Re: msgpack pull request

2018-11-09 Thread Paul Rogers
Hi JC, Thanks much for the updates. I’ll take another look over the weekend. - Paul Sent from my iPhone > On Nov 9, 2018, at 2:02 PM, Jean-Claude Cote wrote: > > Hey Paul, > > In my pull request you mentioned handling splits.. I put a comment in the > pull request but essentially msgpack

Re: Unit Test Question

2018-11-09 Thread Paul Rogers
Hi Charles, Thanks for the PR. Two suggestions for your test. First, use TupleSchema: TupleSchema schema = new SchemaBuilder() ... .buildSchema(). BatchSchema has some limitations that TupleSchema overcomes. Second, when I did a PR that added unions, I normalized the "buildFoo()" methods.

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-09 Thread Paul Rogers
, would be good to get the existing version into the code base so folks can play with it. Thanks, - Paul On Thursday, November 8, 2018, 3:57:35 PM PST, Paul Rogers wrote: Hi Gautam, Thanks much for the explanations. You raise some interesting points. I noticed that Boaz has just filed

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-08 Thread Paul Rogers
Hi Gautam, Thanks much for the explanations. You raise some interesting points. I noticed that Boaz has just filed a JIRA ticket to tackle the inefficient count distinct case. To take a step back, recall that Arina is working on a metadata proposal. A key aspect of that proposal is that it

Re: Handling schema change in blocking operators

2018-11-06 Thread Paul Rogers
possibility for the Hash operators is to have some hash function compatibility, like HashFunc( INT 567 ) == HashFunc( BIGINT 567 ), to simplify (and avoid rehashing). Thanks, Boaz On 11/6/18 12:25 PM, Paul Rogers wrote: > Hi Aman, > > I would completely agree with the
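The hash-compatibility idea above can be sketched as follows. This is only an illustration, assuming that the narrower integer type is widened to 8 bytes before hashing; the function name `hash_integer` and the use of CRC32 are hypothetical choices for the example and are not Drill's hash function.

```python
import struct
import zlib

def hash_integer(value: int) -> int:
    """Hash an integer after widening it to a signed 8-byte value.

    Because INT and BIGINT inputs are both packed the same way,
    equal numeric values hash identically regardless of declared type.
    """
    widened = struct.pack("<q", value)  # little-endian signed 64-bit
    return zlib.crc32(widened)

# The same numeric value, whether it arrived as INT or as BIGINT.
int_hash = hash_integer(567)
bigint_hash = hash_integer(567)

# Hashing the native-width bytes instead would feed different input
# to the hash function, which is what breaks compatibility.
native_int_bytes = struct.pack("<i", 567)     # 4 bytes
native_bigint_bytes = struct.pack("<q", 567)  # 8 bytes
```

With this kind of widening, a hash table built on INT keys would not need rehashing if later batches deliver the same column as BIGINT.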

[jira] [Created] (DRILL-6832) Remove old "unmanaged" sort implementation

2018-11-06 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6832: -- Summary: Remove old "unmanaged" sort implementation Key: DRILL-6832 URL: https://issues.apache.org/jira/browse/DRILL-6832 Project: Apache Drill

Re: Handling schema change in blocking operators

2018-11-06 Thread Paul Rogers
s it means that as soon as the schema changes, it emits the previous Record Batch and starts a new output batch. For the blocking operators, there are more things to take care of and I created DRILL-6829 <https://issues.apache.org/jira/browse/DRILL-6829> to capture that. Aman On Mon, Nov 5, 2018 at 8:50
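The streaming behavior described above (emit the current batch when the schema changes, then start a new one) can be sketched roughly as below. This is a simplified model, assuming rows are plain dictionaries and a "schema" is just the ordered list of column names and Python type names; `schema_of` and `split_on_schema_change` are hypothetical helpers, not Drill APIs.

```python
from itertools import groupby

def schema_of(row: dict) -> tuple:
    """Describe a row as ordered (column name, type name) pairs."""
    return tuple((name, type(value).__name__) for name, value in row.items())

def split_on_schema_change(rows):
    """Yield (schema, batch) pairs, closing a batch whenever the schema changes."""
    # groupby groups only consecutive rows with an equal key, which matches
    # the "emit previous batch, start a new one" behavior.
    for schema, group in groupby(rows, key=schema_of):
        yield schema, list(group)

rows = [
    {"a": 1},
    {"a": 2},
    {"a": "three"},  # type of column "a" changes from int to str
    {"a": "four"},
]
batches = [batch for _, batch in split_on_schema_change(rows)]
print(len(batches))  # 2
```

A blocking operator such as a sort cannot simply emit and move on like this, since it must hold all input before producing output, which is the harder case the message refers to.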

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-06 Thread Paul Rogers
Hi All, Stats would be a great addition. Here are a couple of issues that came up in the earlier code review, revisited in light of recent proposed work. First, the code to gather the stats is rather complex; it is the evolution of some work an intern did way back when. We'd be advised to find

Re: Handling schema change in blocking operators

2018-11-05 Thread Paul Rogers
Hi Aman, Thanks much for the write-up. My two cents, FWIW. As the history of this list has shown, I've fought with the schema change issue multiple times: in sort, in JSON, in the row set loader framework, and in writing the "Data Engineering" chapter in the Learning Drill book. What I have

Re: logging in test cases produces two outputs

2018-11-04 Thread Paul Rogers
ore logback.xml. https://github.com/apache/drill/blob/7b0c9034753a8c5035fd1c0f1f84a37b376e6748/common/src/test/resources/logback-test.xml Should I be using a logback-text.xml in my personal project or should that common logback-test.xml be removed ? Thanks Paul jc On Sat, Nov 3, 2018 at 3:39 PM Paul

Re: logging in test cases produces two outputs

2018-11-03 Thread Paul Rogers
Hi JC, Your code looks fine. I usually start with the default log level (ERROR), then turn on DEBUG for specific modules, as you do. I then see my INFO or DEBUG messages. My code looks like yours, so I'm not sure why you are seeing two messages. Perhaps you are logging ERROR level messages?

Re: msgpack reading schema files checksum error

2018-10-30 Thread Paul Rogers
Looks like Google found a couple of hits: [1] and [2] I'm not an expert here, but I wonder if you can just remove the file. Never had Drill or HDFS complain when asking it to read a local file without the .crc file... Thanks, - Paul [1] 

Re: Is there any instructions for new developers for drill

2018-10-25 Thread Paul Rogers
-storage-plugin/ [3] https://drill.apache.org/docs/connect-a-data-source-introduction/ [4] https://github.com/apache/drill/tree/master/contrib [5] https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store [6] https://github.com/paul-rogers/drill/wiki Kind

Re: [ANNOUNCE] New Committer: Gautam Parai

2018-10-22 Thread Paul Rogers
Congrats Gautam! - Paul Sent from my iPhone > On Oct 22, 2018, at 8:46 AM, salim achouche wrote: > > Congrats Gautam! > >> On Mon, Oct 22, 2018 at 7:25 AM Arina Ielchiieva wrote: >> >> The Project Management Committee (PMC) for Apache Drill has invited Gautam >> Parai to become a

Re: msgpack test case fails same as with json, problem with testing framework?

2018-10-21 Thread Paul Rogers
May be a bug in my code. Please create a JIRA ticket and attach your input file and test code so I can reproduce the problem. Thanks, - Paul On Sunday, October 21, 2018, 6:24:18 AM PDT, Jean-Claude Cote wrote: I am trying to write a test case for a repeated map scenario. However

Re: msgpack handling lists with elements of different types

2018-10-17 Thread Paul Rogers
Hi JC, Bingo, you just hit the core problem with schema-on-read: there is no "right" rule for how to handle ambiguous or inconsistent schemas. Take your string/binary example. You determined that the binary fields were actually strings (encoded in what, UTF-8? ASCII? Host's native codeset?)

Re: msgpack read batch size larger than 4096 causes assertion error

2018-10-15 Thread Paul Rogers
  } The test passes. Then I change   public static final long DEFAULT_ROWS_PER_BATCH = BaseValueVector.INITIAL_VALUE_ALLOCATION ; to be   public static final long DEFAULT_ROWS_PER_BATCH = BaseValueVector.INITIAL_VALUE_ALLOCATION + 1; and the test case fails. I can attach the whole trace outpu

Re: msgpack read batch size larger than 4096 causes assertion error

2018-10-13 Thread Paul Rogers
, 2018, 6:22:40 PM PDT, Paul Rogers wrote: Drill enforces two hard limits: 1. The maximum number of rows in a batch is 64K. 2. The maximum size of any vector is 4 GB. We have found, however, that fragmentation occurs in our memory allocator for any vector larger than 16 MB. (This is, in fact

Re: msgpack read batch size larger than 4096 causes assertion error

2018-10-12 Thread Paul Rogers
Drill enforces two hard limits: 1. The maximum number of rows in a batch is 64K. 2. The maximum size of any vector is 4 GB. We have found, however, that fragmentation occurs in our memory allocator for any vector larger than 16 MB. (This is, in fact the original reason for the result set loader
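To make the arithmetic behind these limits concrete, here is a rough sketch of how a per-batch row count could be derived from them. It assumes fixed-width columns and ignores offset vectors and power-of-two rounding; the constant and function names are hypothetical, and only the 64K-row and 16 MB figures come from the message above.

```python
MAX_ROWS_PER_BATCH = 64 * 1024       # hard row limit quoted above (64K)
MAX_VECTOR_BYTES = 16 * 1024 * 1024  # fragmentation threshold quoted above (16 MB)

def rows_per_batch(widest_column_bytes: int) -> int:
    """Largest row count that keeps every vector at or under 16 MB."""
    if widest_column_bytes <= 0:
        raise ValueError("column width must be positive")
    return min(MAX_ROWS_PER_BATCH, MAX_VECTOR_BYTES // widest_column_bytes)

narrow = rows_per_batch(4)    # a 4-byte column is capped by the row limit
wide = rows_per_batch(1024)   # a 1 KB column is capped by the vector size
```

Under these assumptions, narrow columns hit the 64K row cap first, while wide columns hit the 16 MB vector threshold well before that.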

Re: configure logback to trace level in junit tests

2018-10-12 Thread Paul Rogers
Hi JC, You are asking how to use logs with unit tests. Let's talk about the two ways you might be using logging, because each has a different answer. In general, a unit test should use JUnit assert calls to verify that behavior is as expected. No one ever looks at output from tests unless a

Scan mechanism PR

2018-10-12 Thread Paul Rogers
ofit other readers as the need arises. The entire mechanism, and the design goals behind it, are documented in [1]. Thanks, - Paul [1] https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades On Thursday, October 11, 2018, 2:51:22 AM PDT, Arina Yelchiyeva wrote: Paul,

[jira] [Created] (DRILL-6791) Merge scan projection framework into master

2018-10-11 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6791: -- Summary: Merge scan projection framework into master Key: DRILL-6791 URL: https://issues.apache.org/jira/browse/DRILL-6791 Project: Apache Drill Issue Type

Re: Unsupported type LIST when CTAS arrayOfArray (JSON or Msgpack) into Parquet

2018-10-11 Thread Paul Rogers
I don't believe Parquet supports 2D arrays, does it? Thanks, - Paul On Thursday, October 11, 2018, 7:52:38 PM PDT, Jean-Claude Cote wrote: I'm trying to write the following JSON file into a parquet file. However my CTAS query returns an error Unsupported type LIST. Any ideas why,

Re: msgpack format reader with schema learning feature

2018-10-10 Thread Paul Rogers
any ETA when you will be able to submit the PRs? Maybe > also do some presentation? Can you please share Jira number(-s) as well? > > Kind regards, > Arina > > On Wed, Oct 10, 2018 at 7:31 AM Paul Rogers > wrote: > > > Hi JC, > > > > Very cool indeed. You

Re: msgpack format reader with schema learning feature

2018-10-10 Thread Paul Rogers
Maybe also do some presentation? Can you please share Jira number(-s) as well? Kind regards, Arina On Wed, Oct 10, 2018 at 7:31 AM Paul Rogers wrote: > Hi JC, > > Very cool indeed. You are the man! > > Ted's been advocating for this approach for as long as I can remember (2+ >

Re: msgpack format reader with schema learning feature

2018-10-09 Thread Paul Rogers
Hi JC, Very cool indeed. You are the man! Ted's been advocating for this approach for as long as I can remember (2+ years). You're well on your way to solving the JSON problems that I documented a while back in DRILL-4710 and summarize as "Drill can't predict the future." Basically, without a

Re: How to use alter session to configure contributed format plugins

2018-10-07 Thread Paul Rogers
Hi JC, Unless something has changed recently, it turns out that system/session options are global: they must be defined in the one big file you discovered, and default values must be listed in the master drill-module.conf file. It would be a handy feature to modify this to allow modules to add

Re: Possible way to specify column types in query

2018-10-02 Thread Paul Rogers
DESCRIBE to work I need to implement it at the planner level? Thanks Paul jc On Tue, Oct 2, 2018 at 12:54 PM Paul Rogers wrote: > Hi JC, > > Now that you have a working reader, sounds like your next task is to pass > column schema to the reader. There are two ways to do that. There a

Re: Possible way to specify column types in query

2018-10-02 Thread Paul Rogers
ea. How would I best leverage such a file. Thank you very much jc On Mon, Oct 1, 2018 at 9:51 PM Paul Rogers wrote: > Hi JC, > > One of Drill's challenges is that it cannot predict the future: it can't > know what type your column will be in later records or in another file. Al

Re: Possible way to specify column types in query

2018-10-01 Thread Paul Rogers
l/exec/store/log [5]  https://github.com/paul-rogers/drill/tree/RowSetRev4/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json On Monday, October 1, 2018, 6:03:38 PM PDT, Jean-Claude Cote wrote: I'm implementing a msgpack reader and use the JSON reader as inspi

Re: [ANNOUNCE] New Committer: Chunhui Shi

2018-09-28 Thread Paul Rogers
Congrats Chunhui! Thanks, - Paul On Friday, September 28, 2018, 2:17:42 AM PDT, Arina Ielchiieva wrote: The Project Management Committee (PMC) for Apache Drill has invited Chunhui Shi to become a committer, and we are pleased to announce that he has accepted. Chunhui Shi has

[jira] [Created] (DRILL-6759) CSV 'columns' array is incorrectly case sensitive

2018-09-23 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6759: -- Summary: CSV 'columns' array is incorrectly case sensitive Key: DRILL-6759 URL: https://issues.apache.org/jira/browse/DRILL-6759 Project: Apache Drill Issue

Contrib module not in root pom.xml?

2018-09-11 Thread Paul Rogers
Hi All, I'm hoping someone can explain a mystery in the root pom.xml file. We have a list of modules:       tools     protocol     common     logical     exec     drill-yarn     distribution   Note that contrib is not part of this list. The result is that, in a normal build, the contrib

Re: Drill in the distributed compute jungle

2018-09-10 Thread Paul Rogers
018 at 10:21 PM Paul Rogers wrote: > Hi All, > > Been reading up on distributed DB papers of late, including those passed > along by this group. Got me thinking about Arina's question about where > Drill might go in the long term. > > One thing I've noticed is that t

Drill in the distributed compute jungle

2018-09-09 Thread Paul Rogers
Hi All, Been reading up on distributed DB papers of late, including those passed along by this group. Got me thinking about Arina's question about where Drill might go in the long term. One thing I've noticed is that there are now quite a few distributed compute frameworks, many of which

Re: Possible way to specify column types in query

2018-09-09 Thread Paul Rogers
ee section 6.3. Also need to declare column datatype before the query. [1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers wrote: > Hi All, > > We've discussed quite a few times whether Drill should or should not > support or require schema

Possible way to specify column types in query

2018-09-06 Thread Paul Rogers
Hi All, We've discussed quite a few times whether Drill should or should not support or require schemas, and if so, how the user might express the schema. I came across a paper [1] that suggests a simple, elegant SQL extension: EXTRACT [:] {,[:]} FROM Paraphrasing into Drill's SQL: SELECT

Re: Contrib Plugin Question

2018-09-03 Thread Paul Rogers
I've been helping Charles with this. He's got a branch that works some times, but not others. * If I run his unit test from Eclipse, it works. * If I run his unit test from the command line with Maven, it works. * If he runs his unit test using the mechanism he is using, Drill can't find his

Re: [ANNOUNCE] New PMC member: Charles Givre

2018-09-03 Thread Paul Rogers
Congratulations Charles! I look forward to your continued strong voice as an expert Drill user in your new role. - Paul Sent from my iPhone > On Sep 3, 2018, at 10:22 AM, Vitalii Diravka wrote: > > Congrats Charles! > And thank you for your enthusiasm and work on Drill > >> On Mon, Sep 3,

Re: [ANNOUNCE] New Committer: Weijie Tong

2018-08-31 Thread Paul Rogers
Congratulations Weijie, thanks for your contributions to Drill. Thanks, - Paul On Friday, August 31, 2018, 8:51:30 AM PDT, Arina Ielchiieva wrote: The Project Management Committee (PMC) for Apache Drill has invited Weijie Tong to become a committer, and we are pleased to announce

Re: Issue reading JSON file prohibiting from creating a Parquet file from it.

2018-08-30 Thread Paul Rogers
Hi Sri, The fact that each line can be converted, but multiple throw an error suggests that you may have conflicting types. Drill tries to handle such cases, but there are many holes, sounds like you are hitting one of them. The error message mentions "SingleListWriter". The single list

Re: [Question] HiveStoragePlugin and NativeParquetRowGroupScan

2018-08-24 Thread Paul Rogers
Ro= > QDZyPZEwolNN1wu5z4QMwajvdQ3iQPPQ0yycxhUUKw0= > > > Kind regards > Vitalii > > > On Thu, Aug 23, 2018 at 3:02 AM Paul Rogers > wrote: > > > Hi Tim, > > > > I don't have an answer. But, I can point out some factors to consider. > > > > H

Re: [ANNOUNCE] New PMC member: Volodymyr Vysotskyi

2018-08-24 Thread Paul Rogers
Congratulations Volodymyr! Thanks, - Paul On Friday, August 24, 2018, 5:53:25 AM PDT, Arina Ielchiieva wrote: I am pleased to announce that Drill PMC invited Volodymyr Vysotskyi to the PMC and he has accepted the invitation. Congratulations Vova and thanks for your contributions!

Re: [Question] HiveStoragePlugin and NativeParquetRowGroupScan

2018-08-22 Thread Paul Rogers
Hi Tim, I don't have an answer. But, I can point out some factors to consider. Hive describes a set of data in a specific file system. Would make sense to associate that file system with the Hive configuration. Else, I could use a Hive metastore for FS A, with a DFS configured for FS B, and

Re: [DISCUSSION] Does schema-free really need

2018-08-22 Thread Paul Rogers
d of > me. > > On Tue, Aug 21, 2018 at 4:43 PM Paul Rogers > wrote: > >> Hi Chris, >> > > >> Later, when Drill sees the first Varchar, it can change the type from, >> say, batch 3 onwards. But, JDBC and ODBC generally require the schema be >> know

The Ray framework

2018-08-21 Thread Paul Rogers
Hi All, There is a cool new distributed framework coming out of UC Berkeley: Ray [1]. This is part of the RISE project which is the successor to the AmpLab project that produced Spark. The Ray paper [2] provides a great overview. (quote) Ray is a high-performance distributed execution

Re: 回复:Is Drill query execution processing model just the same idea with the Spark whole-stage codegen improvement

2018-08-21 Thread Paul Rogers
ld not listen to today's hangouts session unfortunately, sorry for possible ignorance) Thanks, Best Regards, Alex On Thu, Aug 9, 2018 at 7:51 PM Paul Rogers wrote: > Hi Alex, > > Perhaps Parth can jump in here as he has deeper knowledge of Parquet. > > My understanding is

Re: [DISCUSSION] Replacing Preconditions.checkNotNull() with Objects.requireNonNull()

2018-08-21 Thread Paul Rogers
Hi All, My two cents... The gist of the discussion is that 1) using Objects.requireNonNull() reduces the Guava import footprint, vs. 2) we are not removing the Guava dependency, so switching to Objects.requireNonNull() is unnecessary technically and is instead a personal preference. We make

Re: [DISCUSSION] Does schema-free really need

2018-08-21 Thread Paul Rogers
:55 PM PDT, Chris Cunningham wrote: Hi.  Mostly off topic, but reading about this issue has finally prompted a response. On Wed, Aug 15, 2018 at 5:46 PM Paul Rogers wrote: > If we provide schema hints ("field x, when it appears, will be a Double"), > then Drill need

Re: "Crude-but-effective" Arrow integration

2018-08-20 Thread Paul Rogers
or Drill's internals? That's really the question the group will want to answer. More details below. Thanks, - Paul On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning wrote: Inline. On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers wrote: > ... > By contrast, migrating Drill in

Re: "Crude-but-effective" Arrow integration

2018-08-20 Thread Paul Rogers
ould be easier than was thought. On Sat, Aug 18, 2018, 16:44 Paul Rogers wrote: > Hi All, > > Charles recently suggested why Arrow integration could be helpful. (See > quote below.)  When we've looked at reworking Drill's internals to use > Arrow, we found the project to be cost

Re: "Crude-but-effective" Arrow integration

2018-08-20 Thread Paul Rogers
we could avoid major work to Drill. I was concerned in reading about the ideas for Arrow integration, that it would complicate existing UDFs and/or Format-plugins.  How much of this do you envision would be included with Drill? —C > On Aug 18, 2018, at 19:44, Paul Rogers wrote: > >

"Crude-but-effective" Arrow integration

2018-08-18 Thread Paul Rogers
Hi All, Charles recently suggested why Arrow integration could be helpful. (See quote below.)  When we've looked at reworking Drill's internals to use Arrow, we found the project to be costly with little direct benefit in terms of performance or stability. But, Charles points out that the real

Re: [DISCUSSION] current project state

2018-08-17 Thread Paul Rogers
; do. > > We also need some  evangelists to broadcast the Drill project  to adopt > more contributors. > It’s rarely to see Drill’s tech show to expand its community influence. > > On Wed, Aug 15, 2018 at 4:26 AM Paul Rogers > wrote: > > > I wonder if we should pop th

Re: [ANNOUNCE] New PMC member: Boaz Ben-Zvi

2018-08-17 Thread Paul Rogers
Congratulations Boaz! - Paul On Friday, August 17, 2018, 2:56:27 AM PDT, Vitalii Diravka wrote: Congrats Boaz! Kind regards Vitalii On Fri, Aug 17, 2018 at 12:51 PM Arina Ielchiieva wrote: > I am pleased to announce that Drill PMC invited Boaz Ben-Zvi to the PMC and > he has

Re: [DISCUSSION] Does schema-free really need

2018-08-16 Thread Paul Rogers
; hope we move the mess schema solving logic out of Drill to let the code > cleaner by defining the schema firstly with DDL statements. If we agree on > this, the work should be a sub work of DRILL-6552. > > On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers > wrote: > > > Hi Ted, &

Re: [Question] ValueVector Contract and Usage

2018-08-15 Thread Paul Rogers
Hi Tim, IIRC, you have to do an initial allocation. There was a bug that, if you didn't, the setSafe would try to double your vector from 0 items to 0 items. This would be too small, so it would double again, forever. In general, you don't want to start with an empty vector (or the default
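The doubling problem described above is easy to reproduce in miniature. The sketch below models only the capacity arithmetic, not Drill's vector code; `naive_grow`, `safe_grow`, and the initial size of 4096 are assumptions made for illustration.

```python
def naive_grow(capacity: int, needed: int, max_doublings: int = 64):
    """Double capacity until it fits; return None if it never does.

    The iteration cap stands in for the endless loop: doubling 0 stays 0.
    """
    for _ in range(max_doublings):
        if capacity >= needed:
            return capacity
        capacity *= 2
    return capacity if capacity >= needed else None

def safe_grow(capacity: int, needed: int, initial: int = 4096) -> int:
    """Same doubling, but start from a non-zero initial allocation."""
    if capacity == 0:
        capacity = initial  # the "initial allocation" step
    while capacity < needed:
        capacity *= 2
    return capacity

stuck = naive_grow(0, 10)      # never makes progress from zero
fixed = safe_grow(0, 5000)     # 4096 -> 8192
```

Starting from any non-zero capacity guarantees progress, which is why the initial allocation matters even when the eventual size is unknown.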

Re: [DISCUSSION] Does schema-free really need

2018-08-15 Thread Paul Rogers
Hi Ted, I like the "schema auto-detect" idea. As we discussed in a prior thread, caching of schema is a nice add-on once we have defined the schema-on-read mechanism. Maybe we first get it to work with a user-provided schema. Then, as an enhancement, we offer to infer the schema by scanning

Re: [DISCUSSION] Does schema-free really need

2018-08-15 Thread Paul Rogers
Hi Weijie, Thanks for raising this topic. I think you've got a great suggestion. My two cents: there is no harm in reading all manner of ugly data. But, rather than try to process the mess throughout Drill (as we do today with schema changes, just-in-time code generation, union vectors and the

Re: [DISCUSSION] current project state

2018-08-14 Thread Paul Rogers
lities like that are really needed.  I’d like to see a > generic HTTP storage plugin, a storage plugin for Google Sheets,  If I can > figure out how storage plugins work, I’ll gladly work on some of these. > > Just my .02. > — C > > > > > > > On Aug 13, 2018, at 2
