current version and report back your
results. I will also try to contact Timothy Chen, the committer who drove
the effort, to see if he would be interested in helping to update the
script if need be.
http://tnachen.wordpress.com/2013/12/24/drill-on-aws-emr/
- Jason Altekruse
On Thu, Dec 4, 2014
While it is not part of the open source Drill project, MapR Technologies
provides an ODBC driver for Drill. A quick search of the PHP docs seems to
indicate that PHP can connect to an ODBC provider. The REST interface is
also available, and would be bundled with the default open source build,
altho
Hi Adam,
I have a few thoughts that might explain the difference in query times.
Drill is able to read a subset of the data from a Parquet file when
selecting only a few columns out of a large file. Drill will give you
faster results if you ask for 3 columns instead of 10 in terms of read
perform
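For readers of the archive, the column-pruning point can be sketched with a hypothetical table (the path and column names below are assumptions, not values from the thread):

```sql
-- Drill reads only these three columns' data from the Parquet file
SELECT user_id, event_time, status
FROM dfs.`/data/events.parquet`;

-- A star query forces every column to be read and decoded
SELECT * FROM dfs.`/data/events.parquet`;
```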
Just made one, I put some comments there from the design discussions we
have had in the past.
https://issues.apache.org/jira/browse/DRILL-1950
- Jason Altekruse
On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore wrote:
> Just a quick follow up on this - is there a JIRA item for implementing p
As we currently use file suffixes to determine file types on read, I think
it would make sense to have the same behavior on write (obviously with the
option to define overrides as users need them). Thoughts on the best user
experience here?
-Jason Altekruse
On Tue, Jan 6, 2015 at 1:01 PM
ments have been made to the parquet mainline that may give us the
performance we are looking for in these cases. We haven't had time to
revisit it so far.
-Jason Altekruse
On Wed, Jan 7, 2015 at 4:04 PM, Adam Gilmore wrote:
> Out of interest, is there a reason Drill implemented e
I believe that Jim may have given the appropriate query to satisfy the
needs of the original question, but for anyone who finds this thread I
wanted to give a quick clarification about kvgen. The purpose of this
function is to allow queries against maps where the keys themselves
represent data rath
I do not think we currently consider JSON files splittable. If we do treat
them as such, it would depend on the file size and the read locality
available on the nodes. Especially with a select * (or a count(*))
query there is nothing to parallelize except for the read operation and a
simp
We do not currently have information gathered during execution. There was a
discussion at some point about gathering and exposing information that is
usually reported at the end of the query in the Web UI query profile view
during execution and updating that interface to track progress. I'm not
su
Ralph,
The most common reason for not being able to start a Drillbit is that the
port used for the Web UI is still being consumed by another instance of
Drill. If you have previously tried to start a Drillbit outside of embedded
mode there might still be some portion of the process still running.
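A quick way to check for this, sketched in Python; 8047 is Drill's default Web UI port, adjust if yours is configured differently:

```python
import socket

def port_in_use(port: int, host: str = "localhost") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

# 8047 is the default Drill Web UI port; a leftover Drillbit process
# holding it will prevent a new embedded instance from starting
print("Web UI port busy:", port_in_use(8047))
```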
Hello Drillers,
Please join us tomorrow at 10am Pacific for our community meeting. If you
are new to Drill, have questions about the current work being done
throughout the community, or you just want to listen in, anyone is welcome
to participate. The link is always available from the website unde
Currently this will fail on a repeated list or repeated map. This function
has only been defined for lists of scalars. There is a JIRA open for the
enhancement request, the priority on this should probably be bumped up with
the use cases we have seen. Currently it is marked "Future".
https://issue
Hi Jen,
Unfortunately the mailing list does not allow attachments. Please feel free
to upload your image to a service like IMGUR and share a link.
http://imgur.com/
-Jason
On Mon, Jan 26, 2015 at 8:05 AM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:
> Hi Jen,
>
> Not sure if the mai
As Aditya commented before, this will work if the lists only contain scalars:
repeated_count('entities.urls') > 0
If the lists contain maps unfortunately this is not available today. There
is an enhancement request open for this feature. I have marked it for a fix
in 0.9 as it is more of a feature
under
"Community -> Get Involved", I have copied it below as well.
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
As I have done for the last few sessions, I encourage anyone thinking of
attending to suggest topics they would like to discuss.
Thanks,
Jason Altekruse
Sudhakar opened an issue for this, so I responded there. Steven is right,
this is the current expected functionality, but I discuss there the reasons
for it and opened the discussion for use cases that need this functionality.
On Tue, Feb 3, 2015 at 2:28 PM, Steven Phillips
wrote:
> I think flat
. I haven’t used Big Query for
> this, I used it for flat tables, but I can check.
>
> Thanks
> Sudhakar Thota
>
>
>
> On Feb 3, 2015, at 2:45 PM, Jason Altekruse
> wrote:
>
> > Sudhakar opened an issue for this, so I responded there. Steven is right,
> > t
Hao,
The dir columns are always added to the records coming out of a scan. The
issue is with trying to avoid unneeded reads altogether. If you look at the
query plan you should see that the scan is going to read all of the files
and the filter against the directory column will be applied in a sepa
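To make the mechanics concrete for archive readers, filters on the implicit directory columns use dir0, dir1, and so on; the path layout here is hypothetical:

```sql
-- With files laid out as /logs/<year>/<month>/..., dir0 is the year
-- directory and dir1 the month directory
SELECT *
FROM dfs.`/logs`
WHERE dir0 = '2015' AND dir1 = '02';
```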
Sorry about the last minute notice, but unfortunately I'm going to have to
cancel the community hangout today.
Yash, I saw both of your messages, which I assume were in anticipation of a
meeting today. I will find someone today to review the Cassandra storage
plugin. I have some thoughts on the Py
I don't think this actually answers your question. You can limit your
filters by directory to avoid reads from the filesystem, and some of the
storage plugins like HBase and Hive implement scan-level pushdown, but I do
not know if this is sophisticated enough that a join would be aware of the
parti
Almost all of the heavy lifting has been done for us by Calcite. See the
discussion here for a little bit of background and the parts we still need
to implement.
http://mail-archives.apache.org/mod_mbox/drill-dev/201501.mbox/%3CCAMpYv7APxne4JzM_wBrAtBd5Emkogj1jpnPeQQ3bA1E-7RKf=w...@mail.gmail.com%
Sorry about the lack of a reminder this week. We were off for Presidents'
Day yesterday and I didn't think about it. If anyone is available, feel
free to join the hangout!
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
-Jason
gh for some members of
the community to see the broader need for a use case like their own.
http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/
-Jason Altekruse
On Mon, Feb 23, 2015 at 11:45 AM, Steven Phillips
wrote:
> To the best of my knowledge, no one has started working on this yet.
>
back into the root of the schema currently.
-Jason Altekruse
On Fri, Feb 27, 2015 at 8:25 AM, Ted Dunning wrote:
> I was just looking through the documentation and I don't see a way to group
> data and then create a list. Flatten turns a list into individual
> records. I woul
Even beyond the issue of types, there are structures that are expressible
in XML that do not fit into a database model well, even one like Drill that
supports complex data. The primary issue is text stored between opening and
closing tags. I don't think these features of XML are commonly used by
sy
m the
attendees, go around and introduce any new people and jump right into the
discussions. Notes can be high level with a quick description of the issue,
features, etc.
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
Thanks,
Jason Altekruse
I did not see Yash's message before I sent this, please see his message for
more info.
On Tue, Mar 3, 2015 at 8:15 AM, Jason Altekruse
wrote:
> I will not have time this morning to lead the hangout, anyone with topics
> to discuss is welcome to still attend and post minutes to t
re. Having an agenda encourages newcomers and mailing list
lurkers to come and discuss topics they think sound interesting.
Thanks,
Jason Altekruse
Hangout happening now!
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
On Mon, Mar 9, 2015 at 11:01 AM, Jason Altekruse
wrote:
> Hello Drillers,
>
> Please join us tomorrow at 10am Pacific for our community meeting. If you
> are new to Drill, have questio
work with tools with incomplete UTF-8
support
On Tue, Mar 10, 2015 at 9:58 AM, Jason Altekruse
wrote:
> Hangout happening now!
>
> https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
>
> On Mon, Mar 9, 2015 at 11:01 AM, Jason Altekruse > wrote:
>
>> Hello
Come join the hangout to talk about what's happening with Drill, the recent
0.8 release candidate and the upcoming schedule.
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
- Jason
myself for producing a better error message if this kind
of planning issue comes up in the future.
Thanks,
Jason Altekruse
On Fri, Mar 27, 2015 at 11:01 AM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:
> I would recommend to not use a count(*) but rather pick a column to use
&
The error message indicates that this is a planning bug. Please look to see
if you can find an open JIRA for the issue and add any information
about your case there. If there is not one already filed, please open a new
one and try to provide as much explanation as you can about the data
invo
Hi Muthu,
Welcome to the Drill community!
Unfortunately the mailing list does not allow attachments; please send
along the error log copied into a mail message.
If you are working with the 0.7 version of Drill, I would recommend
upgrading to the new 0.8 release that just came out; there were a
good
performance for further analysis.
-Jason
On Thu, Apr 2, 2015 at 8:49 AM, Jason Altekruse
wrote:
> Hi Muthu,
>
> Welcome to the Drill community!
>
> Unfortunately the mailing list does not allow attachments, please send
> along the error log copied into a mail message.
>
>
status.
> > >
> > > Error message got from ODBC is
> > >
> > > "ERROR [HY000] [MapR][Drill] (1040) Drill failed to execute the query:
> > > SELECT * FROM `HDFS`.`root`.`./user/hadoop2/unclaimedaccount.json`
> LIMIT
> > 100
> > >
@Adam
This is something that has come up on the list before, you may be thinking
of http://blinkdb.org/. This is something that would definitely be
interesting to explore once we are stable and past 1.0. We certainly can
try to help you along if you would like to start some of this work.
@Marci
Hi Latha,
Unfortunately the mailing list does not support attachments, could you
possibly throw the file onto a file sharing service and share a link? If
the file is below 20 MB you should be able to file a JIRA issue and upload
it there as an attachment if you don't have another host available.
Hello Phil,
Unfortunately this was a bug that was in flatten all along that ended up
being exposed when we fixed another system-wide issue with supporting large
lists and very wide strings. I have posted a patch that fixes this issue
that is in review, and I want to do a little additional cleanup
Congrats! Certainly well deserved!
On Thu, Apr 16, 2015 at 3:35 PM, Ellen Friedman
wrote:
> Congrats Hanifi and thanks for all your work
>
> Ellen
>
> On Thu, Apr 16, 2015 at 2:29 PM, Jacques Nadeau
> wrote:
>
> > The Apache Drill PMC is very pleased to announce Hanifi Gunes as a new
> > commi
Just as a bit of explanation for anyone who finds the thread, what is
happening here is that the csv parser will read files with no commas in
them as a series of records with one value each. This method is a bit of a
clever hack, but it will not work if any of your values have commas in
them. It is
The attachment for the JSON profile made it to the list because it is
ASCII, but the screenshot was blocked as a binary file. We can take a look
at the profile by loading the JSON into an instance of Drill, but just a
reminder about binary attachments for everyone: please upload to a public
host a
If you have some time, join us for our weekly hangout to talk about what is
happening in the Drill community; everyone is welcome. Stop in to introduce
yourself!
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
You can fix the Web UI slowness issue for now by deleting this jar; it was
pulled in as a transitive dependency, but we don't actually need it and it
is causing intermittent class-loading conflicts with classes we are
actually using for the Web UI. As stated before, the permanent fix is
already in
Come join the Drill community as we discuss what has been happening lately
and what is in the pipeline. All are welcome, if you know about Drill, want
to know more or just want to listen in.
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
There should be no upper limit on the size of the tables you can create
with Drill. Be advised that Drill currently operates entirely
optimistically with regard to available resources. If a network connection
between two drillbits fails during a query, we will not currently
re-schedule the work
ure I am adjusting the correct config, these are heap parameters
> within the Drill configure path, not for Hadoop or Zookeeper?
>
>
> > On May 28, 2015, at 12:08 PM, Jason Altekruse
> wrote:
> >
> > There should be no upper limit on the size of the tables you can create
allocation in the drill-env.sh files, and have to
> restart the drill bits.
> >>
> >> How large is the data set you are working with, and your cluster/nodes?
> >>
> >> —Andries
> >>
> >>
> >> On May 28, 2015, at 9:17 AM, Matt
Currently Drill does not allow submission of logical plans. I think that
the web interface is out of date and claims you can submit a logical plan,
but this is not correct. We would like to allow for modification of plans
at the logical level, but we just haven't implemented the feature yet.
arameter of the submit_plan script? Is
> this broken as well?
>
> --
> Piotr Sokólski
>
>
> On Friday 29 May 2015 at 00:29, Jason Altekruse wrote:
>
> > Currently Drill does not allow submission of logical plans. I think that
> > the web interface is out of dat
I tried taking a look at the function to see what the issue was;
from_unixtime is actually a Hive UDF that is just on the classpath and
available by default in Drill. It does look like, as Andries said, it is
returning var16char, which might be a bug. The fact that it is trying to
cast to bigint desp
It sounds like we should not have written to the filesystem if we were not
connected to a single host or a distributed filesystem. The problem is that
the files we wrote will not be associated together the way they would be in
a single filesystem (even a distributed one that would have a common
nam
Hi Mano,
Unfortunately the Apache mail lists aren't very good with attachments; can
you upload it to a public host and share a link?
- Jason Altekruse
On Mon, Jun 1, 2015 at 3:01 PM, Rangaswamy, Manoharan
wrote:
> Hi Hanifi,
>
> As you can see in the log file, I am unable to
/_/event/ci4rdiju8bv04a64efj5fedd0lc
- Jason Altekruse
Hello Drillers,
I have been working on DRILL-3209, which aims to speed up reading from Hive
tables by re-planning them as native Drill reads in the case where the
tables are backed by files that have available native readers. This will
begin with parquet and delimited text files.
To provide the s
This is pretty well implied by Christopher's message, but Drill ships
with a Hive storage plugin which puts Hive jars on the default class path.
Just as with Drill native UDFs we pick up the default Hive functions in
these jars and register them. Another one that was causing some issues was
the H
Hi Rob,
Thanks for putting so much effort into getting Drill set up for your use
case. We know that there are still some sharp edges in Drill, and detailed
information about use cases that are hard to set up helps us to improve the
docs and core project.
As a quick answer, I think you might have ru
If you are not going to be reading a lot of data, in terms of final results
(i.e. your app will consume filtered and/or aggregated results), the REST
API should serve your purposes. For better throughput the JDBC and ODBC
interfaces will be your best bet. Please note that the ODBC driver is not a
p
Be aware, this will work if you turn on all_text_mode, but obviously it
will have some overhead from reading values as varchar and casting the
strings back to numeric types. If you turn on read_numbers_as_double this
will also "work", but be aware we are not using these cast statements as
hints about how to do
the scan
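For reference, the two options mentioned are session/system settings, enabled with the usual ALTER SESSION syntax:

```sql
ALTER SESSION SET `store.json.all_text_mode` = true;
ALTER SESSION SET `store.json.read_numbers_as_double` = true;
```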
The patch is currently in review, I don't think that it is going to
necessarily fix this issue. I have been looking into issues with flatten
and I just opened a new JIRA that I think will actually address your issue.
This is a little bit of a low level issue with how the flatten is currently
bei
estation of the original problem?
>
> -Hanifi
>
> On Fri, Jun 19, 2015 at 10:17 AM, Jason Altekruse <
> altekruseja...@gmail.com>
> wrote:
>
> > The patch is currently in review, I don't think that it is going to
> > necessarily fix this issue. I am have bee
If you have started Drill previously and there is already a configuration
stored in ZooKeeper, we will not pick up the bootstrap-storage-plugins.json
file upon starting Drill. It is only read the first time Drill starts. You
can modify the entry in ZooKeeper yourself by uploading the json file for
t
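For readers of the archive, a minimal bootstrap-storage-plugins.json looks roughly like the following sketch; the plugin name and connection string are placeholders, not values from this thread:

```json
{
  "storage": {
    "dfs": {
      "type": "file",
      "connection": "hdfs://namenode:8020/",
      "enabled": true
    }
  }
}
```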
Join us at our weekly hangout to discuss what has been happening in the
Drill community!
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
That is pacific time, the meeting will start in about 10 minutes.
On Tue, Jun 23, 2015 at 9:59 AM, Jason Altekruse
wrote:
> Join us at our weekly hangout to discuss what has been happening in the
> Drill community!
>
> https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
>
You're welcome, let us know if it works for you.
On Tue, Jun 23, 2015 at 10:41 AM, Ganesh Muthuraman
wrote:
> Thanks Jason. That helps.
> Thanks,G
>
> > From: altekruseja...@gmail.com
> > Date: Tue, 23 Jun 2015 09:14:11 -0700
> > Subject: Re: Cannot start drillbit
> > To: user@drill.apache.org
>
This is a reasonable hack for some cases, but I'm pretty sure this is going
to break the most common purpose of having quotes at all. If you put the
delimiter (tab) between quotes you are going to have it splitting on those
characters where it shouldn't be. There is also the issue that the quotes
Venki can give more specifics on the status, but Hive impersonation was
implemented in 1.1.0, which should be going out for a vote soon. I think
there might have been some limitations on the scope, but I know we have
verified the functionality with several security models.
On Tue, Jun 30, 2015 at
zalez
wrote:
> That's great! Thanks. Presumably I could pull down a nightly and try it
> out? Will drill still be expecting hive .13?
>
> On Tuesday, June 30, 2015, Jason Altekruse
> wrote:
>
> > Venki can give more specifics on the status, but Hive impersonation was
>
Come join the Apache Drill hangout to find out what is new in the upcoming
1.1 release. Anyone with an interest in Drill is welcome to attend.
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
Hello Drillers,
I have created a Google spreadsheet to track leaders of the weekly hangout
(Tuesday at 10am Pacific time) to make sure we always have someone able to
attend the meeting and facilitate discussion. Commitment is pretty low, for
anyone who has attended the hangout it should be easy to
If the general idea of what you are trying to accomplish with this function
is not private, it might be useful to ask about your use case more
generally. Although we are still working to integrate the JDBC plugin into
the Drill mainline and it still requires a thorough testing cycle, this
might be
Just one additional note here: I would strongly advise against converting
csv files using a select * query.
The reason for this is two-fold. Currently we read csv files into a list of
varchars, rather than individual columns. While Parquet supports lists and
we will read them, the rea
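The usual pattern instead is to project Drill's columns[] array explicitly and cast each field; the column names and types below are illustrative only:

```sql
-- Each csv row arrives as a single `columns` array of varchars;
-- project and cast the entries you want into named, typed columns
CREATE TABLE dfs.tmp.typed_copy AS
SELECT
  CAST(columns[0] AS INT)    AS id,
  TRIM(columns[1])           AS name,
  CAST(columns[2] AS DOUBLE) AS amount
FROM dfs.`/data/input.csv`;
```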
, Larry White wrote:
> so the solution is to use select, but with columns specifically defined. is
> that right?
>
> On Thu, Jul 2, 2015 at 4:48 PM, Jason Altekruse
> wrote:
>
> > Just one additional note here, I would strongly advise against converting
> > csv files using
While the format is columnar and we are taking advantage of certain aspects
of the layout, we do not split the read between columns, but instead by the
block abstraction in Parquet which they call Row Groups. Each of these
blocks will contain data from each column, forming a complete set of rows.
It certainly does look like an issue with encoding, but you can see in his
code the query he is trying to run. There are no unicode characters that I
can see. It is possible that this is getting corrupted somehow in the ODBC
driver. Please file a JIRA with this case; I don't have a suggestion for
I'm not very experienced with configuring the various filesystems that
implement the HDFS API, but there is no need for an Azure-specific
plugin. The blob storage exposes the HDFS API, similar to S3 and other
storage systems.
If you can get the hadoop client to run an 'ls' or other filesystem co
I am not aware of anyone doing something like this today, but it seems like
something best handled outside of Drill right now. Drill considers itself
essentially stateless; we do not manage indexes, table constraints or
caching data for any of our current storage systems. There was some work
being
@Alexander If you want to test the speed of the ODBC driver you can do that
without a new storage plugin.
If you get the entire dataset into memory, it will be returned from Drill as
quickly as we can possibly send it to the client. One way to do this is to
insert a sort; we cannot send along any o
give you a billion records with negligible I/o.
> >
> > Sent from my iPhone
> >
> > > On Jul 16, 2015, at 15:43, Jason Altekruse
> > wrote:
> > >
> > > @Alexander If you want to test the speed of the ODBC driver you can do
> > that
> > > with
I could be wrong, but I believe that gzip is not a compression format that
can be split; you must read and decompress the file from start to end. In
this case we cannot parallelize the read. This stackoverflow article mentions
bzip2 as an alternative compression used by Hadoop to solve this problem
and a
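The non-splittable point is easy to demonstrate with Python's standard library: a gzip member carries no block index, so decompression has to start at the beginning of the stream.

```python
import gzip

data = b"row,of,data\n" * 100_000
blob = gzip.compress(data)

# Decompressing the whole member from the start works fine
assert gzip.decompress(blob) == data

# Starting from an arbitrary offset does not: there is no gzip header
# there, so a reader cannot begin work in the middle of the file
middle_ok = True
try:
    gzip.decompress(blob[len(blob) // 2:])
except OSError:  # gzip.BadGzipFile is a subclass of OSError
    middle_ok = False
print("can decompress from the middle:", middle_ok)
```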
g and
processing it.
On Thu, Jul 23, 2015 at 12:08 PM, Juergen Kneissl wrote:
> Hi Jason,
>
> On 07/23/15 18:53, Jason Altekruse wrote:
> > I could be wrong, but I believe that gzip is not a compression that can
> be
> > split, you must read and decompress the file from start to
I'm not sure; it is possible that it is being evaluated during planning to
prune the scan, but the filter above the scan is not being removed as it
should be. I'll try to re-create the case to take a look.
Stefan,
Earlier you had mentioned that it was not only inefficient, but it was also
givin
directory" (hope that makes sense).
>
> I'll let you know when the code is in a good-enough state and I have pushed
> it to github.
>
> Thanks for all the help guys, it's appreciated.
>
> Regards,
> -Stefan
>
>
>
> On Fri, Jul 24, 2015 at 8:46 PM, J
This is actually a known issue: constant folding is not working in the
select clause because of a costing problem. Constant folding currently only
works in the where clause.
https://issues.apache.org/jira/browse/DRILL-2218
On Fri, Jul 24, 2015 at 4:13 PM, Ted Dunning wrote:
> I think that
Hafiz,
One other note that will likely help you with your use case. Drill allows
you to skip reading the header row in a csv file. Without this feature
configured you will likely get number or date format exceptions trying to
cast your csv data to particular data types, as your column names will b
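Header skipping is configured on the text format entry inside the storage plugin; a sketch of the relevant csv format object (option availability depends on your Drill version):

```json
{
  "type": "text",
  "extensions": ["csv"],
  "skipFirstLine": true
}
```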
Put a little more simply, the node that we end up planning the query on is
going to enumerate the files we will be reading in the query so that we can
assign work to given nodes. This currently assumes we are going to know at
planning time (on the single node) all of the files to be read. This
happ
> > Just to clarify this, Jason - you don't necessarily need HDFS or the like
> > for this, if you had say a NFS volume (for example, Amazon Elastic File
> > System), you can still accomplish it, right? Or merely if you had all
> > files duplicated on every node locally.
You also could use the date-part function.
http://drill.apache.org/docs/date-time-functions-and-arithmetic/#date_part-syntax
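For the archive, a quick sketch of date_part usage; the column name and path here are hypothetical:

```sql
SELECT date_part('hour', CAST(event_ts AS TIMESTAMP)) AS event_hour
FROM dfs.`/data/events.json`;
```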
On Fri, Jul 31, 2015 at 9:47 AM, Jacques Nadeau wrote:
> I would think you could cast to time and then provide a time boundary.
>
> I don't remember the exact syntax but
We are going to have a lot of users with less perseverance and black box
debugging skills than you have been showing in your evaluation of Drill. I
would not consider this a stupid user issue at all; we need to be clear
about the state of the system to users. If you have some time to record how
you
I don't know if I missed something, but the Postgres docs seem to indicate
that there is no equivalent to the concept of a SYSTEM option that exists
in Drill, which can be set with a query. Options can be set at server
startup, either in a configuration file or with a command line parameter
[2]. On
If files are available through the HDFS API, which includes remote reads,
Drill is able to read the files. A good use case for Drill is actually
installing on a subset of your nodes to save the overhead of running the
server everywhere while still being able to query all of your data. I have
not se
Drill supports ILIKE for case insensitive matching. Be aware that it is
treated like a regular function, as Steven notes here:
https://issues.apache.org/jira/browse/DRILL-3301
This doc page should be changed to include it; I'll open a pull request.
https://drill.apache.org/docs/operators
On Tue,
One of the times this came up I asked about what the requirements would be,
because pure XML is actually not well suited for placing in a standard SQL
table, and some of the constructs are even hard to map into the
JSON/protobuf model we are currently using for complex data in Drill.
I actually do
One thing you can do to speed up the expression evaluation is to use this
expression instead of regex_replace. This will avoid copying each value
into a short-lived String object, which unfortunately is the only interface
available on the Java regex library we are using within the function. We
shoul
ira/browse/DRILL-1441
On Tue, Sep 8, 2015 at 11:22 AM, Jason Altekruse
wrote:
> One thing you can do to speed up the expression evaluation is to use this
> expression instead of regex_replace. This will avoid copying each value
> into a short lived String object which unfortunately is
A SQL-level null is different from a null at the Java level that would be
giving this exception (we don't represent nulls with an actual null Java
object). There might be a way to work around it, but this is a bug in
Drill. You should be able to make a cast between compatible types even if
there ar
I think it is reasonable to consider that a bug. We should implement the
function both as it works today and as you were originally expecting it.
Any ideas about a good naming scheme for the two?
Unfortunately the regular contains() method does substring matching, but I
think the name repeat
es
> >> >>> determined
> >> >>>> the error to be invalid. Is trying to cast an empty string, or null
> >> value
> >> >>>> to an integer invalid? What's the workaround?
> >> >>>>
> >> >>