Re: MS Windows: Hadoop binaries required to run drill?

2016-01-02 Thread Peder Jakobsen | gmail
Hi Jacques, yes, when I copy all these files over manually from my Linux
machine, everything works as expected on Windows 7 32-bit. The ODBC drivers and
Drill Explorer also work fine.

So what do you think is causing some of these files not to be written on
startup? I have permission on all the folders.

Ideally, the I/O operations Drill performs at startup would be wrapped in a
try/catch block of some sort, so that the program can exit gracefully with
appropriate feedback to the user, no?
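
For illustration, here is a minimal sketch of the kind of guarded startup I/O
I mean; the path, file name, and file contents below are placeholders, not
Drill's actual bootstrap logic:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class StoragePluginBootstrapSketch {
      public static void main(String[] args) {
        // The thread mentions C:\tmp\drill\sys.storage_plugins on Windows.
        Path pluginDir = Paths.get("/tmp/drill/sys.storage_plugins");
        try {
          Files.createDirectories(pluginDir);
          // Placeholder for writing the default plugin definitions (cp, dfs, ...).
          Files.write(pluginDir.resolve("hive.sys.drill"),
              "{}".getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
          // Fail with an actionable message instead of leaving an empty file behind.
          System.err.println("Could not initialize storage plugins under " + pluginDir
              + ": " + e.getMessage() + " (check folder permissions)");
          System.exit(1);
        }
      }
    }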

Let me know if I can be of further help in tracking down this problem.

Peder J.


On Thu, Dec 31, 2015 at 6:16 PM, Jacques Nadeau  wrote:

> When Drill first starts up (or encounters an empty folder for embedded
> data), it will automatically create a number of new storage plugins. This
> includes cp and dfs, as well as disabled ones for hive, hbase, etc. It
> seems like your Drill is failing to do this. Since you only copied one
> plugin into your tmp directory (the hive plugin), you will not be able to
> query from the cp plugin. You would need to copy all the default plugins to
> get this working.  This potentially provides a workaround but it doesn't
> indicate why Drill is failing to initialize these settings.
>
> Can you possibly try this under the Administrator account if one exists?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Dec 31, 2015 at 2:50 PM, Peder Jakobsen | gmail <
> pjakob...@gmail.com
> > wrote:
>
> > I spoke too soon, perhaps  (but it must be a simple I/O issue on startup,
> > no?)
> >
> >
> > C:\devel\apache-drill-1.4.0\bin>sqlline.bat -u "jdbc:drill:zk=local"
> > DRILL_ARGS - " -u jdbc:drill:zk=local"
> > HADOOP_HOME not detected...
> > HBASE_HOME not detected...
> > Calculating Drill classpath...
> > Dec 31, 2015 5:39:53 PM org.glassfish.jersey.server.ApplicationHandler
> > initialize
> > INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29
> > 01:25:26...
> > apache drill 1.4.0
> > "just drill it"
> > 0: jdbc:drill:zk=local> SELECT * FROM cp.`employee.json` LIMIT 3;
> > Dec 31, 2015 5:46:55 PM
> > org.apache.calcite.sql.validate.SqlValidatorException 
> > SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table
> > 'cp.employee.json' not found
> > Dec 31, 2015 5:46:55 PM org.apache.calcite.runtime.CalciteException
> 
> > SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
> > column 15 to line 1, column 16: Table 'cp.employee.json' not found
> > Error: VALIDATION ERROR: From line 1, column 15 to line 1, column 16:
> Table
> > 'cp.employee.json' not found
> >
> >
> > [Error Id: 9d6af232-fc40-40ec-8a2a-8d082923b776 on Funky:31010]
> > (state=,code=0)
> >
> > On Thu, Dec 31, 2015 at 5:34 PM, Peder Jakobsen | gmail <
> > pjakob...@gmail.com
> > > wrote:
> >
> > > OUCH that hurts.  Sometimes you get lucky when you jump in a haystack
> and
> > > get pricked by the missing needle right away.
> > >
> > > So I copied hive.sys.drill from my tmp folder in Linux to Windows, and
> > > wow, now it seems to work. Would you allow me the honour of fixing this
> > > bug and committing the changes, even though I have not programmed in Java
> > > since 2003? ;) I can't imagine that I/O in Java has changed all that much.
> > >
> > > Peder :)
> > >
> > >
> > >
> > > On Thu, Dec 31, 2015 at 5:30 PM, Peder Jakobsen | gmail <
> > > pjakob...@gmail.com> wrote:
> > >
> > >> No.  I've tried this many times.
> > >>
> > >> So at startup, it's supposed to create these files which it needs?
> > >>
> > >> It seems to do so in my linux install.   I will try to copy those
> files
> > >> over and see if I get another error.  Needle in a haystack debugging.
> > :)
> > >>
> > >> P.
> > >>
> > >> On Thu, Dec 31, 2015 at 5:27 PM, Nathan Griffith <
> ngriff...@dremio.com>
> > >> wrote:
> > >>
> > >>> Ah! Okay. I remember it making temporary stuff *somewhere* within
> > >>> Windows.
> > >>>
> > >>> If I recall correctly, a bad set of temporary files once gave me the
> > >>> exact same issue, which was fixed by deleting them. But apparently
> > >>> this isn't helping with your case?
> > >>>
> > >>> On Thu, Dec 31, 2015 at 2:22 PM, Peder Jakobsen | gmail
> > >>>  wrote:
> > >>> > OK, at startup, Drill creates an empty  file called hive.sys.drill
> > >>> that's
> > >>> > located in C:\tmp\drill\sys.storage_plugins
> > >>> >
> > >>> > Perhaps it's not surprising that we get "Unable to deserialize
> > >>> > "/tmp/drill/sys.storage_plugins/hive.sys.drill" (state=,code=0)"
> > >>> > considering that this file appears to be empty.
> > >>> >
> > >>> > On Linux, lots of stuff is included in this drill path:  profiles,
> > >>> > sys.options & sys.storage_plugins
> > >>> >
> > >>> > Hope this helps
> > >>> >
> > >>> > P.
> > >>> >
> > >>> > On Thu, Dec 31, 2015 at 5:13 PM, Peder Jakobsen | gmail <
> > >>> pjakob...@gmail.com
> > >>> >> wrote:
> > >>> >
> > >>> >> I deleted everything in  C:\Windows\Temp.  Note, when I start
> drill
> > 

Performance of Drill SQL for Hadoop when Drill is outside Hadoop cluster

2016-01-02 Thread Shashanka Kuntala
I have a use case where 100s of TB of data is in HDFS. Installing Drill on all
nodes of the HDFS cluster is not an option. If I have a separate Apache Drill
cluster (external to HDFS), how will Apache Drill SQL perform with large data
sets? Specifically, I would like to know whether Drill submits MapReduce jobs on
HDFS, or whether it extracts all the data from the HDFS cluster into the Drill
cluster before applying filters/joins. Will Drill push down SQL into HDFS?





Re: Performance of Drill SQL for Hadoop when Drill is outside Hadoop cluster

2016-01-02 Thread Ted Dunning
Tomer's answer was excellent, but he didn't address the pushdown question.

HDFS doesn't have enough smarts to allow pushdown of SQL predicates.  The
closest you can come is to use intelligent partitioning (your intelligence,
not that of HDFS, btw). In that case, Drill will simply skip reading the files
it can determine it doesn't need.
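
As a rough illustration, here is a minimal sketch of what that looks like over
JDBC, assuming a dfs storage plugin pointed at the HDFS data and a /logs/<year>/
directory layout (the path and column values are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PartitionPruningSketch {
      public static void main(String[] args) throws Exception {
        // Same JDBC URL as the sqlline session earlier in the thread.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement()) {
          // dir0 is Drill's implicit column for the first directory level under
          // the queried path; a filter on it lets Drill skip whole directories.
          try (ResultSet rs = stmt.executeQuery(
              "SELECT COUNT(*) FROM dfs.`/logs` WHERE dir0 = '2016'")) {
            while (rs.next()) {
              System.out.println("rows in 2016: " + rs.getLong(1));
            }
          }
        }
      }
    }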



On Sat, Jan 2, 2016 at 1:33 PM, Tomer Shiran  wrote:

> Drill will read the data directly from HDFS in parallel. The performance
> will depend on the size of the Drill cluster, the size of the HDFS cluster,
> and the network. Drill does not translate SQL into MapReduce (the only
> system that works that way is Hive - but that approach lends itself to much
> slower performance particularly for ad-hoc analysis).
>
>
> On Sat, Jan 2, 2016 at 12:28 PM, Shashanka Kuntala <
> shashankati...@yahoo.com.invalid> wrote:
>
> > I have a use-case where 100s of TB of data is in HDFS. Installing Drill
> on
> > all nodes of the HDFS is not an option.  If I have a separate Apache
> Drill
> > cluster (external to HDFS), how will  Apache Drill SQL perform with large
> > data sets ?  Specifically I would like to know if Drill submits MapReduce
> > jobs on HDFS or does Drill extract all data from HDFS cluster into Drill
> > cluster before applying filters/joins ? Will Drill pushdown SQL into
> HDFS ?
> >
> >
> >
> >
>
>
> --
> Tomer Shiran
> CEO and Co-Founder, Dremio
>


Re: Performance of Drill SQL for Hadoop when Drill is outside Hadoop cluster

2016-01-02 Thread Jason Altekruse
Hi Shashanka,

Drill does have the ability to avoid reading part of your data by using
partitioning. This currently works best using partitioned parquet files.
Drill includes an auto-partitioning feature available for use with the
CREATE TABLE AS statement that works when outputting to the parquet format.
Drill will read the metadata on parquet files during scans and apply your
filter predicates to the file statistics to avoid reading unneeded files. [1]
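
A minimal sketch of that over JDBC, assuming a writable dfs.tmp workspace; the
source path, table, and column names are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AutoPartitionCtasSketch {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement()) {
          // PARTITION BY only takes effect when the output format is parquet.
          stmt.execute("ALTER SESSION SET `store.format` = 'parquet'");
          // Drill writes separate parquet files per partition key value and keeps
          // column statistics it can later use to skip files during scans.
          stmt.execute(
              "CREATE TABLE dfs.tmp.`events_by_year` "
            + "PARTITION BY (event_year) AS "
            + "SELECT event_year, user_id, event_ts "
            + "FROM dfs.`/raw/events`");
        }
      }
    }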

If you have existing data, you can get the benefits of partitioning, but
you will need to write filter predicates based on directory names and
partition the data yourself. [2]

Drill cannot do anything like pushed-down joins; this isn't really possible
due to the random block placement of HDFS. That being said, we have gotten
some very good performance out of large joins. I would recommend trying out
your workload and looking at the docs on performance tuning. [3] [4]

[1] - https://drill.apache.org/docs/partition-by-clause/
[2] - https://drill.apache.org/docs/how-to-partition-data/
[3] - https://drill.apache.org/docs/performance-tuning-introduction/
[4] - https://drill.apache.org/docs/join-planning-guidelines/

On Sat, Jan 2, 2016 at 5:19 PM, Ted Dunning  wrote:

> Tomer's answer was excellent, but he didn't address this issue.
>
> HDFS doesn't have enough smarts to allow pushdown of SQL predicates.  The
> closest you can come is to use intelligent partitioning (your intelligence,
> not that of HDFS, btw). In that case Drill will avoid reading files that it
> can avoid reading.
>
>
>
> On Sat, Jan 2, 2016 at 1:33 PM, Tomer Shiran  wrote:
>
> > Drill will read the data directly from HDFS in parallel. The performance
> > will depend on the size of the Drill cluster, the size of the HDFS
> cluster,
> > and the network. Drill does not translate SQL into MapReduce (the only
> > system that works that way is Hive - but that approach lends itself to
> much
> > slower performance particularly for ad-hoc analysis).
> >
> >
> > On Sat, Jan 2, 2016 at 12:28 PM, Shashanka Kuntala <
> > shashankati...@yahoo.com.invalid> wrote:
> >
> > > I have a use-case where 100s of TB of data is in HDFS. Installing Drill
> > on
> > > all nodes of the HDFS is not an option.  If I have a separate Apache
> > Drill
> > > cluster (external to HDFS), how will  Apache Drill SQL perform with
> large
> > > data sets ?  Specifically I would like to know if Drill submits
> MapReduce
> > > jobs on HDFS or does Drill extract all data from HDFS cluster into
> Drill
> > > cluster before applying filters/joins ? Will Drill pushdown SQL into
> > HDFS ?
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Tomer Shiran
> > CEO and Co-Founder, Dremio
> >
>