Re: Drill Performance

2016-07-07 Thread Abdel Hakim Deneche
I'm not sure you'll get any performance improvement from running more than
a single drillbit per cluster node.

On Thu, Jul 7, 2016 at 9:47 AM, scott  wrote:

> Follow up question: Is there a sweet spot for DRILL_MAX_DIRECT_MEMORY and
> DRILL_HEAP settings?
>
> On Wed, Jul 6, 2016 at 2:42 PM, scott  wrote:
>
> > Hello,
> > Does anyone know if there is a maximum number of drillbits recommended in
> > a Drill cluster? For example, I've observed that in a Solr Cloud, the
> > performance tapers off for ingest at around 16 JVM instances. Is there a
> > similar practical limitation to the number of drillbits I should cluster
> > together?
> >
> > Thanks,
> > Scott
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Performance issues when query sorts and retrieves all columns

2016-07-07 Thread Abdel Hakim Deneche
Assuming all your queries contain a LIMIT: when there is no ORDER BY, Drill
doesn't need to read all rows to produce the results, but with ORDER BY there
is no alternative but to read all 11M rows from disk.
Parquet is a columnar format, so Drill is able to read only the columns you
selected; that's why you still get decent response times when you select only
a couple of columns. Also, the fewer columns you select, the less data needs
to be sent over the network.
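
As a rough illustration (the column names below are just placeholders for your
own schema), comparing these two forms should show the difference:

-- wide sort: all 50 columns of the ~11M rows are read and shuffled
SELECT * FROM table ORDER BY columng LIMIT 1;

-- narrow sort: only the projected columns are read from the Parquet files
SELECT id, columng FROM table ORDER BY columng LIMIT 1;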

How many Parquet files do you have? Drill will try to run multiple Parquet
reading threads, but it can only do so if you have multiple Parquet files in
your dataset.

Sharing the query profile would definitely help us give more specific advice
about how to improve the performance of your query. For instance, do you see
any excessive wait time in PARQUET_ROW_GROUP_SCAN or in any of the senders?

Thanks

On Thu, Jul 7, 2016 at 2:22 AM, Nikos Livathinos  wrote:

> Hi all,
>
> I am really excited about Apache Drill its easiness to bring SQL on top of
> different storage technologies. I am in the phase of learning/evaluating
> Apache Drill and I have come up with a case where the performance drops
> significantly. Therefore, I would like to share with you my results and
> get hints about how to improve performance.
>
> I have installed Drill in a cluster of 12 nodes and I have assigned 8GB
> for Drill per node.The main steps of our data pipeline are:
> 1. Import data on HDFS as Parquet files with Sqoop. For the evaluation
> tests I have a dataset of Parquet files with ~11M rows  and 50 columns.
> The total size is ~1GB.
> 2. Query Parquet files with Drill.
>
> I have tried different types of queries and even in very complicated ones
> the response time is around or less than 5sec. However I have noticed that
> the response time rises to ~80sec if I try queries which have the
> following 2 characteristics:
> 1. Sort the resultset (ORDER BY)
> 2. Get all columns
>
> For example a query with the following pattern:
>
> SELECT *
> FROM table
> ORDER BY columng
> LIMIT 1;
>
> It is interesting that the more columns I put in the select clause the
> more time it needs to respond. If I don't sort or if I get just a couple
> of columns then the response time drops from ~80s to ~3s.
> Please notice that I limit the resultset to 1 row, in order to avoid
> network traffic delays.
>
> I have checked the Query Profiler and the most time consuming operations
> are:
> HASH_PARTITION_SENDER with Avg Process Time: 38sec
> PARQUET_ROW_GROUP_SCAN with Avg Process Time: 42sec
>
> Do you have any idea how I can improve performance in the case of my query
> (if you like I can also provide a Full Json Profile).
>
> Thanks,
> Nikos
>
>
>


-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Number of records per batch

2016-07-05 Thread Abdel Hakim Deneche
It depends on the data you are querying. For JSON you could change the value
of JSONRecordReader.DEFAULT_ROWS_PER_BATCH, which defaults to 4096, but this
will only affect the size of the batches produced by the reader; other
operators may still alter the batch size.

On Tue, Jul 5, 2016 at 7:30 PM, Eric Fukuda <e.s.fuk...@gmail.com> wrote:

> Thanks Abdel. Looking at the code, it looks like the maximum number of
> records in a batch is 64k. I suspect the reason I'm having only 4k is that
> it reached the capacity of the buffer in the batch. Is there a way to
> relieve this capacity restriction? It doesn't have to be a configuration
> option. I don't mind changing and compiling the code.
>
> On Tue, Jul 5, 2016 at 8:55 PM, Abdel Hakim Deneche <adene...@maprtech.com
> >
> wrote:
>
> > Unfortunately I don't think there is way to do it.
> >
> > On Tue, Jul 5, 2016 at 3:58 PM, Eric Fukuda <e.s.fuk...@gmail.com>
> wrote:
> >
> > > I'm trying to see how performance differs with different batch sizes.
> My
> > > table has 13 integer fields and 1 string field, and has 8M records.
> > > Following the code with a debugger, there seem to be 4096 records in a
> > > batch. Can this be 8192 or larger?
> > >
> > > On Tue, Jul 5, 2016 at 6:47 PM, Abdel Hakim Deneche <
> > adene...@maprtech.com
> > > >
> > > wrote:
> > >
> > > > hey Eric,
> > > >
> > > > Can you give more information about what you are trying to achieve ?
> > > >
> > > > Thanks
> > > >
> > > > On Tue, Jul 5, 2016 at 3:41 PM, Eric Fukuda <e.s.fuk...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Does anyone know if there is a way to increase or specify the
> number
> > of
> > > > > records per batch manually?
> > > > >
> > > > > Thanks,
> > > > > Eric
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Abdelhakim Deneche
> > > >
> > > > Software Engineer
> > > >
> > > >   <http://www.mapr.com/>
> > > >
> > > >
> > > > Now Available - Free Hadoop On-Demand Training
> > > > <
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training


Re: Number of records per batch

2016-07-05 Thread Abdel Hakim Deneche
Unfortunately I don't think there is a way to do it.

On Tue, Jul 5, 2016 at 3:58 PM, Eric Fukuda <e.s.fuk...@gmail.com> wrote:

> I'm trying to see how performance differs with different batch sizes. My
> table has 13 integer fields and 1 string field, and has 8M records.
> Following the code with a debugger, there seem to be 4096 records in a
> batch. Can this be 8192 or larger?
>
> On Tue, Jul 5, 2016 at 6:47 PM, Abdel Hakim Deneche <adene...@maprtech.com
> >
> wrote:
>
> > hey Eric,
> >
> > Can you give more information about what you are trying to achieve ?
> >
> > Thanks
> >
> > On Tue, Jul 5, 2016 at 3:41 PM, Eric Fukuda <e.s.fuk...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > Does anyone know if there is a way to increase or specify the number of
> > > records per batch manually?
> > >
> > > Thanks,
> > > Eric
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training


Re: Number of records per batch

2016-07-05 Thread Abdel Hakim Deneche
Hey Eric,

Can you give more information about what you are trying to achieve?

Thanks

On Tue, Jul 5, 2016 at 3:41 PM, Eric Fukuda  wrote:

> Hi,
>
> Does anyone know if there is a way to increase or specify the number of
> records per batch manually?
>
> Thanks,
> Eric
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Initial Feed Back on 1.7.0 Release

2016-07-05 Thread Abdel Hakim Deneche
answers inline.

On Tue, Jul 5, 2016 at 8:39 AM, John Omernik  wrote:

> Working with the 1.7.0, the feature that I was very interested in was the
> fixing of the Metadata Caching while using user impersonation.
>
> I have a large table, with a day directory that can contain up to 1000
> parquet files each.
>
>
> Planning was getting terrible on this table as I added new data, and the
> metadata cache wasn't an option for me because of impersonation.
>
> Well now will 1.7.0 that's working, and it makes a HUGE difference. A query
> that would take 120 seconds now takes 20 seconds.   Etc.
>
> Overall, this is a great feature and folks should look into it for
> performance of large Parquet tables.
>
> Some observations that I would love some help with.
>
> 1. Drill "Seems" to know when a new subdirectory was added and it generates
> the metadata for that directory with the missing data. This is without
> another REFRESH TABLE METADATA command.  That works great for new
> directories, however, what happens if you just copy new files into an
> existing directory? Will it use the metadata cache that only lists the old
> files. or will things get updated? I guess, how does it know things are in
> sync?
>

When you query a folder A that contains a metadata cache, Drill will check
all of its sub-directories' last modification times to figure out if anything
changed since the last time the metadata cache was refreshed. If data was
added/removed, Drill will refresh the metadata cache for folder A.


> 2.  Pertaining to point 1, when new data was added, the first query that
> used that directory partition, seemed to write the metadata file. However,
> the second query ran ALSO rewrote the file (and it ran with the speed of an
> uncached directory).  However, the third query was now running at cached
> speeds. (the 20 seconds vs. 120 seconds).  This seems odd, but maybe there
> is an reason?
>

Unfortunately, the current implementation of the metadata cache doesn't
support incremental refresh, so each time Drill detects a change inside the
folder it will run a "full" metadata cache refresh before running the query.
That is why your second query took so long to finish.


> 3. Is Drill ok with me running REFRESH TABLE METADATA only for
> subdirectory?  So if I load a day, can I issue REFRESH TABLE METADATA
> `mytable/2016-07-04`  and have things be all where drill is happy?  I.e.
> does the mytable metadata need to be updated as well or is that wasted
> cycles?
>

Drill keeps a metadata cache file for every subdirectory of your table, so
you'll end up with a cache file in "mytable" and another one in
"mytable/2016-07-04".
I'm not sure about the following, and other developers will correct me soon
enough, but my understanding is that you can run a refresh command on the
subfolder and it will only cause that particular cache (and those of its
subfolders) to be updated; it won't touch the cache file on "mytable" or any
of its other subfolders.
Also, as long as you only query this particular day, Drill won't detect the
change and won't try to update any other metadata cache, but as soon as you
query "mytable" Drill will figure out things have changed and it will trigger
a full refresh of the table.
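
For reference, the two variants would look something like this (assuming a
"dfs" workspace; adjust the storage plugin and paths to your setup):

-- rebuild the cache for the whole table (covers all subdirectories)
REFRESH TABLE METADATA dfs.`mytable`;

-- rebuild the cache for a single day's subdirectory only
REFRESH TABLE METADATA dfs.`mytable/2016-07-04`;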


> 4.  Discussion: perhaps we could compress the metadata file? Each day (for
> me) has 8.2 mb of data, and the file at the root of my table has 332mb of
> data. Just using standard gzip/gunzip I got the 332mb file to 11 mb. That
> seems like an improvement, however, not knowing how this file is
> used/updated compression may add lag.
>

There are definitely other ways we could store the metadata cache files.
Compression is one of them, but we also want any alternative to make it
easier to run incremental metadata refreshes.


> 5. Any other thoughts/suggestions?
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Parquet Block Size Detection

2016-07-01 Thread Abdel Hakim Deneche
Just make sure you enable Parquet metadata caching; otherwise, the more files
you have, the more time Drill will spend reading the metadata from every
single file.

On Fri, Jul 1, 2016 at 11:17 AM, John Omernik  wrote:

> In addition
> 7. Generally speaking, keeping number of files low, will help in multiple
> phases of planning/execution. True/False
>
>
>
> On Fri, Jul 1, 2016 at 12:56 PM, John Omernik  wrote:
>
> > I looked at that, and both the meta and schema options didn't provide me
> > block size.
> >
> > I may be looking at parquet block size wrong, so let me toss out some
> > observations, and inferences I am making, and then others who know the
> > spec/format can confirm or correct.
> >
> > 1. The block size in parquet is NOT file size. A Parquet file can have
> > multiple blocks in a single file? (Question: when this occurs, do the
> > blocks then line up with DFS block size/chunk size as recommended, or do
> we
> > get weird issues?) In practice, do writes aim for 1 block per file?
> > 2. The block size, when writing is computed prior to compression. This is
> > an inference based on the parquet-mr library.  A job that has a parquet
> > block size of 384mb seems to average files of around 256 mb in size.
> Thus,
> > my theory is that the amount of data in parquet block size is computed
> > prior to write, and then as the file is written compression is applied,
> > thus ensuring that the block size (and file size if 1 is not true, or if
> > you are just writing a single file) will be under the dfs.block size if
> you
> > make both settings the same.
> > 3. Because of 2, setting dfs.blocksize = parquet blocksize is a good
> rule,
> > because the files will always be under the dfsblock size with
> compression,
> > ensuring you don't have cross block reads happening.  (You don't have to,
> > for example, set the parquet block size to be less then dfs block size to
> > ensure you don't have any weird issues)
> > 4.  Also because of 2, with compression enabled, you don't need any slack
> > space for file headers or footers to ensure the files don't cross DFS
> > blocks.
> > 5. In general larger dfs/parquet block sizes will be good for reader
> > performance, however, as you start to get larger, write memory demands
> > increase.  True/False?  In general does a larger block size also put
> > pressures on Reader memory?
> > 6. Any other thoughts/challenges on block size?  When talking about
> > hundreds/thousands of GB of data, little changes in performance like with
> > block size can make a difference.  I am really interested in tips/stories
> > to help me understand better.
> >
> > John
> >
> >
> >
> > On Fri, Jul 1, 2016 at 12:26 PM, Parth Chandra 
> > wrote:
> >
> >> parquet-tools perhaps?
> >>
> >> https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
> >>
> >>
> >>
> >> On Fri, Jul 1, 2016 at 5:39 AM, John Omernik  wrote:
> >>
> >> > Is there any way, with Drill or with other tools, given a Parquet
> file,
> >> to
> >> > detect the block size it was written with?  I am copying data from one
> >> > cluster to another, and trying to determine the block size.
> >> >
> >> > While I was able to get the size by asking the devs, I was wondering,
> is
> >> > there any way to reliably detect it?
> >> >
> >> > John
> >> >
> >>
> >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Parquet Block Size Detection

2016-07-01 Thread Abdel Hakim Deneche
some answers inline:

On Fri, Jul 1, 2016 at 10:56 AM, John Omernik  wrote:

> I looked at that, and both the meta and schema options didn't provide me
> block size.
>
> I may be looking at parquet block size wrong, so let me toss out some
> observations, and inferences I am making, and then others who know the
> spec/format can confirm or correct.
>
> 1. The block size in parquet is NOT file size. A Parquet file can have
> multiple blocks in a single file? (Question: when this occurs, do the
> blocks then line up with DFS block size/chunk size as recommended, or do we
> get weird issues?) In practice, do writes aim for 1 block per file?
>

Drill always writes one row group per file.
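
On the Drill side, the row group size used by Drill's own writer is controlled
by an option; a sketch, assuming Drill 1.x option names (the value is in
bytes, 536870912 = 512 MB):

-- check the current Parquet block (row group) size used by Drill's writer
SELECT name, num_val FROM sys.options WHERE name = 'store.parquet.block-size';

-- change it for the current session before running a CTAS
ALTER SESSION SET `store.parquet.block-size` = 536870912;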


> 2. The block size, when writing is computed prior to compression. This is
> an inference based on the parquet-mr library.  A job that has a parquet
> block size of 384mb seems to average files of around 256 mb in size. Thus,
> my theory is that the amount of data in parquet block size is computed
> prior to write, and then as the file is written compression is applied,
> thus ensuring that the block size (and file size if 1 is not true, or if
> you are just writing a single file) will be under the dfs.block size if you
> make both settings the same.
> 3. Because of 2, setting dfs.blocksize = parquet blocksize is a good rule,
> because the files will always be under the dfsblock size with compression,
> ensuring you don't have cross block reads happening.  (You don't have to,
> for example, set the parquet block size to be less then dfs block size to
> ensure you don't have any weird issues)
> 4.  Also because of 2, with compression enabled, you don't need any slack
> space for file headers or footers to ensure the files don't cross DFS
> blocks.
> 5. In general larger dfs/parquet block sizes will be good for reader
> performance, however, as you start to get larger, write memory demands
> increase.  True/False?  In general does a larger block size also put
> pressures on Reader memory?
>

We already know the writer will use more heap if you have larger block sizes.
I believe the current implementation of the reader won't necessarily use more
memory, as it always tries to read a specific number of rows at a time (not
sure though).


> 6. Any other thoughts/challenges on block size?  When talking about
> hundreds/thousands of GB of data, little changes in performance like with
> block size can make a difference.  I am really interested in tips/stories
> to help me understand better.
>
> John
>
>
>
> On Fri, Jul 1, 2016 at 12:26 PM, Parth Chandra 
> wrote:
>
> > parquet-tools perhaps?
> >
> > https://github.com/Parquet/parquet-mr/tree/master/parquet-tools
> >
> >
> >
> > On Fri, Jul 1, 2016 at 5:39 AM, John Omernik  wrote:
> >
> > > Is there any way, with Drill or with other tools, given a Parquet file,
> > to
> > > detect the block size it was written with?  I am copying data from one
> > > cluster to another, and trying to determine the block size.
> > >
> > > While I was able to get the size by asking the devs, I was wondering,
> is
> > > there any way to reliably detect it?
> > >
> > > John
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Information about ENQUEUED state in Drill

2016-07-01 Thread Abdel Hakim Deneche
Most likely planning is taking longer to finish. Once it's done, the query
should move to either ENQUEUED if queuing is enabled, or RUNNING if it is
disabled.

One easy way to confirm whether planning is indeed taking too long is to just
run an "EXPLAIN PLAN FOR <your query>" and see how long it takes to finish.

On Fri, Jul 1, 2016 at 6:49 AM, John Omernik  wrote:

> Interestingly enough, when I disable queuing, the query sits in the
> "STARTING" phase for the same amount of time it would sit in ENQUEUING if
> queuing was enabled.  Excessive planning?
>
> When looking at the UI, how can I validate this?
>
>
>
> On Fri, Jul 1, 2016 at 8:14 AM, John Omernik  wrote:
>
> > I don't see that, but here's a question, when it's enqueued, it must have
> > to do some level of planning before determining which queue it's going to
> > fall into ... correct?  I wonder if that planning takes to long, if
> that's
> > what's causing the enqueued state?
> >
> >
> >
> > On Thu, Jun 30, 2016 at 1:09 PM, Parth Chandra 
> > wrote:
> >
> >> The queue that the queries are put in is determined by the cost
> calculated
> >> by the optimizer. So in Qiang's case, it might be that the cost
> >> calculation
> >> might be causing the query to be put in the large query queue.
> >>
> >> You can check the cost of the query in the query profile and compare
> with
> >> the value of the QUEUE_THRESHOLD_SIZE setting (exec.queue.threshold) to
> >> see
> >> which queue the query is being put in.
> >>
> >> A single query staying enqueued for 30 seconds sounds really wrong.
> >> Putting
> >> a query in either queue requires getting a distributed semaphore (via
> >> zookeeper) and it is possible this is taking too long which is why the
> >> enqueuing may be taking really long.
> >>
> >> Do you see any messages in the logs about timeouts while enqueuing?
> >>
> >>
> >>
> >>
> >> On Thu, Jun 30, 2016 at 6:46 AM, John Omernik  wrote:
> >>
> >> > Thanks Parth.
> >> >  As I stated in, there are no other jobs running in the cluster when
> >> this
> >> > happens.  I do have queueing enabled, however, with no other jobs
> >> running,
> >> > why would any single job sit in the ENQUEUED state for 30 seconds?
> This
> >> > seems to be an issue or am I missing something?
> >> >
> >> > I would really like to use queueing as this is a multi-tenant cluster,
> >> so I
> >> > don't want to remove it all together.
> >> >
> >> > John
> >> >
> >> > On Wed, Jun 29, 2016 at 10:57 PM, qiang li 
> >> wrote:
> >> >
> >> > > I have the same doult.
> >> > >
> >> > > I set the queue.threshold to 5000, queue.large to 20 and the
> >> > > queue.small to 200. But when I query with about 100 small querys
> >> > > concurrently, most of them are ENQUEUED.
> >> > >
> >> > > If I turn off the queue, it will query fast. If turn on the queue ,
> >> our
> >> > > querys will speed about 7 seconds, while only take 2 to 3 seconds
> if I
> >> > turn
> >> > > off queue.
> >> > >
> >> > > Currently , we turn off the queue and limit the querys at client
> side.
> >> > >
> >> > > 2016-06-30 6:19 GMT+08:00 Parth Chandra :
> >> > >
> >> > > > I would guess you have queueing enabled. With queueing enabled,
> >> only a
> >> > > max
> >> > > > number of queries will be actually running and the rest will wait
> >> in an
> >> > > > ENQUEUED state.
> >> > > >
> >> > > > There are two queues: one for large queries and one for small
> >> queries.
> >> > > You
> >> > > > can change their size with the following parameters -
> >> > > > exec.queue.large
> >> > > > exec.queue.small
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Wed, Jun 29, 2016 at 1:51 PM, John Omernik 
> >> > wrote:
> >> > > >
> >> > > > > I have some jobs that will stay in an ENQUEUED state for what I
> >> think
> >> > > to
> >> > > > be
> >> > > > > an excessive amount of time.  (No other jobs running on the
> >> cluster,
> >> > > the
> >> > > > > ENQUEUED state lasted for 30 seconds) . What causes this? Is it
> >> > > planning
> >> > > > > when it's in this state? Any information about this would be
> >> helpful.
> >> > > > >
> >> > > > > John
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Drill Plugin Update

2016-06-21 Thread Abdel Hakim Deneche
SkipFirstLine is an optional parameter with a default value of "false", so
when the parameter "disappears" it's actually equivalent to setting it to
false.


On Tue, Jun 21, 2016 at 4:30 AM, Kumar Anil7/DEL/TCS 
wrote:

>
> Hi,
>
>
> I am using drill 1.4.0 in MapR5.1 cluster. I am trying to update value of
> "skipFirstLine" from true to false, under csv format in dfs storage plugin
> from web UI, it shows success message but some how "skipFirstLine"
> disappered.
>
>
> Content of dfs plugin for csv format is below: "csv": {
> "type": "text",
> "extensions": [
> "csv"
> ],
> "skipFirstLine": true,
> "extractHeader": true,
> "delimiter": ","
> },
>
> Please help me to resolve the issue.
>
> --
> Regards,
>
> Anil Kumar (671075)
> Digital Enterprise - Analytics, Big Data and Information Management
> Mobile 91- 8588899595
> Tata Consultancy Services
> Mailto: kumar.an...@tcs.com
> Website: http://www.tcs.com
>
>
> =-=-=
> Notice: The information contained in this e-mail
> message and/or attachments to it may contain
> confidential or privileged information. If you are
> not the intended recipient, any dissemination, use,
> review, distribution, printing or copying of the
> information contained in this e-mail message
> and/or attachments to it are strictly prohibited. If
> you have received this communication in error,
> please notify us by reply e-mail or telephone and
> immediately and permanently delete the message
> and any attachments. Thank you
>
>
>


-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Memory Settings for a Non-Sorted Failed Query

2016-06-13 Thread Abdel Hakim Deneche
Running out of heap could also make a drillbit become unresponsive;
eventually it will die after printing the following message in its
drillbit.out:

Unable to handle out of memory condition in FragmentExecutor

You may want to check your drillbits' drillbit.out for such a message.

On Mon, Jun 13, 2016 at 4:27 PM, John Omernik  wrote:

> I'd like to talk about that on the hangout.  Drill should do better at
> failing with a clean oom error rather then having a bit go unresponsive.
> Can just that bit be restarted to return to a copacetic state? As an admin,
> if this is the case, how do I find this bit?
>
> Other than adding RAM, are there any query tuning settings that could help
> prevent the unresponsive bit? ( I see this as two issues, the memory
> settings for the 1024m block size CTAS and the how can we prevent a bit
> from going unresponsive? )
> On Jun 13, 2016 6:19 PM, "Parth Chandra"  wrote:
>
> The only time I've seen a drillbit get unresponsive is when you run out of
> Direct memory. Did you see any 'Out of Memory Error' in your logs? If you
> see those then you need to increase the Direct memory setting for the JVM.
> ( DRILL_MAX_DIRECT_MEMORY in drill-env.sh)
>
>
>
>
> On Mon, Jun 13, 2016 at 4:10 PM, John Omernik  wrote:
>
> > The 512m block size worked.  My issue with the 1024m block size was on
> the
> > writing using a CTAS that's where my nodes got into a bad
> statethus
> > I am wondering what setting on drill would be the right setting to help
> > node memory pressures on a CTAs using 1024m block size
> > On Jun 13, 2016 6:06 PM, "Parth Chandra"  wrote:
> >
> > In general, you want to make the Parquet block size and the HDFS block
> size
> > the same. A Parquet block size that is larger than the HDFS block size
> can
> > split a Parquet block ( i.e. row_group ) across nodes and that will
> > severely affect performance as data reads will no longer be local. 512 MB
> > is a pretty good setting.
> >
> > Note that you need to ensure the Parquet block size in the source file
> > which (maybe) was produced outside of Drill. So you will need to make the
> > change in the application used to write the Parquet file.
> >
> > If you're using Drill to write the source file as well then, of course,
> the
> > block size setting will be used by the writer.
> >
> > If you're using the new reader, then there is really no knob you have to
> > tweak. Is parquet-tools able to read the file(s)?
> >
> >
> >
> > On Mon, Jun 13, 2016 at 1:59 PM, John Omernik  wrote:
> >
> > > I am doing some performance testing, and per the Impala documentation,
> I
> > am
> > > trying to use a block size of 1024m in both Drill and MapR FS.  When I
> > set
> > > the MFS block size to 512 and the Drill (default) block size I saw some
> > > performance improvements, and wanted to try the 1024 to see how it
> > worked,
> > > however, my query hung and I got into that "bad state" where the nodes
> > are
> > > not responding right and I have to restart my whole cluster (This
> really
> > > bothers me that a query can make the cluster be unresponsive)
> > >
> > > That said, what memory settings can I tweak to help the query work.
> This
> > is
> > > quite a bit of data, a CTAS from Parquet to Parquet, 100-130G of data
> per
> > > data (I am doing a day at a time), 103 columns.   I have to use the
> > > "use_new_reader" option due to my other issues, but other than that I
> am
> > > just setting the block size on MFS and then updating the block size in
> > > Drill, and it's dying. Since this is a simple CTAS (no sort) which
> > settings
> > > can be beneficial for what is happening here?
> > >
> > > Thanks
> > >
> > > John
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: How to specify Drill JDBC connection timeout or JDBC Query timeout

2016-06-03 Thread Abdel Hakim Deneche
You should file a JIRA asking for this to be implemented. At least that way
it will be visible to the developers.

Thanks

On Fri, Jun 3, 2016 at 1:01 PM, Hao Zhu <h...@maprtech.com> wrote:

> Thanks Hakim.
> Seems the test framework is using JAVA Thread level timeout.
> Is there any chance we implement JDBC api level timeout?
>
> Thanks,
> Hao
>
> On Thu, Jun 2, 2016 at 6:42 PM, Abdel Hakim Deneche <adene...@maprtech.com
> >
> wrote:
>
> > For connection timeout, there are configuration options that you can set
> in
> > drill-override.conf that affect how much time the Drill client will try
> to
> > connect to the server, but even then the client could actually block
> > forever (I've seen it happen on an internal tool).
> >
> > Drill test framework has a nice mechanism to cancel a query after a
> certain
> > time, you could use it as an inspiration for your own tool:
> >
> >
> >
> https://github.com/mapr/drill-test-framework/blob/master/framework/src/main/java/org/apache/drill/test/framework/CancelingExecutor.java
> >
> >
> > On Thu, Jun 2, 2016 at 6:33 PM, Hao Zhu <h...@maprtech.com> wrote:
> >
> > > Hi Team,
> > >
> > > I am trying to create a java code to test the health of each drillbit.
> > > The goal is to use JDBC connection logon each drillbit and run a simple
> > > query.
> > >
> > > However I could not find the way to set either connection timeout or
> > query
> > > timeout.
> > > I checked below api pages:
> > >
> > >
> > >
> >
> https://drill.apache.org/api/1.2/jdbc/org/apache/drill/jdbc/DrillConnection.html
> > >
> > > *setNetworkTimeout*
> > > <
> > >
> >
> https://drill.apache.org/api/1.2/jdbc/org/apache/drill/jdbc/DrillConnection.html#setNetworkTimeout(java.util.concurrent.Executor,%20int)
> > > >
> > > (*Executor*
> > > <
> > >
> >
> http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Executor.html?is-external=true
> > > >
> > > executor,
> > > int milliseconds)
> > >
> > > *Drill*: Not supported (for non-zero timeout value).
> > >
> > >
> >
> https://drill.apache.org/api/1.2/jdbc/org/apache/drill/jdbc/DrillStatement.html
> > > <
> > >
> >
> https://drill.apache.org/api/1.2/jdbc/org/apache/drill/jdbc/DrillStatement.html
> > > >
> > >
> > > *setQueryTimeout*
> > > <
> > >
> >
> https://drill.apache.org/api/1.2/jdbc/org/apache/drill/jdbc/DrillStatement.html#setQueryTimeout(int)
> > > >
> > > (int milliseconds)
> > >
> > > *Drill*: Not supported (for non-zero timeout value).
> > >
> > > Any suggestions?
> > >
> > > Thanks,
> > > Hao
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training


Re: CTAS on MySQL*

2016-06-03 Thread Abdel Hakim Deneche
The MySQL storage plugin is indeed read-only; you cannot create a table in
MySQL through Drill.

Thanks

On Fri, Jun 3, 2016 at 10:05 AM, Shankar Mane 
wrote:

> As we know, we can able to read MySQL data using drill mysql storage
> plugin. But it seems it is Read Only and No Write permissions .
>
> Can we create table (CTAS) on mysql ?
>
> If it is possible, We can able to configured Read/Write permissions at
> mysql Storage plugin Level. Storage plgins like dfs, hive has this setting.
>
> * Doing CTAS on mysql and Reading from MySQL is very costly and it might
> also defeat the purpose of distributed storage and multi-node executions
>
> regards,
> shankar
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: How to specify Drill JDBC connection timeout or JDBC Query timeout

2016-06-02 Thread Abdel Hakim Deneche
For the connection timeout, there are configuration options that you can set
in drill-override.conf that affect how long the Drill client will try to
connect to the server, but even then the client could actually block forever
(I've seen it happen on an internal tool).

The Drill test framework has a nice mechanism to cancel a query after a
certain time; you could use it as inspiration for your own tool:

https://github.com/mapr/drill-test-framework/blob/master/framework/src/main/java/org/apache/drill/test/framework/CancelingExecutor.java


On Thu, Jun 2, 2016 at 6:33 PM, Hao Zhu  wrote:

> Hi Team,
>
> I am trying to create a java code to test the health of each drillbit.
> The goal is to use JDBC connection logon each drillbit and run a simple
> query.
>
> However I could not find the way to set either connection timeout or query
> timeout.
> I checked below api pages:
>
>
> https://drill.apache.org/api/1.2/jdbc/org/apache/drill/jdbc/DrillConnection.html
>
> *setNetworkTimeout*
> <
> https://drill.apache.org/api/1.2/jdbc/org/apache/drill/jdbc/DrillConnection.html#setNetworkTimeout(java.util.concurrent.Executor,%20int)
> >
> (*Executor*
> <
> http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Executor.html?is-external=true
> >
> executor,
> int milliseconds)
>
> *Drill*: Not supported (for non-zero timeout value).
>
> https://drill.apache.org/api/1.2/jdbc/org/apache/drill/jdbc/DrillStatement.html
> <
> https://drill.apache.org/api/1.2/jdbc/org/apache/drill/jdbc/DrillStatement.html
> >
>
> *setQueryTimeout*
> <
> https://drill.apache.org/api/1.2/jdbc/org/apache/drill/jdbc/DrillStatement.html#setQueryTimeout(int)
> >
> (int milliseconds)
>
> *Drill*: Not supported (for non-zero timeout value).
>
> Any suggestions?
>
> Thanks,
> Hao
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Guidelines for planner.memory.max_query_memory_per_node

2016-06-02 Thread Abdel Hakim Deneche
This option controls the maximum amount of memory each query can allocate for
its sort operators on every drillbit. The higher this value, the less likely
your query will spill to disk and the faster it will finish.
Because it's per query, if you set it to, say, 10% of your total available
direct memory and then run 10 queries in parallel that all use sort, you may
run out of memory, as nothing will be left for the other operators.

I guess you can set it to some high value, and if you start seeing queries
that use sort running out of memory, then you can lower it or run fewer
queries in parallel.
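
As an example only (the 16GB figure comes from your own "x 8" suggestion
below, not from any official guideline), the option is set in bytes:

-- 17179869184 bytes = 16 GB per query, per node, for sort operators
ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 17179869184;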

Thanks

On Wed, Jun 1, 2016 at 2:00 PM, John Omernik <j...@omernik.com> wrote:

> So for my Parquet issues it will not likely make a difference, (It appears
> to be heap related and/or parquet writer related)  Still, I would be very
> interested in guidelines here, keeping it at 2GB with such beefy nodes
> seems to be a waste.
>
> John
>
> On Wed, Jun 1, 2016 at 3:38 PM, Abdel Hakim Deneche <adene...@maprtech.com
> >
> wrote:
>
> > I don't know about any specific guidelines for this options, but what I
> > know is that it only affects the sort operator, and it's related to
> direct
> > memory not heap memory.
> >
> >
> >
> > On Wed, Jun 1, 2016 at 1:20 PM, John Omernik <j...@omernik.com> wrote:
> >
> > > I am reposting this question here as well. (I posted on the MapR
> > Community
> > > forums).
> > >
> > > The default as I understand it, for the setting
> > > planner.memory.max_query_memory_per_node
> > > is 2G.  The default heap memory settings in drill-env.sh is 4G and the
> > > default Direct memory is 8G.
> > >
> > > I guess, is there any advice on where I should set my
> > > planner.memory.max_query_memory_per_node
> > > as the other numbers scale? I.e. does this setting coordinate more with
> > > heap or direct or both? If I double my direct mem, should I double the
> > > setting? Are there any guidelines or methods for tuning this?
> > >
> > > I am currently running bits at 24 GB of Heap and 84GB of Direct,
> should I
> > > take the planner.memory.max_query_memory_per_node and x 8?  to put 16G?
> > > Thoughts?
> > >
> > > Thanks!
> > >
> > > John
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training


Re: queries take over 2 min

2016-06-01 Thread Abdel Hakim Deneche
Sometimes, if you have an issue in one of your storage plugins, it affects
all queries, even those not querying that specific plugin. Do you have any
enabled storage plugin that's causing issues?

On Wed, Jun 1, 2016 at 2:21 PM, Scott Kinney  wrote:

> i'm running queries on local json files and queries take over 2 min. I'm
> running simple drill-embeded install on e2 t2.large. cpu and memory
> utilization is very low while the query is running. even 'alter session
> set' command takes minutes.
>
>
>
> 
> Scott Kinney | DevOps
> stem    |   m  510.282.1299
> 100 Rollins Road, Millbrae, California 94030
>
> This e-mail and/or any attachments contain Stem, Inc. confidential and
> proprietary information and material for the sole use of the intended
> recipient(s). Any review, use or distribution that has not been expressly
> authorized by Stem, Inc. is strictly prohibited. If you are not the
> intended recipient, please contact the sender and delete all copies. Thank
> you.
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Guidelines for planner.memory.max_query_memory_per_node

2016-06-01 Thread Abdel Hakim Deneche
I don't know about any specific guidelines for this option, but what I know
is that it only affects the sort operator, and it's related to direct memory,
not heap memory.



On Wed, Jun 1, 2016 at 1:20 PM, John Omernik  wrote:

> I am reposting this question here as well. (I posted on the MapR Community
> forums).
>
> The default as I understand it, for the setting
> planner.memory.max_query_memory_per_node
> is 2G.  The default heap memory settings in drill-env.sh is 4G and the
> default Direct memory is 8G.
>
> I guess, is there any advice on where I should set my
> planner.memory.max_query_memory_per_node
> as the other numbers scale? I.e. does this setting coordinate more with
> heap or direct or both? If I double my direct mem, should I double the
> setting? Are there any guidelines or methods for tuning this?
>
> I am currently running bits at 24 GB of Heap and 84GB of Direct, should I
> take the planner.memory.max_query_memory_per_node and x 8?  to put 16G?
> Thoughts?
>
> Thanks!
>
> John
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Profiles Gone in Web UI: The great profile heist

2016-05-31 Thread Abdel Hakim Deneche
Are you storing the profiles in a local folder or on NFS?

On Tue, May 31, 2016 at 12:49 PM, John Omernik  wrote:

> I am scratching my head at this one... I made some minor changes to my
> drill-env.sh to enable gclogging, and was using the profiles in the webui
> just fine.  Due to some previously mentioned issues, I've had to restart
> drill bits due to GC issues etc.
>
> Now, while my profiles directory still exists, and my drill-override.conf
> has not been changed, no profiles now show up in the webui, even after
> drillbit restarts, and running more queries... The profiles are still being
> created (I can see them being added to the same profiles directory) just
> nothing shows up in the Web UI...
>
> What could be happening here?
>
> *scratching my head
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Hangout link?

2016-05-31 Thread Abdel Hakim Deneche
Sorry about the delay, there you go:

https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc

On Tue, May 31, 2016 at 9:57 AM, John Omernik  wrote:

>
>


-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Reading GC Logs

2016-05-31 Thread Abdel Hakim Deneche
My understanding (which is incomplete) is that both the "new reader" and
"dictionary encoding" are not stable yet and can cause failures or worse,
incorrect data. That's why they are disabled by default.

The "Allocation Failure" means that the JVM had to run a Full GC because it
couldn't allocate more heap for Drill. Looks like Drill is using more that
24GB of heap, which is most likely a bug.

What happens if you run the select part of the CTAS, does it also use too
much heap ?
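
That is, take the query out of the CTAS and run it on its own (the table and
path names below are placeholders, not your actual data), for example:

-- original statement: Parquet readers + Parquet writer
CREATE TABLE dfs.tmp.`mytable_out` AS
SELECT * FROM dfs.`/data/mytable` WHERE dir0 = '2016-05-30';

-- read side only: if this alone pushes heap toward 24GB, the readers are the
-- problem; if not, the Parquet writer is the likely culprit
SELECT * FROM dfs.`/data/mytable` WHERE dir0 = '2016-05-30';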


On Tue, May 31, 2016 at 8:54 AM, John Omernik  wrote:

> Oh, the query just stopped showing up in the profiles webui, completely
> gone like it never happened. Seems to be responding a bit better, the
> sqlline is still hung though.
>
> (Yes this is all related to my CTAS of the parquet data, at this point I am
> just looking for ways to handle the data and not make drill really unhappy.
> )
>
> On Tue, May 31, 2016 at 10:51 AM, John Omernik  wrote:
>
> > Also: Doing a CTAS using the new reader and dictionary encoding is
> > producing this, everything is hung at this point. The query in sqlline is
> > not returning, the web UI is running extremely slowly, and when it does
> > return, shows the running query, however, when I click on it, the profile
> > shows an error saying profile not found.  The Full GCs are happening
> quite
> > a bit, and take a long time (>10 seconds) And (this is my tailed gcclog,
> > it's actually writing part of the the "allocation error" message and then
> > waits a before anything else happens. This is "the scary" state my
> cluster
> > can get into, and I am trying to avoid this :) Any tips on what may be
> > happening here would be appreciated.
> >
> > (24 GB of Heap, 5 nodes at this point)
> >
> >
> >
> >
> >
> > 912.895: [Full GC (Allocation Failure)  23G->20G(24G), 11.7923015 secs]
> >
> > 2924.692: [GC concurrent-mark-abort]
> >
> > 2925.099: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0540177
> > secs]
> >
> > 2925.401: [GC pause (G1 Evacuation Pause) (young) (initial-mark)
> > 22G->21G(24G), 0.0638409 secs]
> >
> > 2925.465: [GC concurrent-root-region-scan-start]
> >
> > 2925.475: [GC concurrent-root-region-scan-end, 0.0097528 secs]
> >
> > 2925.475: [GC concurrent-mark-start]
> >
> > 2925.846: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0454322
> > secs]
> >
> > 2926.252: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0543209
> > secs]
> >
> > 2926.604: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0525408
> > secs]
> >
> > 2926.986: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0534530
> > secs]
> >
> > 2927.389: [GC concurrent-mark-end, 1.9133249 secs]
> >
> > 2927.405: [GC remark, 0.0446448 secs]
> >
> > 2927.462: [GC cleanup 22G->22G(24G), 0.0290235 secs]
> >
> > 2927.494: [GC concurrent-cleanup-start]
> >
> > 2927.494: [GC concurrent-cleanup-end, 0.190 secs]
> >
> > 2927.530: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0500267
> > secs]
> >
> > 2927.828: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0462845
> > secs]
> >
> > 2928.184: [GC pause (G1 Evacuation Pause) (young) (initial-mark)
> > 22G->21G(24G), 0.0749704 secs]
> >
> > 2928.259: [GC concurrent-root-region-scan-start]
> >
> > 2928.268: [GC concurrent-root-region-scan-end, 0.0093531 secs]
> >
> > 2928.268: [GC concurrent-mark-start]
> >
> > 2928.568: [GC pause (G1 Evacuation Pause) (young) 22G->22G(24G),
> 0.0555025
> > secs]
> >
> > 2928.952: [GC pause (G1 Evacuation Pause) (young) 23G->22G(24G),
> 0.0489993
> > secs]
> >
> > 2929.333: [GC pause (G1 Evacuation Pause) (young)-- 23G->22G(24G),
> > 0.0676159 secs]
> >
> > 2929.693: [GC pause (G1 Evacuation Pause) (young)-- 23G->23G(24G),
> > 0.2088768 secs]
> >
> > 2929.914: [Full GC (Allocation Failure)  23G->20G(24G), 11.6264600 secs]
> >
> > 2941.544: [GC concurrent-mark-abort]
> >
> > 2941.836: [GC pause (G1 Evacuation Pause) (young) 22G->20G(24G),
> 0.0416962
> > secs]
> >
> > 2942.127: [GC pause (G1 Evacuation Pause) (young) (initial-mark)
> > 22G->21G(24G), 0.0627406 secs]
> >
> > 2942.190: [GC concurrent-root-region-scan-start]
> >
> > 2942.193: [GC concurrent-root-region-scan-end, 0.0029795 secs]
> >
> > 2942.193: [GC concurrent-mark-start]
> >
> > 2942.548: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0591030
> > secs]
> >
> > 2942.934: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0589163
> > secs]
> >
> > 2943.304: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0459117
> > secs]
> >
> > 2943.743: [GC pause (G1 Evacuation Pause) (young) 22G->21G(24G),
> 0.0461640
> > secs]
> >
> > 2943.941: [GC concurrent-mark-end, 1.7476855 secs]
> >
> > 2943.953: [GC remark, 0.0356995 secs]
> >
> > 2944.000: [GC cleanup 22G->22G(24G), 0.0307393 secs]
> >
> > 2944.034: [GC concurrent-cleanup-start]
> >
> > 2944.034: [GC concurrent-cleanup-end, 0.281 secs]
> >
> > 2944.162: [GC pause (G1 Evacuation Pause) 

Re: Reading and converting Parquet files intended for Impala

2016-05-28 Thread Abdel Hakim Deneche
The new Parquet reader (the complex reader) is disabled by default. You can
enable it by setting the following option to true:

store.parquet.use_new_reader
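
For example, to flip it for the current session only:

-- enable the complex ("new") Parquet reader for this session
ALTER SESSION SET `store.parquet.use_new_reader` = true;

-- revert to the default reader
ALTER SESSION SET `store.parquet.use_new_reader` = false;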



On Sat, May 28, 2016 at 4:56 AM, John Omernik  wrote:

> I remember reading that drill uses two readers. One for certain cases ( I
> think flat structures) and the other for complex structures.  A. Am I
> remembering correctly? B. If so, can I determine via the plan or something
> which is being used? And C. Can I force Drill to try the other reader?
>
> On Saturday, May 28, 2016, Ted Dunning  wrote:
>
> > The Parquet user/dev mailing list might be helpful here. They have a real
> > stake in making sure that all readers/writers can work together. The
> > problem here really does sound like there is a borderline case that isn't
> > handled as well in the Drill special purpose parquet reader as in the
> > normal readers.
> >
> >
> >
> >
> >
> > On Fri, May 27, 2016 at 7:23 PM, John Omernik  > > wrote:
> >
> > > So working with MapR support we tried that with Impala, but it didn't
> > > produce the desired results because the outputfile worked fine in
> Drill.
> > > Theory: Evil file is created in Mapr Reduce, and is using a different
> > > writer than Impala is using. Impala can read the evil file, but when it
> > > writes it uses it's own writer, "fixing" the issue on the fly.  Thus,
> > Drill
> > > can't read evil file, but if we try to reduce with Impala, files is no
> > > longer evil, consider it... chaotic neutral ... (For all you D fans )
> > >
> > > I'd ideally love to extract into badness, but on the phone now with
> MapR
> > > support to figure out HOW, that is the question at hand.
> > >
> > > John
> > >
> > > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning  > >
> > > wrote:
> > >
> > > > On Thu, May 26, 2016 at 8:50 PM, John Omernik  > > wrote:
> > > >
> > > > > So, if we have a known "bad" Parquet file (I use quotes, because
> > > > remember,
> > > > > Impala queries this file just fine) created in Map Reduce, with a
> > > column
> > > > > causing Array Index Out of Bounds problems with a BIGINT typed
> > column.
> > > > What
> > > > > would your next steps be to troubleshoot?
> > > > >
> > > >
> > > > I would start reducing the size of the evil file.
> > > >
> > > > If you have a tool that can query the bad parquet and write a new one
> > > > (sounds like Impala might do here) then selecting just the evil
> column
> > > is a
> > > > good first step.
> > > >
> > > > After that, I would start bisecting to find a small range that still
> > > causes
> > > > the problem. There may not be such, but it is good thing to try.
> > > >
> > > > At that point, you could easily have the problem down to a few
> > kilobytes
> > > of
> > > > data that can be used in a unit test.
> > > >
> > >
> >
>
>
> --
> Sent from my iThing
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Issue with Queries Hanging

2016-05-23 Thread Abdel Hakim Deneche
One question about the missing query profile: do you store the query profiles
in the local file system or in the distributed file system?

On Mon, May 23, 2016 at 9:31 AM, John Omernik  wrote:

> Hey all, this is separate, yet related issue to my other posts RE Parquet,
> however, I thought I'd post this to see if this is normal or should be
> handled (and/or JIRAed)
>
> I am running Drill 1.6, if you've read the other posts, I am trying to CTAS
> a large amount of data (largish) 120 GB from Parquet to better Parquet.
>
> As I am running, I sometimes get the Index Out of Bounds (as in the other
> threads), but depending on source data and/or settings like using the new
> parquet reader, I get a odd situation.
>
> When I refresh the profile in the WebUI I get an error "VALIDATION ERROR: no
> profile with given query id '' exists"
>
> I am running this in sqlline, and at this point, there is no error, but I
> can't access my query profile.
>
> Other notes:
>
> 1. The webui is HORRIBLY slow
> 2. If I cancel the query, it will show me some written parquet, but obvious
> it wasn't finished
> 3. There are no errors in any of the drillbits log files (except the forman
> which starts to get "WARN" "Messos of mode (REQUEST OR RESPONSE) of type 8
> (or type 1) too longer than 500ms Actual duration was (high number of ms
> betwen 1900 and 3500 ms)
> 4. Like I said, no errors, just everything appears to hang.
>
> My memory and such seems good here, I have 96 GB of ram DIRECT per node,
> and 12 GB of HEAP per node, 5 nodes,.
>
> The cluster seems really sluggish and out of sorts until I restart drill
> bits... This seems like a very bad "error state"
>
> Has anyone seen this? Any thoughts on this? Should I open a JIRA?
>
>
> Thanks,
> John
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: test

2016-05-17 Thread Abdel Hakim Deneche
your test succeeded ;)

On Tue, May 17, 2016 at 10:17 AM, Khurram Faraaz 
wrote:

> test email
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: CTAS Out of Memory

2016-05-13 Thread Abdel Hakim Deneche
Stefan,

Can you share the query profile for the query that seems to be running
forever? You won't find it on disk, but you can append .json to the profile
web URL and save the file.

Thanks

On Fri, May 13, 2016 at 9:55 AM, Stefan Sedich 
wrote:

> Zelaine,
>
> It does, I forgot about those ones, I will do a test where I filter those
> out and see how I go, in my test with a 12GB heap size it seemed to just
> sit there forever and not finish.
>
>
> Thanks
>
> On Fri, May 13, 2016 at 9:50 AM Zelaine Fong  wrote:
>
> > Stefan,
> >
> > Does your source data contain varchar columns?  We've seen instances
> where
> > Drill isn't as efficient as it can be when Parquet is dealing with
> variable
> > length columns.
> >
> > -- Zelaine
> >
> > On Fri, May 13, 2016 at 9:26 AM, Stefan Sedich 
> > wrote:
> >
> > > Thanks for getting back to me so fast!
> > >
> > > I was just playing with that now, went up to 8GB and still ran into it,
> > > trying to go higher to see if I can find the sweet spot, only got 16GB
> > > total RAM on this laptop :)
> > >
> > > Is this an expected amount of memory for not an overly huge table (16
> > > million rows, 6 columns of integers), even now at a 12GB heap seems to
> > have
> > > filled up again.
> > >
> > >
> > >
> > > Thanks
> > >
> > > On Fri, May 13, 2016 at 9:20 AM Jason Altekruse 
> > wrote:
> > >
> > > > I could not find anywhere this is mentioned in the docs, but it has
> > come
> > > up
> > > > a few times one the list. While we made a number of efforts to move
> our
> > > > interactions with the Parquet library to the off-heap memory (which
> we
> > > use
> > > > everywhere else in the engine during processing) the version of the
> > > writer
> > > > we are using still buffers a non-trivial amount of data into heap
> > memory
> > > > when writing parquet files. Try raising your JVM heap memory in
> > > > drill-env.sh on startup and see if that prevents the out of memory
> > issue.
> > > >
> > > > Jason Altekruse
> > > > Software Engineer at Dremio
> > > > Apache Drill Committer
> > > >
> > > > On Fri, May 13, 2016 at 9:07 AM, Stefan Sedich <
> > stefan.sed...@gmail.com>
> > > > wrote:
> > > >
> > > > > Just trying to do a CTAS on a postgres table, it is not huge and
> only
> > > has
> > > > > 16 odd million rows, I end up with an out of memory after a while.
> > > > >
> > > > > Unable to handle out of memory condition in FragmentExecutor.
> > > > >
> > > > > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > > > >
> > > > >
> > > > > Is there a way to avoid this without needing to do the CTAS on a
> > subset
> > > > of
> > > > > my table?
> > > > >
> > > >
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Queries and Timeout

2016-05-13 Thread Abdel Hakim Deneche
Long-running queries shouldn't time out. This is most likely a bug.

Is it reproducible? Can you give more details about the query?

Thanks

On Mon, May 9, 2016 at 12:30 PM, Subbu Srinivasan 
wrote:

> What is the best way to implement queries that are long running? If queries
> take a long
> time I get this error.
>
> I understand that setting query timeouts are not yet supported in the JDBC
> interface.
> I get this error even if I run the query from the drill console (and  !set
> timeout -1)
>
> Error: SYSTEM ERROR: ConnectTimeoutException: connection timed out:
>
>
>
> --
> Pardon me for typos or  if I do not start with a hi or address you by name.
> Want to make sure
> my carpel tunnel syndrome does not get worse.
>
> Subbu
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: workspaces

2016-05-13 Thread Abdel Hakim Deneche
I believe Drill stores storage plugins in different places when running in
embedded mode vs distributed mode. Embedded mode uses local disk and
distributed mode uses Zookeeper.

On Fri, May 13, 2016 at 9:08 AM, Odin Guillermo Caudillo Gallegos <
odin.guille...@gmail.com> wrote:

> The plugins are working fine in the embbed mode, but when i start the
> drillbit on each server and connect via drill-conf i don't see them.
> Do i need to configure another parameter apart from the zookeeper servers
> in the drill-override.conf file?
>
> 2016-05-13 11:01 GMT-05:00 Andries Engelbrecht  >:
>
> > If Drill was correctly installed in distributed mode the storage plugin
> > and workspaces will be used by the Drill cluster.
> >
> > Make sure the plugin and workspace was correctly configured and accepted.
> >
> > Are you using the WebUI or REST to configure the storage plugins?
> >
> > --Andries
> >
> > > On May 13, 2016, at 8:48 AM, Odin Guillermo Caudillo Gallegos <
> > odin.guille...@gmail.com> wrote:
> > >
> > > Is there a way to configure workspaces on a distributed installation?
> > > Cause i only see the default plugin configuration but not the one that
> i
> > > created.
> > >
> > > Thanks
> >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: problem running drill in 10minutes tutorial on macpro

2016-05-04 Thread Abdel Hakim Deneche
Hey,

Unfortunately, the Apache mailing list blocks attachments, so we are not able
to see the error message. If you want, you can just copy-paste the error
messages here, or share a link to the screenshots.

Thanks

On Wed, May 4, 2016 at 4:23 PM, Rita Kuo  wrote:

> Hi,
>
> I tried follow the Drill in 10 minutes tutorial on my MacBook but was not
> able to run the bin/drill-embedded command successfully.
>
> I get the error message shown in screenCap1 attached
>
> I believe the problem is it was not able to execute the following line in
> drill-embedded.sh:
>
> exec ${bin}/sqlline -u "jdbc:drill:zk=local" "$@"
>
> I was also not able to execute the sqlline command directly ( see
> screenCap2)
>
>
> I am running Java version 1.8.0 on my MAC, OS X El Capitan.
>
>
> Thank You.
>
> Best Regards,
>
> Rita Kuo
>
>
>


-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Where is this C++ API mentioned in Drill docs

2016-04-09 Thread Abdel Hakim Deneche
I believe the source code is part of Drill distribution, in the following
folder:

contrib/native/client

It also contains an example folder with a fairly good demonstration of how
to use the API to submit queries to Drill.

Thanks

On Sat, Apr 9, 2016 at 2:23 PM, Devender Yadav  wrote:

> Hi All,
>
> As per Drill Docs [
> http://drill.apache.org/docs/architecture-introduction/#drill-clients],
> You
> can access Drill through the following interfaces:
>
>- Drill shell <
> http://drill.apache.org/docs/configuring-the-drill-shell/>
>- Drill Web Console
><
> http://drill.apache.org/docs/monitoring-and-canceling-queries-in-the-drill-web-console
> >
>- ODBC/JDBC
><
> http://drill.apache.org/docs/interfaces-introduction/#using-odbc-to-access-apache-drill-from-bi-tools
> >
>- C++ API
>
> Where is this C++ API?
>
> Regards,
> Devender
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: How to modify connection timeout delay ?

2016-04-05 Thread Abdel Hakim Deneche
can you share a specific query that consistently times out ?

what kind of data are you querying ?

are you running Drill in embedded mode or do you have a Drill cluster ?

in case of a cluster, what is the size and number of cores of your cluster ?

what version of Drill are you running ?

Thanks

On Tue, Apr 5, 2016 at 7:59 AM, COUERON Damien (i-BP - MICROPOLE) <
damien.coueron_s...@i-bp.fr> wrote:

> Apart from the log below, what kind of details are you interested in?
>
>
>
> -----Original Message-----
> From: Abdel Hakim Deneche [mailto:adene...@maprtech.com]
> Sent: Sunday, April 3, 2016 07:37
> To: user
> Subject: Re: How to modify connection timeout delay ?
>
> Hi Damien,
>
> Like Jason said, we have a heartbeat mechanism that should've prevented
> this issue altogether, so I'm interested to learn how this is happening.
> We've seen this happen many times but so far we were never able to
> reproduce it.
>
> Could you give us more details so we can reproduce the issue  ?
>
> Thanks
>
> On Thu, Mar 31, 2016 at 2:47 PM, COUERON Damien (i-BP - MICROPOLE) <
> damien.coueron_s...@i-bp.fr> wrote:
>
> > Hi Jason,
> >
> > Thanks for your help. I have set this parameter to 0 on every drillbit
> > and it works like a charm now.
> >
> > Regarding your questions, there was no particular query that triggered
> > this issue. Every query longer than 30 seconds was impacted.
> > Please find below the log messages I received :
> >
> > 2016-03-24 14:18:31,368 [290c16d8-4244-d664-4562-5b156f3e6fff:foreman]
> > INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
> > 290c16d8-4244-d664-4562-5b156f3e6fff: select count(columns[1]) from
> > hdfs.lemo.mails
> > 2016-03-24 14:18:31,417 [290c16d8-4244-d664-4562-5b156f3e6fff:foreman]
> > INFO  o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1
> > out of 1 using 1 threads. Time: 2ms total, 2.056748ms avg, 2ms max.
> > 2016-03-24 14:18:31,417 [290c16d8-4244-d664-4562-5b156f3e6fff:foreman]
> > INFO  o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1
> > out of 1 using 1 threads. Earliest start: 1.391000 μs, Latest start:
> > 1.391000 μs, Average start: 1.391000 μs .
> > 2016-03-24 14:18:31,466
> > [290c16d8-4244-d664-4562-5b156f3e6fff:frag:0:0]
> > INFO  o.a.d.e.w.fragment.FragmentExecutor -
> > 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State change requested
> > AWAITING_ALLOCATION --> RUNNING
> > 2016-03-24 14:18:31,466
> > [290c16d8-4244-d664-4562-5b156f3e6fff:frag:0:0]
> > INFO  o.a.d.e.w.f.FragmentStatusReporter -
> > 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State to report: RUNNING
> > 2016-03-24 14:19:01,570 [UserServer-1] INFO
> > o.a.drill.exec.rpc.user.UserServer - RPC connection /39.6.64.20:31010
> > <--> /39.6.64.22:53976 (user client) timed out.  Timeout was set to 30
> > seconds. Closing connection.
> > 2016-03-24 14:19:01,579 [CONTROL-rpc-event-queue] INFO
> > o.a.d.e.w.fragment.FragmentExecutor -
> > 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State change requested
> > RUNNING
> > --> CANCELLATION_REQUESTED
> > 2016-03-24 14:19:01,580 [CONTROL-rpc-event-queue] INFO
> > o.a.d.e.w.f.FragmentStatusReporter -
> > 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State to report:
> > CANCELLATION_REQUESTED
> > 2016-03-24 14:19:01,591
> > [290c16d8-4244-d664-4562-5b156f3e6fff:frag:0:0]
> > INFO  o.a.d.e.w.fragment.FragmentExecutor -
> > 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State change requested
> > CANCELLATION_REQUESTED --> FINISHED
> > 2016-03-24 14:19:01,591
> > [290c16d8-4244-d664-4562-5b156f3e6fff:frag:0:0]
> > INFO  o.a.d.e.w.f.FragmentStatusReporter -
> > 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State to report: CANCELLED
> > 2016-03-24 14:19:01,626 [UserServer-1] INFO
> > o.a.drill.exec.work.foreman.Foreman - Failure while trying communicate
> > query result to initiating client. This would happen if a client is
> > disconnected before response notice can be sent.
> > org.apache.drill.exec.rpc.ChannelClosedException: null
> > at
> > org.apache.drill.exec.rpc.CoordinationQueue$RpcListener.operationCompl
> > ete(CoordinationQueue.java:89)
> > [drill-rpc-1.6.0.jar:1.6.0]
> > at
> > org.apache.drill.exec.rpc.CoordinationQueue$RpcListener.operationCompl
> > ete(CoordinationQueue.java:67)
> > [drill-rpc-1.6.0.jar:1.6.0]
> > at
> > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise
> > .java:680) [netty-common-4.0.27.Final.jar:4.0.27.Final]
> > at
> > io.

Re: Reading Avro Arrays

2016-04-03 Thread Abdel Hakim Deneche
pull requests are fine. You still need a JIRA though

On Sun, Apr 3, 2016 at 8:03 AM, Johannes Schulte  wrote:

> I now extended the AvroFormatTest-Suite by two unit tests that show that
>
> * Flattening of primitive array works as expected
> * Flattening of arrays of records does not work properly
>
> I spent some time trying to find the reason but it's my first contact with
> the drill-codebase.
>
> Is the recommended way of making this unit test available still to attach a
> patch in an issue or is a pull-request also an option?
>
> In the context of the recent avro maturity discussion I would love to fix
> this error myself but I would need some hints what goes wrong there
> internally.
>
> Johannes
>
> On Fri, Mar 25, 2016 at 10:50 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
>
> > Hi Stefan, hi Jacques, thanks for going after this - I almost resignated
> > but know i think it was because i accessed the data over jdbc with
> squirrel
> > and got irritated by the unknown type column there. nonetheless, if the
> > schema looks like this:
> >
> >
> > {
> >   "type" : "record",
> >   "name" : "MainRecord",
> >   "namespace" : "drizz.WriteAvroTestFileForDrill$",
> >   "fields" : [ {
> > "name" : "elements",
> > "type" : {
> >   "type" : "array",
> >   "items" : {
> > "type" : "record",
> > "name" : "NestedRecord",
> > "fields" : [ {
> >   "name" : "field1",
> >   "type" : "int"
> > } ]
> >   },
> >   "java-class" : "java.util.List"
> > }
> >   } ]
> > }
> >
> > and the contents looks like this (according to avro tojson command line
> > utility)
> >
> >
> >
> {"elements":[{"field1":0},{"field1":1},{"field1":2},{"field1":3},{"field1":4},{"field1":5},{"field1":6},{"field1":7},{"field1":8},{"field1":9}]}
> >
> >
> {"elements":[{"field1":0},{"field1":1},{"field1":2},{"field1":3},{"field1":4},{"field1":5},{"field1":6},{"field1":7},{"field1":8},{"field1":9}]}
> >
> > a query like
> >
> > select flatten(elements) from
> > dfs.`/Users/j.schulte/data/avro-drill/no-union/`;
> >
> > yields exactly two rows:
> > +---+
> > |EXPR$0 |
> > +---+
> > | {"field1":9}  |
> > | {"field1":9}  |
> > +---+
> >
> > as if only the last element in the array would survive.
> >
> > Thanks for your help so far..
> >
> > On Fri, Mar 25, 2016 at 5:45 PM, Stefán Baxter <
> ste...@activitystream.com>
> > wrote:
> >
> >> Johannes, Jacques is right.
> >>
> >> I only tested the flattening of maps and not the flattening of
> >> list-of-maps.
> >>
> >> -Stefan
> >>
> >> On Fri, Mar 25, 2016 at 4:12 PM, Jacques Nadeau 
> >> wrote:
> >>
> >> > I think there is some incorrect information and confusion in this
> >> thread.
> >> > Could you please share a piece of sample data and a specific query?
> The
> >> > error message shown in your original email is suggesting that you were
> >> > trying to flatten a map rather than an array of maps. Flatten is for
> >> arrays
> >> > only. The arrays can have scalars or complex objects in them.
> >> >
> >> > --
> >> > Jacques Nadeau
> >> > CTO and Co-Founder, Dremio
> >> >
> >> > On Fri, Mar 25, 2016 at 2:00 AM, Johannes Schulte <
> >> > johannes.schu...@gmail.com> wrote:
> >> >
> >> > > Hi Stefan,
> >> > >
> >> > > thanks for this information - so it seems that there is currently no
> >> way
> >> > of
> >> > > accessing nested rich objects with drill; I somehow got that wrong
> >> from
> >> > the
> >> > > documentation...
> >> > >
> >> > > Cheers,
> >> > > Johannes
> >> > >
> >> > > On Thu, Mar 24, 2016 at 2:14 PM, Stefán Baxter <
> >> > ste...@activitystream.com>
> >> > > wrote:
> >> > >
> >> > > > FYI: flattening of embedded structures is not supported in Parquet
> >> > > either.
> >> > > >
> >> > > > Regards,
> >> > > >  -Stefan
> >> > > >
> >> > > > On Wed, Mar 23, 2016 at 8:51 PM, Johannes Schulte <
> >> > > > johannes.schu...@gmail.com> wrote:
> >> > > >
> >> > > > > Hi Stefan,
> >> > > > >
> >> > > > > thanks for your response and the link to your udf repository,
> >> it's a
> >> > > good
> >> > > > > reference. I tried drill 1.6, the data is an array of complex
> >> objects
> >> > > > > though. I will try to setup a drill dev environment and see if i
> >> can
> >> > > > modify
> >> > > > > the tests to fail.
> >> > > > >
> >> > > > > Johannes
> >> > > > >
> >> > > > > On Wed, Mar 23, 2016 at 8:13 PM, Stefán Baxter <
> >> > > > ste...@activitystream.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > FYI. this seems to be working in 1.6, at least on the Avro
> data
> >> > that
> >> > > we
> >> > > > > > have.
> >> > > > > >
> >> > > > > > On Wed, Mar 23, 2016 at 6:59 PM, Stefán Baxter <
> >> > > > > ste...@activitystream.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Hi again,
> >> > > > > > >
> >> > > > > > > What version of Drill are you using?
> >> > > > > > >
> >> > > > > > > Regards,
> >> > > > > > > - 

Re: How to modify connection timeout delay ?

2016-04-02 Thread Abdel Hakim Deneche
Hi Damien,

Like Jason said, we have a heartbeat mechanism that should've prevented
this issue altogether, so I'm interested to learn how this is happening.
We've seen this happen many times but so far we were never able to
reproduce it.

Could you give us more details so we can reproduce the issue  ?

Thanks

On Thu, Mar 31, 2016 at 2:47 PM, COUERON Damien (i-BP - MICROPOLE) <
damien.coueron_s...@i-bp.fr> wrote:

> Hi Jason,
>
> Thanks for your help. I have set this parameter to 0 on every drillbit and
> it works like a charm now.
>
> Regarding your questions, there was no particular query that triggered
> this issue. Every query longer than 30 seconds was impacted.
> Please find below the log messages I received :
>
> 2016-03-24 14:18:31,368 [290c16d8-4244-d664-4562-5b156f3e6fff:foreman]
> INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
> 290c16d8-4244-d664-4562-5b156f3e6fff: select count(columns[1]) from
> hdfs.lemo.mails
> 2016-03-24 14:18:31,417 [290c16d8-4244-d664-4562-5b156f3e6fff:foreman]
> INFO  o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out
> of 1 using 1 threads. Time: 2ms total, 2.056748ms avg, 2ms max.
> 2016-03-24 14:18:31,417 [290c16d8-4244-d664-4562-5b156f3e6fff:foreman]
> INFO  o.a.d.e.s.schedule.BlockMapBuilder - Get block maps: Executed 1 out
> of 1 using 1 threads. Earliest start: 1.391000 μs, Latest start: 1.391000
> μs, Average start: 1.391000 μs .
> 2016-03-24 14:18:31,466 [290c16d8-4244-d664-4562-5b156f3e6fff:frag:0:0]
> INFO  o.a.d.e.w.fragment.FragmentExecutor -
> 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State change requested
> AWAITING_ALLOCATION --> RUNNING
> 2016-03-24 14:18:31,466 [290c16d8-4244-d664-4562-5b156f3e6fff:frag:0:0]
> INFO  o.a.d.e.w.f.FragmentStatusReporter -
> 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State to report: RUNNING
> 2016-03-24 14:19:01,570 [UserServer-1] INFO
> o.a.drill.exec.rpc.user.UserServer - RPC connection /39.6.64.20:31010
> <--> /39.6.64.22:53976 (user client) timed out.  Timeout was set to 30
> seconds. Closing connection.
> 2016-03-24 14:19:01,579 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.fragment.FragmentExecutor -
> 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State change requested RUNNING
> --> CANCELLATION_REQUESTED
> 2016-03-24 14:19:01,580 [CONTROL-rpc-event-queue] INFO
> o.a.d.e.w.f.FragmentStatusReporter -
> 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State to report:
> CANCELLATION_REQUESTED
> 2016-03-24 14:19:01,591 [290c16d8-4244-d664-4562-5b156f3e6fff:frag:0:0]
> INFO  o.a.d.e.w.fragment.FragmentExecutor -
> 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State change requested
> CANCELLATION_REQUESTED --> FINISHED
> 2016-03-24 14:19:01,591 [290c16d8-4244-d664-4562-5b156f3e6fff:frag:0:0]
> INFO  o.a.d.e.w.f.FragmentStatusReporter -
> 290c16d8-4244-d664-4562-5b156f3e6fff:0:0: State to report: CANCELLED
> 2016-03-24 14:19:01,626 [UserServer-1] INFO
> o.a.drill.exec.work.foreman.Foreman - Failure while trying communicate
> query result to initiating client. This would happen if a client is
> disconnected before response notice can be sent.
> org.apache.drill.exec.rpc.ChannelClosedException: null
> at
> org.apache.drill.exec.rpc.CoordinationQueue$RpcListener.operationComplete(CoordinationQueue.java:89)
> [drill-rpc-1.6.0.jar:1.6.0]
> at
> org.apache.drill.exec.rpc.CoordinationQueue$RpcListener.operationComplete(CoordinationQueue.java:67)
> [drill-rpc-1.6.0.jar:1.6.0]
> at
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:788)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:689)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1114)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:705)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:980)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> 

Re: simple join failing for SQL server.

2016-03-31 Thread Abdel Hakim Deneche
This is a known issue:

https://issues.apache.org/jira/browse/DRILL-4398



On Thu, Mar 31, 2016 at 9:33 AM, Devender Yadav <dev@gmail.com> wrote:

> Tested same query with Drill 1.5 and 1.6 with no success.
>
>
> Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query.
> Memory leaked: (73728)
> Allocator(op:0:0:3:JdbcSubScan) 100/73728/569344/100
> (res/actual/peak/limit)
>
>
>
> Regards,
> Devender
>
> On Thu, Mar 31, 2016 at 12:45 PM, Devender Yadav <dev@gmail.com>
> wrote:
>
> > Yes Abdel. I will try with 1.6 & let you know.
> >
> > Regards,
> > Devender
> >
> > On Thu, Mar 31, 2016 at 12:43 PM, Abdel Hakim Deneche <
> > adene...@maprtech.com> wrote:
> >
> >> "we" did fix so many of them. =P
> >>
> >> On Thu, Mar 31, 2016 at 8:12 AM, Abdel Hakim Deneche <
> >> adene...@maprtech.com>
> >> wrote:
> >>
> >> > Hi Devender,
> >> >
> >> > Whenever you see such errors, it's Drill's internal memory accounting
> >> > reporting a memory leak. This is "always" a bug, but you did fix so
> >> many of
> >> > them(*) since 1.4 and we even improved the memory allocator in 1.5. Do
> >> you
> >> > want to try again on the latest version and see if you still see this
> >> issue
> >> > ?
> >> >
> >> > (*) not all of them of course ;)
> >> >
> >> > On Thu, Mar 31, 2016 at 8:08 AM, Devender Yadav <dev@gmail.com>
> >> wrote:
> >> >
> >> >> Forgot to add details about Drill.
> >> >>
> >> >> Drill Version -  1.4
> >> >> OS -Ubuntu 14.0.4
> >> >> Mode- Embedded
> >> >> Memory -DRILL_MAX_DIRECT_MEMORY="4G"
> >> >>DRILL_HEAP="2G"
> >> >>
> >> >> Regards,
> >> >> Devender
> >> >>
> >> >> On Thu, Mar 31, 2016 at 12:34 PM, Devender Yadav <dev@gmail.com>
> >> >> wrote:
> >> >>
> >> >> > Hi All,
> >> >> >
> >> >> > I tried a simple join in two tables in SQL server for testing
> >> purpose.
> >> >> >
> >> >> > Two tables test01 & test02 with same data (just 4 records).
> >> >> >
> >> >> > Join query:
> >> >> >
> >> >> > *select t1.num_tiny, t2.num_small from mssql.dbo.`test01` t1 join
> >> >> > mssql.dbo.`test02` t2 on t1.num_int = t2.num_int;*
> >> >> >
> >> >> >
> >> >> > *Error: SYSTEM ERROR: IllegalStateException: Failure while closing
> >> >> > accountor.  Expected private and shared pools to be set to initial
> >> >> values.
> >> >> > However, one or more were not.  Stats are*
> >> >> > * zone init allocated delta *
> >> >> > * private 100 926272 73728 *
> >> >> > * shared 00 00 0.*
> >> >> >
> >> >> > *Fragment 0:0*
> >> >> >
> >> >> > What could be the reason for that? Anything wrong from my side or
> is
> >> it
> >> >> a
> >> >> > bug?
> >> >> >
> >> >> >
> >> >> > Regards,
> >> >> > Devender
> >> >> >
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> >
> >> > Abdelhakim Deneche
> >> >
> >> > Software Engineer
> >> >
> >> >   <http://www.mapr.com/>
> >> >
> >> >
> >> > Now Available - Free Hadoop On-Demand Training
> >> > <
> >>
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >>
> >> Abdelhakim Deneche
> >>
> >> Software Engineer
> >>
> >>   <http://www.mapr.com/>
> >>
> >>
> >> Now Available - Free Hadoop On-Demand Training
> >> <
> >>
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> >> >
> >>
> >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available>


Re: simple join failing for SQL server.

2016-03-31 Thread Abdel Hakim Deneche
This is an interesting observation. Can you add it as a comment to
DRILL-4398, it may help fix the issue.

Thanks

On Thu, Mar 31, 2016 at 10:25 AM, Devender Yadav <dev@gmail.com> wrote:

> I want to add one more observation.
>
> If I create a different plugin with same configuration
>
> Say my SQL Server plugin name is mssql. I created mssql1 with same
> configuration and modified query with mssql1 plugin name on 2nd table.
>
> *select t1.num_tiny, t2.num_small from mssql.dbo.`test01` t1 join
> mssql1.dbo.`test02` t2 on t1.num_int = t2.num_int;*
>
> *The query returned expected output.*
>
>
>
> On Thu, Mar 31, 2016 at 2:03 PM, Devender Yadav <dev@gmail.com> wrote:
>
> > Tested same query with Drill 1.5 and 1.6 with no success.
> >
> >
> > Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query.
> > Memory leaked: (73728)
> > Allocator(op:0:0:3:JdbcSubScan) 100/73728/569344/100
> > (res/actual/peak/limit)
> >
> >
> >
> > Regards,
> > Devender
> >
> > On Thu, Mar 31, 2016 at 12:45 PM, Devender Yadav <dev@gmail.com>
> > wrote:
> >
> >> Yes Abdel. I will try with 1.6 & let you know.
> >>
> >> Regards,
> >> Devender
> >>
> >> On Thu, Mar 31, 2016 at 12:43 PM, Abdel Hakim Deneche <
> >> adene...@maprtech.com> wrote:
> >>
> >>> "we" did fix so many of them. =P
> >>>
> >>> On Thu, Mar 31, 2016 at 8:12 AM, Abdel Hakim Deneche <
> >>> adene...@maprtech.com>
> >>> wrote:
> >>>
> >>> > Hi Devender,
> >>> >
> >>> > Whenever you see such errors, it's Drill's internal memory accounting
> >>> > reporting a memory leak. This is "always" a bug, but you did fix so
> >>> many of
> >>> > them(*) since 1.4 and we even improved the memory allocator in 1.5.
> Do
> >>> you
> >>> > want to try again on the latest version and see if you still see this
> >>> issue
> >>> > ?
> >>> >
> >>> > (*) not all of them of course ;)
> >>> >
> >>> > On Thu, Mar 31, 2016 at 8:08 AM, Devender Yadav <dev@gmail.com>
> >>> wrote:
> >>> >
> >>> >> Forgot to add details about Drill.
> >>> >>
> >>> >> Drill Version -  1.4
> >>> >> OS -Ubuntu 14.0.4
> >>> >> Mode- Embedded
> >>> >> Memory -DRILL_MAX_DIRECT_MEMORY="4G"
> >>> >>DRILL_HEAP="2G"
> >>> >>
> >>> >> Regards,
> >>> >> Devender
> >>> >>
> >>> >> On Thu, Mar 31, 2016 at 12:34 PM, Devender Yadav <dev@gmail.com
> >
> >>> >> wrote:
> >>> >>
> >>> >> > Hi All,
> >>> >> >
> >>> >> > I tried a simple join in two tables in SQL server for testing
> >>> purpose.
> >>> >> >
> >>> >> > Two tables test01 & test02 with same data (just 4 records).
> >>> >> >
> >>> >> > Join query:
> >>> >> >
> >>> >> > *select t1.num_tiny, t2.num_small from mssql.dbo.`test01` t1 join
> >>> >> > mssql.dbo.`test02` t2 on t1.num_int = t2.num_int;*
> >>> >> >
> >>> >> >
> >>> >> > *Error: SYSTEM ERROR: IllegalStateException: Failure while closing
> >>> >> > accountor.  Expected private and shared pools to be set to initial
> >>> >> values.
> >>> >> > However, one or more were not.  Stats are*
> >>> >> > * zone init allocated delta *
> >>> >> > * private 100 926272 73728 *
> >>> >> > * shared 00 00 0.*
> >>> >> >
> >>> >> > *Fragment 0:0*
> >>> >> >
> >>> >> > What could be the reason for that? Anything wrong from my side or
> >>> is it
> >>> >> a
> >>> >> > bug?
> >>> >> >
> >>> >> >
> >>> >> > Regards,
> >>> >> > Devender
> >>> >> >
> >>> >>
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> >
> >>> > Abdelhakim Deneche
> >>> >
> >>> > Software Engineer
> >>> >
> >>> >   <http://www.mapr.com/>
> >>> >
> >>> >
> >>> > Now Available - Free Hadoop On-Demand Training
> >>> > <
> >>>
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>> Abdelhakim Deneche
> >>>
> >>> Software Engineer
> >>>
> >>>   <http://www.mapr.com/>
> >>>
> >>>
> >>> Now Available - Free Hadoop On-Demand Training
> >>> <
> >>>
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> >>> >
> >>>
> >>
> >>
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available>


Re: simple join failing for SQL server.

2016-03-31 Thread Abdel Hakim Deneche
"we" did fix so many of them. =P

On Thu, Mar 31, 2016 at 8:12 AM, Abdel Hakim Deneche <adene...@maprtech.com>
wrote:

> Hi Devender,
>
> Whenever you see such errors, it's Drill's internal memory accounting
> reporting a memory leak. This is "always" a bug, but you did fix so many of
> them(*) since 1.4 and we even improved the memory allocator in 1.5. Do you
> want to try again on the latest version and see if you still see this issue
> ?
>
> (*) not all of them of course ;)
>
> On Thu, Mar 31, 2016 at 8:08 AM, Devender Yadav <dev@gmail.com> wrote:
>
>> Forgot to add details about Drill.
>>
>> Drill Version -  1.4
>> OS -Ubuntu 14.0.4
>> Mode- Embedded
>> Memory -DRILL_MAX_DIRECT_MEMORY="4G"
>>DRILL_HEAP="2G"
>>
>> Regards,
>> Devender
>>
>> On Thu, Mar 31, 2016 at 12:34 PM, Devender Yadav <dev@gmail.com>
>> wrote:
>>
>> > Hi All,
>> >
>> > I tried a simple join in two tables in SQL server for testing purpose.
>> >
>> > Two tables test01 & test02 with same data (just 4 records).
>> >
>> > Join query:
>> >
>> > *select t1.num_tiny, t2.num_small from mssql.dbo.`test01` t1 join
>> > mssql.dbo.`test02` t2 on t1.num_int = t2.num_int;*
>> >
>> >
>> > *Error: SYSTEM ERROR: IllegalStateException: Failure while closing
>> > accountor.  Expected private and shared pools to be set to initial
>> values.
>> > However, one or more were not.  Stats are*
>> > * zone init allocated delta *
>> > * private 100 926272 73728 *
>> > * shared 00 00 0.*
>> >
>> > *Fragment 0:0*
>> >
>> > What could be the reason for that? Anything wrong from my side or is it
>> a
>> > bug?
>> >
>> >
>> > Regards,
>> > Devender
>> >
>>
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
>
> Now Available - Free Hadoop On-Demand Training
> <http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available>
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available>


Re: simple join failing for SQL server.

2016-03-31 Thread Abdel Hakim Deneche
Hi Devender,

Whenever you see such errors, it's Drill's internal memory accounting
reporting a memory leak. This is "always" a bug, but you did fix so many of
them(*) since 1.4 and we even improved the memory allocator in 1.5. Do you
want to try again on the latest version and see if you still see this issue
?

(*) not all of them of course ;)

On Thu, Mar 31, 2016 at 8:08 AM, Devender Yadav  wrote:

> Forgot to add details about Drill.
>
> Drill Version -  1.4
> OS -Ubuntu 14.0.4
> Mode- Embedded
> Memory -DRILL_MAX_DIRECT_MEMORY="4G"
>DRILL_HEAP="2G"
>
> Regards,
> Devender
>
> On Thu, Mar 31, 2016 at 12:34 PM, Devender Yadav 
> wrote:
>
> > Hi All,
> >
> > I tried a simple join in two tables in SQL server for testing purpose.
> >
> > Two tables test01 & test02 with same data (just 4 records).
> >
> > Join query:
> >
> > *select t1.num_tiny, t2.num_small from mssql.dbo.`test01` t1 join
> > mssql.dbo.`test02` t2 on t1.num_int = t2.num_int;*
> >
> >
> > *Error: SYSTEM ERROR: IllegalStateException: Failure while closing
> > accountor.  Expected private and shared pools to be set to initial
> values.
> > However, one or more were not.  Stats are*
> > * zone init allocated delta *
> > * private 100 926272 73728 *
> > * shared 00 00 0.*
> >
> > *Fragment 0:0*
> >
> > What could be the reason for that? Anything wrong from my side or is it a
> > bug?
> >
> >
> > Regards,
> > Devender
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: IllegalStateException: Memory was leaked by query - Drill error on wide table, but OK on a narrow but longer table.

2016-03-26 Thread Abdel Hakim Deneche
Hey Edmon,

Can you search the logs for the errorId displayed with this error? I've
seen a similar issue when the external sort fails to spill to disk (no space
left on disk): it leaks memory and displays this error message instead
of the original issue (the failure to spill to disk).

I will open a JIRA to fix the error message. Waiting on you to confirm it's
indeed the same issue.

Thanks
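
In case it helps, a quick way to pull those entries on each node, assuming the
default log location under $DRILL_HOME/log (adjust the path if your logs go
elsewhere):

# search the drillbit log for the errorId reported with the failed query
grep -A 20 "0606ff19-1c3a-4611-a3d4-1d28d9b3bd60" $DRILL_HOME/log/drillbit.log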

On Mon, Mar 21, 2016 at 1:57 PM, Edmon Begoli  wrote:

> We are converting some raw CMS data from csv to parquet using Drill, and
> using partitioning as we go.
>
> Query 1 runs OK on a narrower file:
>
> size:
> 13G Mar 16 18:20 out_revenuej_lds_100_201412.csv (Month 12 file)
> 13G Mar 16 16:37 out_claimsj_lds_100_2014_q1.csv (Quarter 1 file)
>
> q1 has 198 columns;
> month 12 has 32 columns.
>
> Both are partitioned on the same unique ID resulting in 14 buckets.
> Resulting parquet for month 12:  2.6G parquet/outpatient_revenue_12
>
> It fails on out_claimsj_lds_100_2014_q1.csv with 198 columns.
>
> Error:
> 
>
> *Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query.
> Memory leaked: (8530784)*
> *Allocator(op:0:0:6:ExternalSort) 2000/8530784/357974944/357913941
> (res/actual/peak/limit)*
>
>
> *Fragment 0:0*
>
> *[Error Id: 0606ff19-1c3a-4611-a3d4-1d28d9b3bd60 on cyclone-backend:31010]
> (state=,code=0)*
>
>
> Environment
> -
> (upped memory to MAX RAM - 4 GB to try to make it pass)
>
>
> drill-env.sh
> DRILL_MAX_DIRECT_MEMORY="60G"
> DRILL_HEAP="48G"
>
> export DRILL_JAVA_OPTS="-Xms$DRILL_HEAP -Xmx$DRILL_HEAP
> -XX:MaxDirectMemorySize=$DRILL
> _MAX_DIRECT_MEMORY -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=1G
> -Ddrill.exec.enab
> le-epoll=true"
>
> # Class unloading is disabled by default in Java 7
> #
>
> http://hg.openjdk.java.net/jdk7u/jdk7u60/hotspot/file/tip/src/share/vm/runtime/globa
> ls.hpp#l1622
> export SERVER_GC_OPTS="-XX:+CMSClassUnloadingEnabled -XX:+UseG1GC "
>
> Query 1
> ---
>
> -- Success
>
> --12
> CREATE TABLE outpatient_revenue_12
> PARTITION BY (UID_CLASS) AS SELECT CAST( SUBSTR(columns[0],1,2) as INT) as
>  UID_CLASS, CASE WHEN columns[0] = '' THEN NULL ELSE CAST(columns[0] as
> DOUBLE) END  as DSYSRTKY, CASE WHEN columns[1] = '' THEN NULL ELSE
> CAST(columns[1] as DOUBLE) END  as CLAIMNO, CASE WHEN columns[2] = '' THEN
> NULL ELSE CAST(columns[2] as DOUBLE) END  as CLM_LN, CASE WHEN columns[3] =
> '' THEN NULL ELSE TO_DATE(columns[3], 'MMdd') END  as THRU_DT,
> NULLIF(columns[4],'') as CLM_TYPE, NULLIF(columns[5],'') as REV_CNTR, CASE
> WHEN columns[6] = '' THEN NULL ELSE TO_DATE(columns[6], 'MMdd') END  as
> REV_DT, NULLIF(columns[7],'') as APCHIPPS, NULLIF(columns[8],'') as
> HCPCS_CD, NULLIF(columns[9],'') as MDFR_CD1, NULLIF(columns[10],'') as
> MDFR_CD2, NULLIF(columns[11],'') as PMTMTHD, NULLIF(columns[12],'') as
> DSCNTIND, NULLIF(columns[13],'') as PACKGIND, NULLIF(columns[14],'') as
> OTAF_1, NULLIF(columns[15],'') as IDENDC, CASE WHEN columns[16] = '' THEN
> NULL ELSE CAST(columns[16] as DOUBLE) END  as REV_UNIT, CASE WHEN
> columns[17] = '' THEN NULL ELSE CAST(columns[17] as DOUBLE) END  as
> REV_RATE, CASE WHEN columns[18] = '' THEN NULL ELSE CAST(columns[18] as
> DOUBLE) END  as REVBLOOD, CASE WHEN columns[19] = '' THEN NULL ELSE
> CAST(columns[19] as DOUBLE) END  as REVDCTBL, CASE WHEN columns[20] = ''
> THEN NULL ELSE CAST(columns[20] as DOUBLE) END  as WAGEADJ, CASE WHEN
> columns[21] = '' THEN NULL ELSE CAST(columns[21] as DOUBLE) END  as
> RDCDCOIN, CASE WHEN columns[22] = '' THEN NULL ELSE CAST(columns[22] as
> DOUBLE) END  as REV_MSP1, CASE WHEN columns[23] = '' THEN NULL ELSE
> CAST(columns[23] as DOUBLE) END  as REV_MSP2, CASE WHEN columns[24] = ''
> THEN NULL ELSE CAST(columns[24] as DOUBLE) END  as RPRVDPMT, CASE WHEN
> columns[25] = '' THEN NULL ELSE CAST(columns[25] as DOUBLE) END  as
> RBENEPMT, CASE WHEN columns[26] = '' THEN NULL ELSE CAST(columns[26] as
> DOUBLE) END  as PTNTRESP, CASE WHEN columns[27] = '' THEN NULL ELSE
> CAST(columns[27] as DOUBLE) END  as REVPMT, CASE WHEN columns[28] = '' THEN
> NULL ELSE CAST(columns[28] as DOUBLE) END  as REV_CHRG, CASE WHEN
> columns[29] = '' THEN NULL ELSE CAST(columns[29] as DOUBLE) END  as
> REV_NCVR, NULLIF(columns[30],'') as REVSTIND, NULLIF(columns[31],'') as
> REV_CNTR_PRICNG_IND_CD
> FROM
> dfs.`default`.`/data/cms/2014_outpatient/out_revenuej_lds_100_201412.csv`;
>
>
> Query 2
> ---
>
> -- Failed
>
> -- Q1
> CREATE TABLE base_outpatient_q1
> PARTITION BY (UID_CLASS)
> AS
> SELECT CAST( SUBSTR(columns[0],1,2) as INT) as  UID_CLASS, CASE WHEN
> columns[0] = '' THEN NULL ELSE CAST(columns[0] as DOUBLE) END  as
> `DSYSRTKY`, CASE WHEN columns[1] = '' THEN NULL ELSE CAST(columns[1] as
> DOUBLE) END  as `CLAIMNO`, NULLIF(columns[2],'') as `PROVIDER`, CASE WHEN
> columns[3] = '' THEN NULL ELSE TO_DATE(columns[3], 'MMdd') END  as
> `THRU_DT`, NULLIF(columns[4],'') as `RIC_CD`, NULLIF(columns[5],'') as
> `CLM_TYPE`, 

Re: Code too large

2016-03-24 Thread Abdel Hakim Deneche
This exception states that the code generated for the Project operator is too
large for the Java compiler. Can you share the query that caused this failure?

On Thu, Mar 24, 2016 at 1:27 PM, Edmon Begoli  wrote:

> Does anyone know what might be causing this exception:
>
> *Error: SYSTEM ERROR: CompileException: File
> 'org.apache.drill.exec.compile.DrillJavaFileObject[ProjectorGen10.java]',
> Line 7275, Column 17: ProjectorGen10.java:7275: error: code too large*
>
> *public void doEval(int inIndex, int outIndex)*
>
> *^ (compiler.err.limit.code)*
>
>
> *Fragment 0:0*
>
>
> *[Error Id: 687009ec-4d55-443a-9066-218fb3ac8adb on localhost:31010]
> (state=,code=0)*
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: unable to start Drill 1.6.0

2016-03-19 Thread Abdel Hakim Deneche
I'm not 100% sure, but I think some changes went in that require all UDFs
to be recompiled with Drill 1.6.0. Did you recompile your UDFs ?

Also, you can check drillbit.log and drillbit.out. When a Drillbit starts, it
logs any problems it finds with the UDFs.
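
For what it's worth, the drill-module.conf that ships inside the UDF jar
(visible in the jar listing below) is normally all that is needed; the stock
conf/drill-module.conf of the installation should not be replaced. A minimal
sketch of what that bundled file usually contains, assuming the package name
used in this thread:

# drill-module.conf bundled at the root of udfutil-0.0.1-SNAPSHOT.jar
# appends the UDF package to Drill's classpath scanning instead of replacing it
drill.classpath.scanning.packages += "com.companyname.drill"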

On Thu, Mar 17, 2016 at 4:25 PM, Shankar Mane <shankar.m...@games24x7.com>
wrote:

>
> 
>
> 1. Drill in cluster is *working fine *when *customized* drill-module.conf
> file is *not present *in dir "apache-drill-1.6.0/conf/drill-module.conf"
>
>
>
> 
>
>
> 2. Custom UDF is not working as describe below :
>
> i have copied my custom UDF into dir "apache-drill-1.6.0/jars/3rdparty" on
> all nodes and restarted all drillbits.
>
>
> udf filename=udfutil-0.0.1-SNAPSHOT.jar
> jar *structure* -
> /*
> META-INF/
> META-INF/MANIFEST.MF
> com/
> com/companyname/
> com/companyname/drill/
> drill-module.conf
> com/companyname/drill/channeltest.class
> com/companyname/drill/DateFunc.class
> com/companyname/drill/DateExtract.class
> com/companyname/drill/DecodeURI.class
> com/companyname/drill/ChannelID.class
> com/companyname/drill/BrowserFuncNew.class
> com/companyname/drill/ToDate.class
> META-INF/maven/
> META-INF/maven/com.companyname.drill.udf/
> META-INF/maven/com.companyname.drill.udf/udfutil/
> META-INF/maven/com.companyname.drill.udf/udfutil/pom.xml
> META-INF/maven/com.companyname.drill.udf/udfutil/pom.properties
> */
>
>
> -- And login to drill to check whether function is working or not
> /*
> 0: jdbc:drill:> select DateFunc(1458228298) from (values(1)) ;
> *Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 26: No
> match found for function signature DateFunc()*
> */
>
> *IT FAILED*
>
>
>
>
> 
>
> 3. Now, as described on the website, I edited the file "*drill-module.conf*",
> copied it to all nodes in the cluster, and restarted all drillbits.
>
> vi apache-drill-1.6.0/conf/drill-module.conf
>
> /*
> drill: {
> classpath.scanning: {
> packages: [
> "com.companyname.drill.*"
> ]
> }
> }
> */
>
> *But DRILL GET SHUTDOWN on all nodes.*
>
>
>
>
> *Please help me to resolve this issue, or suggest any other way to invoke
> my custom UDFs. *
>
>
>
>
>
> On Thu, Mar 17, 2016 at 6:50 PM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > Easiest fix when Drill fails to load a storage plugin is to delete the
> > existing configurations. Deleting /tmp/drill/ should resolve this.
> >
> > I know this may not be practical in some cases, and other developers may
> > give you a better solution.
> >
> > On Thu, Mar 17, 2016 at 2:13 PM, Shankar Mane <
> shankar.m...@games24x7.com>
> > wrote:
> >
> > > *drillbit.out =>*
> > >
> > >
> > > Exception in thread "main"
> > > org.apache.drill.exec.exception.DrillbitStartupException: Failure
> during
> > > initial startup of Drillbit.
> > > at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:284)
> > > at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:261)
> > > at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:257)
> > > Caused by: java.lang.IllegalStateException:
> > > com.fasterxml.jackson.databind.JsonMappingException: Could not resolve
> > type
> > > id 'kudu' into a subtype of [simple type, class
> > > org.apache.drill.common.logical.StoragePluginConfig]: known type ids =
> > > [InfoSchemaConfig, StoragePluginConfig, SystemTablePluginConfig, file,
> > > jdbc, mock, named]
> > >  at [Source: {
> > >   "storage":{
> > > kudu : {
> > >   type:"kudu",
> > >   masterAddresses: "1.2.3.4",
> > >   enabled: false
> > > }
> > >   }
> > > }
> > > ; line: 4, column: 12] (through reference chain:
> > >
> > >
> >
> org.apache.drill.exec.planner.logical.StoragePlugins["storage"]->java.util.LinkedHashMap["kudu"])
> > > at
> > >
> > >
> >
> org.apache.drill.exec.store.StoragePluginRegistryImpl.createPlugins(StoragePluginRegistryImpl.java:182)
> > > at
> > >
> > >
> >
> org.apache.drill.exec.store.StoragePluginR

Re: unable to start Drill 1.6.0

2016-03-19 Thread Abdel Hakim Deneche
The easiest fix when Drill fails to load a storage plugin is to delete the
existing configurations. Deleting /tmp/drill/ should resolve this.

I know this may not be practical in some cases, and other developers may
give you a better solution.
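
A minimal sketch of that reset on a node, assuming the plugin configurations
are persisted under /tmp/drill as mentioned above and that losing the saved
storage plugin configurations is acceptable:

# stop the drillbit, wipe the persisted storage plugin configurations, restart
bin/drillbit.sh stop
rm -rf /tmp/drill
bin/drillbit.sh start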

On Thu, Mar 17, 2016 at 2:13 PM, Shankar Mane 
wrote:

> *drillbit.out =>*
>
>
> Exception in thread "main"
> org.apache.drill.exec.exception.DrillbitStartupException: Failure during
> initial startup of Drillbit.
> at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:284)
> at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:261)
> at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:257)
> Caused by: java.lang.IllegalStateException:
> com.fasterxml.jackson.databind.JsonMappingException: Could not resolve type
> id 'kudu' into a subtype of [simple type, class
> org.apache.drill.common.logical.StoragePluginConfig]: known type ids =
> [InfoSchemaConfig, StoragePluginConfig, SystemTablePluginConfig, file,
> jdbc, mock, named]
>  at [Source: {
>   "storage":{
> kudu : {
>   type:"kudu",
>   masterAddresses: "1.2.3.4",
>   enabled: false
> }
>   }
> }
> ; line: 4, column: 12] (through reference chain:
>
> org.apache.drill.exec.planner.logical.StoragePlugins["storage"]->java.util.LinkedHashMap["kudu"])
> at
>
> org.apache.drill.exec.store.StoragePluginRegistryImpl.createPlugins(StoragePluginRegistryImpl.java:182)
> at
>
> org.apache.drill.exec.store.StoragePluginRegistryImpl.init(StoragePluginRegistryImpl.java:126)
> at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:113)
> at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:281)
> ... 2 more
> Caused by: com.fasterxml.jackson.databind.JsonMappingException: Could not
> resolve type id 'kudu' into a subtype of [simple type, class
> org.apache.drill.common.logical.StoragePluginConfig]: known type ids =
> [InfoSchemaConfig, StoragePluginConfig, SystemTablePluginConfig, file,
> jdbc, mock, named]
>  at [Source: {
>   "storage":{
> kudu : {
>   type:"kudu",
>   masterAddresses: "1.2.3.4",
>   enabled: false
> }
>   }
> }
> ; line: 4, column: 12] (through reference chain:
>
> org.apache.drill.exec.planner.logical.StoragePlugins["storage"]->java.util.LinkedHashMap["kudu"])
> at
>
> com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:216)
> at
>
> com.fasterxml.jackson.databind.DeserializationContext.unknownTypeException(DeserializationContext.java:983)
> at
>
> com.fasterxml.jackson.databind.jsontype.impl.TypeDeserializerBase._handleUnknownTypeId(TypeDeserializerBase.java:281)
> at
>
> com.fasterxml.jackson.databind.jsontype.impl.TypeDeserializerBase._findDeserializer(TypeDeserializerBase.java:163)
> at
>
> com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedForId(AsPropertyTypeDeserializer.java:106)
> at
>
> com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:91)
> at
>
> com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:142)
> at
>
> com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringMap(MapDeserializer.java:497)
> at
>
> com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:341)
> at
>
> com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:26)
> at
>
> com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:490)
> at
>
> com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeWithErrorWrapping(BeanDeserializer.java:465)
> at
>
> com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:380)
> at
>
> com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1123)
> at
>
> com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:298)
> at
>
> com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:133)
> at
>
> com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3788)
> at
>
> com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2779)
> at
>
> org.apache.drill.exec.store.StoragePluginRegistryImpl.createPlugins(StoragePluginRegistryImpl.java:144)
> ... 5 more
>
>
>
>
>
> On Thu, Mar 17, 2016 at 6:38 PM, Shankar Mane 
> wrote:
>
> > *drillbit.out =>*
> >
> >
> > Exception in thread "main"
> > org.apache.drill.exec.exception.DrillbitStartupException: Failure during
> > initial startup of Drillbit.
> > at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:284)
> > at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:261)
> > at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:257)
> > Caused by: 

Re: Drill join performance

2016-03-18 Thread Abdel Hakim Deneche
One quick note here: I don't think partitioning the LINEORDER table on
LO_ORDERDATE would help this query. If you look at the query profile you
will see that Drill is reading everything from LINEORDER.
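
Regarding the explicit casting mentioned below: when CTAS reads CSV, every
column is written as text unless it is cast, so casting in the CTAS produces
properly typed Parquet columns. A minimal sketch of the pattern; the column
positions, names and file path are made up for illustration only:

-- cast each CSV column to its real type so Parquet stores typed columns
CREATE TABLE dfs.tpch.lineorder_typed AS
SELECT
  CAST(columns[0]  AS BIGINT) AS lo_orderkey,
  CAST(columns[2]  AS BIGINT) AS lo_custkey,
  CAST(columns[5]  AS INT)    AS lo_orderdate,
  CAST(columns[9]  AS DOUBLE) AS lo_extendedprice,
  CAST(columns[11] AS DOUBLE) AS lo_discount
FROM dfs.tpch.`lineorder.csv`;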

On Fri, Mar 18, 2016 at 7:57 AM, Dmitry Krivov  wrote:

> Just for info :
>
> After recreating the tables with explicit column CASTing, the performance of
> this query doubled (from 60 to 35 sec.).
>
> Best regards,
> Dmitry
>
> > Hello
> >
> > I have loaded (via CTAS) into Parquet files the Star Schema Benchmark
> > generated CSV data (scale factor 50).
> >
> > For one of the benchmark queries, like:
> >
> > select
> > d.d_year,
> > c.c_region,
> > sum(l.lo_extendedprice*l.lo_discount) as revenue
> > from dfs.tpch.lineorder_part l,
> >dfs.tpch.dates d,
> >dfs.tpch.customer c
> > where l.lo_orderdate = d.d_datekey
> >  and l.lo_custkey = c.c_custkey
> >  and d.d_year=1995
> > group by d.d_year, c.c_region
> > order by d.d_year desc, c.c_region asc;
> >
> > got min. exec time of 59 sec.
> >
> > Table LINEORDER have 300M rows and partitioned by LO_ORDERDATE column
> (2406
> > partitions in related parquet-files)
> > Table CUSTOMER have 1.5M rows and table DATES have 2556 rows, both tables
> > not partitioned
> >
> > Drill 1.5 conf. have :
> >
> > drill-env.sh :
> > DRILL_MAX_DIRECT_MEMORY="16G"
> > DRILL_HEAP="8G"
> >
> > sys.options changed  :
> >
> > planner.memory.max_query_memory_per_node = 8 000 000 000
> > planner.memory_limit = 1 000 000 000
> > planner.width.max_per_node = 16 (was 12 by default)
> >
> > Drill is installed on 16VCPU Linux VM and under query runtime all
> 16VCPU's
> > 100% utilized.
> >
> > Is there any chance to improve this query's exec time (maybe with some
> > additional sys.options changes)?
> >
> > Thank's!
> >
> > P.S. Just two days as starting to learn and test Apache Drill
> >
> > Best regards,
> > Dmitry
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: unable to start Drill 1.6.0

2016-03-18 Thread Abdel Hakim Deneche
Hi Shankar,

The mailing list doesn't allow attachments. Can you post the file in some
public place and share a link?

Thanks

On Thu, Mar 17, 2016 at 1:51 PM, Shankar Mane 
wrote:

> I am not able to start drill 1.6.0. Please find the attached file for more
> details.
>
>


-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Question about Text Files documentation

2016-03-16 Thread Abdel Hakim Deneche
In this documentation page:

http://drill.apache.org/docs/text-files-csv-tsv-psv/

We can read the following:

Using a distributed file system, such as HDFS, instead of a local file
> system to query the files also improves performance because currently Drill 
> *does
> not split* files on block splits.


Should it actually read: Drill *does split* files on block splits ?

-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: NumberFormatException with cast to double?

2016-03-10 Thread Abdel Hakim Deneche
Looks like the COALESCE function is the source of the problem. Passing a
double (0.0) instead of an int (0) as a second expression solved the
problem for me:

CAST(COALESCE(t_total, 0.0) AS double)
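
For completeness, a sketch of the same fix applied to one of the columns from
the query below; the file path follows the plan shown below, so adjust the
workspace as needed for your setup:

-- the fallback literal is a double (0.0) rather than an int (0), per the fix above
SELECT CAST(COALESCE(t_total, 0.0) AS DOUBLE) AS t_total
FROM dfs.`/caspr/csv/smalltest.csv`;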


On Fri, Mar 11, 2016 at 12:45 AM, Matt  wrote:

> ~~~
> 00-01  Project(date_tm=[CAST($23):TIMESTAMP(0)],
> id_1=[CAST($11):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE
> "ISO-8859-1$en_US$primary"], id_2=[CAST($15):VARCHAR(1) CHARACTER SET
> "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"],
> id_3=[CAST($33):VARCHAR(1) CHARACTER SET "ISO-8859-1" COLLATE
> "ISO-8859-1$en_US$primary"], b_total=[CAST(CASE(IS NOT NULL($4), $4,
> 0)):BIGINT], t_total=[CAST(CASE(IS NOT NULL($31), $31, 0)):DOUBLE],
> h_total=[CAST(CASE(IS NOT NULL($40), $40, 0)):BIGINT],
> b_small=[CAST(CASE(IS NOT NULL($36), $36, 0)):BIGINT],
> t_small=[CAST(CASE(IS NOT NULL($14), $14, 0)):DOUBLE],
> h_small=[CAST(CASE(IS NOT NULL($38), $38, 0)):BIGINT],
> b_18000=[CAST(CASE(IS NOT NULL($32), $32, 0)):BIGINT],
> t_18000=[CAST(CASE(IS NOT NULL($24), $24, 0)):DOUBLE],
> h_18000=[CAST(CASE(IS NOT NULL($27), $27, 0)):BIGINT],
> b_12000=[CAST(CASE(IS NOT NULL($30), $30, 0)):BIGINT],
> t_12000=[CAST(CASE(IS NOT NULL($28), $28, 0)):DOUBLE],
> h_12000=[CAST(CASE(IS NOT NULL($20), $20, 0)):BIGINT], b_6000=[CAST(CASE(IS
> NOT NULL($41), $41, 0)):BIGINT], t_6000=[CAST(CASE(IS NOT NULL($37), $37,
> 0)):DOUBLE], h_6000=[CAST(CASE(IS NOT NULL($29), $29, 0)):BIGINT],
> b_3000=[CAST(CASE(IS NOT NULL($17), $17, 0)):BIGINT], t_3000=[CAST(CASE(IS
> NOT NULL($7), $7, 0)):DOUBLE], h_3000=[CAST(CASE(IS NOT NULL($1), $1,
> 0)):BIGINT], b_2000=[CAST(CASE(IS NOT NULL($26), $26, 0)):BIGINT],
> t_2000=[CAST(CASE(IS NOT NULL($34), $34, 0)):DOUBLE], h_2000=[CAST(CASE(IS
> NOT NULL($10), $10, 0)):BIGINT], b_1500=[CAST(CASE(IS NOT NULL($42), $42,
> 0)):BIGINT], t_1500=[CAST(CASE(IS NOT NULL($13), $13, 0)):DOUBLE],
> h_1500=[CAST(CASE(IS NOT NULL($3), $3, 0)):BIGINT], b_1250=[CAST(CASE(IS
> NOT NULL($21), $21, 0)):BIGINT], t_1250=[CAST(CASE(IS NOT NULL($25), $25,
> 0)):DOUBLE], h_1250=[CAST(CASE(IS NOT NULL($16), $16, 0)):BIGINT],
> b_1000=[CAST(CASE(IS NOT NULL($12), $12, 0)):BIGINT], t_1000=[CAST(CASE(IS
> NOT NULL($19), $19, 0)):DOUBLE], h_1000=[CAST(CASE(IS NOT NULL($6), $6,
> 0)):BIGINT], b_750=[CAST(CASE(IS NOT NULL($9), $9, 0)):BIGINT],
> t_750=[CAST(CASE(IS NOT NULL($0), $0, 0)):DOUBLE], h_750=[CAST(CASE(IS NOT
> NULL($5), $5, 0)):BIGINT], b_500=[CAST(CASE(IS NOT NULL($2), $2,
> 0)):BIGINT], t_500=[CAST(CASE(IS NOT NULL($8), $8, 0)):DOUBLE],
> h_500=[CAST(CASE(IS NOT NULL($39), $39, 0)):BIGINT], b_0=[CAST(CASE(IS NOT
> NULL($18), $18, 0)):BIGINT], t_0=[CAST(CASE(IS NOT NULL($35), $35,
> 0)):DOUBLE], EXPR$42=[CAST(CASE(IS NOT NULL($22), $22, 0)):BIGINT])
> 00-02Scan(groupscan=[EasyGroupScan
> [selectionRoot=hdfs://es05:54310/caspr/csv/smalltest.csv, numFiles=1,
> columns=[`date_tm`, `id_1`, `id_2`, `id_3`, `b_total`, `t_total`,
> `h_total`, `b_small`, `t_small`, `h_small`, `b_18000`, `t_18000`,
> `h_18000`, `b_12000`, `t_12000`, `h_12000`, `b_6000`, `t_6000`, `h_6000`,
> `b_3000`, `t_3000`, `h_3000`, `b_2000`, `t_2000`, `h_2000`, `b_1500`,
> `t_1500`, `h_1500`, `b_1250`, `t_1250`, `h_1250`, `b_1000`, `t_1000`,
> `h_1000`, `b_750`, `t_750`, `h_750`, `b_500`, `t_500`, `h_500`, `b_0`,
> `t_0`, `h_0`], files=[hdfs://es05:54310/caspr/csv/smalltest.csv]]])
> ~~~
>
> ~~~
> {
>   "head" : {
> "version" : 1,
> "generator" : {
>   "type" : "ExplainHandler",
>   "info" : ""
> },
> "type" : "APACHE_DRILL_PHYSICAL",
> "options" : [ {
>   "kind" : "STRING",
>   "type" : "SESSION",
>   "name" : "store.format",
>   "string_val" : "parquet"
> }, {
>   "kind" : "BOOLEAN",
>   "type" : "SESSION",
>   "name" : "exec.errors.verbose",
>   "bool_val" : true
> } ],
> "queue" : 0,
> "resultMode" : "EXEC"
>   },
>   "graph" : [ {
> "pop" : "fs-scan",
> "@id" : 2,
> "userName" : "hduser",
> "files" : [ "hdfs://es05:54310/caspr/csv/smalltest.csv" ],
> "storage" : {
>   "type" : "file",
>   "enabled" : true,
>   "connection" : "hdfs://es05:54310",
>   "workspaces" : {
> "root" : {
>   "location" : "/",
>   "writable" : true,
>   "defaultInputFormat" : null
> },
> "tmp" : {
>   "location" : "/tmp/",
>   "writable" : true,
>   "defaultInputFormat" : null
> },
> "caspr" : {
>   "location" : "/caspr",
>   "writable" : true,
>   "defaultInputFormat" : "csv"
> }
>   "ref" : "`t_3000`",
>   "expr" : "cast( ( ( if (isnotnull(`t_3000`)  ) then (`t_3000` )
> else (0 )  end  )  ) as FLOAT8 )"
> }, {
>   "ref" : "`h_3000`",
>   "expr" : "cast( ( ( if (isnotnull(`h_3000`)  ) then (`h_3000` )
> else (0 )  end  )  ) as BIGINT )"
> }, {
>   "ref" : "`b_2000`",
>   "expr" : "cast( ( ( if (isnotnull(`b_2000`)  ) then (`b_2000` )

Re: Drill with String Aggregation

2016-03-08 Thread Abdel Hakim Deneche
You can always develop a User Defined Aggregate Function:

http://drill.apache.org/docs/develop-custom-functions/

Thanks

On Wed, Mar 9, 2016 at 12:29 AM, Bosung Seo  wrote:

> Hello,
>
> I found that Drill doesn't support string_agg function yet.
> Is there another way to query as the string_agg function?
>
> If I have a table,
> user   | id
> user1 | 1
> user1 | 2
> user1 | 3
> user2 | 1
>
> I want to make like this.
> user   | ids
> user1 | 1,2,3
> user2 | 1
>
> Any help would be appreciated.
>
> Thanks,
> Bo
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: The praises for Drill

2016-02-26 Thread Abdel Hakim Deneche
Looking forward to reading the paper!

On Fri, Feb 26, 2016 at 10:19 AM, Parth Chandra  wrote:

> Welcome back Edmon, and thanks for the praise :). Hope to see you on the
> next hangout.
>
> On Thu, Feb 25, 2016 at 7:27 PM, Edmon Begoli  wrote:
>
> > Hello fellow Driilers,
> >
> > I have been inactive on the development side of the project, as we got
> busy
> > being heavy/power users of the Drill in the last few months.
> >
> > I just want to share some great experiences with the latest versions of
> > Drill.
> >
> > Just tonight, as we were scrambling to meet the deadline, we were able to
> > query two years of flat psv files of claims/billing and clinical data in
> > Drill in less than 60 seconds.
> >
> > No ETL, no warehousing - just plain SQL against tons of files. Run SQL,
> get
> > results.
> >
> > Amazing!
> >
> > We have also done some much more important things too, and we had a paper
> > accepted to Big Data Services about the experiences. The co-author of the
> > paper is Drill's own Dr. Ted Dunning :-)
> > I will share it once it is published.
> >
> > Anyway, cheers to all, and hope to re-join the dev activities soon.
> >
> > Best,
> > Edmon
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Drill error with large sort

2016-02-25 Thread Abdel Hakim Deneche
Not so short answer:

In Drill 1.5 (I assume you are using 1.5) we have an improved allocator
that better tracks how much memory each operator is using. In your case it
seems that the data has very wide columns that are causing Sort to choke on
the very first batch of data (1024 records taking up 224MB!!!) because it's
way more than its memory limit (around 178MB in your particular case).
Drill uses a fancy equation to compute this limit and increasing the
aforementioned option will increase the sort limit. More details here:

http://drill.apache.org/docs/configuring-drill-memory/

On Thu, Feb 25, 2016 at 5:26 PM, Abdel Hakim Deneche <adene...@maprtech.com>
wrote:

> Short answer:
>
> increase the value of planner.memory.max_query_memory_per_node, by default
> it's set to 2GB, try setting to 4 or even 8GB. This should get the query to
> pass.
>
> On Thu, Feb 25, 2016 at 5:24 PM, Jeff Maass <jma...@cccis.com> wrote:
>
>>
>> If you are open to changing the query:
>>   # try removing the functions on the 5th column
>>   # is there any way you could further limit the query?
>>   # does the query finish if u add a limit / top clause?
>>   # what do the logs say?
>>
>> 
>> From: Paul Friedman <paul.fried...@streetlightdata.com>
>> Sent: Thursday, February 25, 2016 7:07:12 PM
>> To: user@drill.apache.org
>> Subject: Drill error with large sort
>>
>> I’ve got a query reading from a large directory of parquet files (41 GB)
>> and I’m consistently getting this error:
>>
>>
>>
>> Error: RESOURCE ERROR: One or more nodes ran out of memory while executing
>> the query.
>>
>>
>>
>> Unable to allocate sv2 for 1023 records, and not enough batchGroups to
>> spill.
>>
>> batchGroups.size 0
>>
>> spilledBatchGroups.size 0
>>
>> allocated memory 224287987
>>
>> allocator limit 178956970
>>
>> Fragment 0:0
>>
>>
>>
>> [Error Id: 878d604c-4656-4a5a-8b46-ff38a6ae020d on
>> chai.dev.streetlightdata.com:31010] (state=,code=0)
>>
>>
>>
>> Direct memory is set to 48GB and heap is 8GB.
>>
>>
>>
>> The query is:
>>
>>
>>
>> select probe_id, provider_id, is_moving, mode,  cast(convert_to(points,
>> 'JSON') as varchar(1))
>>
>> from dfs.`/home/paul/data`
>>
>> where
>>
>> start_lat between 24.4873780449008 and 60.0108911181433 and
>>
>> start_lon between -139.065890469841 and -52.8305074899881 and
>>
>> provider_id = '343' and
>>
>> mod(abs(hash(probe_id)),  100) = 0
>>
>> order by probe_id, start_time;
>>
>>
>>
>> I’m also using the “example” drill-override configuration.
>>
>>
>>
>> Any help would be appreciated.
>>
>>
>>
>> Thanks.
>>
>>
>>
>> ---Paul
>>
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
>
> Now Available - Free Hadoop On-Demand Training
> <http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available>
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available>


Re: Date Format conversion

2016-02-23 Thread Abdel Hakim Deneche
More precisely, you can use TO_DATE. The following worked for me:

TO_DATE('01/25/2016', 'MM/dd/yyyy')
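
The same pattern works on a column; a small sketch against a headerless CSV,
where the file name and column position are only illustrative:

-- convert mm/dd/yyyy strings in the first CSV column into DATE values
SELECT TO_DATE(columns[0], 'MM/dd/yyyy') AS event_date
FROM dfs.`/data/dates.csv`;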



On Tue, Feb 23, 2016 at 10:48 AM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:

> Please refer to https://drill.apache.org/docs/date-time-and-timestamp/
>
> On Tue, Feb 23, 2016 at 10:41 AM, Wilburn, Scott <
> scott.wilb...@verizonwireless.com.invalid> wrote:
>
> > Hello,
> > Is there a way to convert other date formats into the DATE type? I would
> > like to cast a field that contains a date in the format mm/dd/yyyy, i.e.
> > "01/25/2016", into DATE type.
> >
> > Thanks,
> > Scott Wilburn
> >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Drill join performance

2016-02-22 Thread Abdel Hakim Deneche
Hello Dmitry,

Welcome to Drill's community :)

What version of Drill are you using ?
Also, can you share the query profile of your query, it helps to show what
taking most of the time.

Thanks

On Mon, Feb 22, 2016 at 10:54 AM, Dmitry Krivov 
wrote:

> Hello
>
> I have loaded (via CTAS) into Parquet files the Star Schema Benchmark
> generated CSV data (scale factor 50).
>
> For one of the benchmark queries, like:
>
> select
> d.d_year,
> c.c_region,
> sum(l.lo_extendedprice*l.lo_discount) as revenue
> from dfs.tpch.lineorder_part l,
>dfs.tpch.dates d,
>dfs.tpch.customer c
> where l.lo_orderdate = d.d_datekey
>  and l.lo_custkey = c.c_custkey
>  and d.d_year=1995
> group by d.d_year, c.c_region
> order by d.d_year desc, c.c_region asc;
>
> got min. exec time of 59 sec.
>
> Table LINEORDER have 300M rows and partitioned by LO_ORDERDATE column (2406
> partitions in related parquet-files)
> Table CUSTOMER have 1.5M rows and table DATES have 2556 rows, both tables
> not partitioned
>
> Drill 1.5 conf. have :
>
> drill-env.sh :
> DRILL_MAX_DIRECT_MEMORY="16G"
> DRILL_HEAP="8G"
>
> sys.options changed  :
>
> planner.memory.max_query_memory_per_node = 8 000 000 000
> planner.memory_limit = 1 000 000 000
> planner.width.max_per_node = 16 (was 12 by default)
>
> Drill is installed on 16VCPU Linux VM and under query runtime all 16VCPU's
> 100% utilized.
>
> Is there any chance to improve this query exectime  (my be with some
> additional sys.options changes) ?
>
> Thank's!
>
> P.S. Just two days as starting to learn and test Apache Drill
>
> Best regards,
> Dmitry
>



-- 

Abdelhakim Deneche

Software Engineer

  





Re: what am I missing?

2016-02-22 Thread Abdel Hakim Deneche
To run Drill in distributed mode you need to have Zookeeper up and running.
This shouldn't be too complicated, you can find more details here:

https://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html#sc_InstallingSingleMode

On my Mac I used brew and it took care of everything.
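
Once sqlline does connect, one quick sanity check (a suggestion only) is to
ask Drill which drillbits have registered with ZooKeeper:

SELECT * FROM sys.drillbits;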

On Mon, Feb 22, 2016 at 8:40 AM, Ted Schwartz 
wrote:

> I'm new to drill and trying to get up and running. My goal is to access
> drill from a JDBC client. I'm a bit confused when starting drill. If I use
> drill-embedded, it appears it only allows one connection and that
> connection is started along with drill.  So if instead I try to run in
> distributed mode, I have modified drill-override.conf like this:
>
> drill.exec: {
>   cluster-id: "drillbits1",
>   zk.connect: "myhost:2181"
> }
>
> and start drillbit(?)  with this:
>
> drillbit.sh start
>
> I get a "starting drillbit..." message, but cannot connect in any way to
> it.
>
> If I try to connect using sqlline, it fails with "Connection timed out":
>
> sqlline -u jdbc:drill:zk=myhost:2181
>
> Same problems if I try to connect using the drill-conf utility (which
> appears to be nothing other than an invocation of sqlline)
>
> No matter how I start drill, it doesn't appear any ports are opened for
> the process. For example, `netstat -a | grep 2181` doesn't yield any
> results, and the http port 8047 that works when I start in embedded mode
> doesn't work in distributed mode.
>
> I feel like I am missing something fundamental to all of this, although
> I'm trying to follow the Getting Started documentation.  I've seen
> references to "making sure you start Zookeeper". How do I do that? I find
> lots of details about starting drillbit, but nothing about starting
> Zookeeper.
>
> Thanks in advance for any clue that can help me move forward.




-- 

Abdelhakim Deneche

Software Engineer

  





Re: One single query for more files JSON

2016-02-21 Thread Abdel Hakim Deneche
Paolo,

Hanifi created a JIRA for this issue:

https://issues.apache.org/jira/browse/DRILL-4416

A fix is available and should go into master soon.

Thanks

On Sun, Feb 21, 2016 at 2:52 AM, Paolo Spanevello <paolosp...@gmail.com>
wrote:

> Dear Hanifi,
>
> Yes is windows. the data is residing in a folder of my local pc. I'm using
> this Storage Plugins as a copy of dfs storage plugins.
> I do not know how to use  hdfs.
>
> Any idea?
>
> Best,
> Paolo
>
> {
>   "type": "file",
>   "enabled": true,
>   "connection": "file:///",
>   "workspaces": {
> "root": {
>   "location": "/",
>   "writable": false,
>   "defaultInputFormat": null
> },
> "tmp": {
>   "location": "/tmp",
>   "writable": true,
>   "defaultInputFormat": null
> },
> "GC": {
>   "location": "/Users/user1/AppData/Local/folder",
>   "writable": true,
>   "defaultInputFormat": null
> }
>   },
>   "formats": {
> "psv": {
>   "type": "text",
>   "extensions": [
> "tbl"
>   ],
>   "delimiter": "|"
> },
> "csv": {
>   "type": "text",
>   "extensions": [
> "csv"
>   ],
>   "delimiter": ","
> },
> "tsv": {
>   "type": "text",
>   "extensions": [
> "tsv"
>   ],
>   "delimiter": "\t"
> },
> "parquet": {
>   "type": "parquet"
> },
> "json": {
>   "type": "json"
> },
> "avro": {
>   "type": "avro"
> },
> "sequencefile": {
>   "type": "sequencefile",
>   "extensions": [
> "seq"
>   ]
> },
> "csvh": {
>   "type": "text",
>   "extensions": [
> "csvh"
>   ],
>   "extractHeader": true,
>   "delimiter": ","
> }
>   }
> }
>
> 2016-02-20 0:15 GMT+01:00 Hanifi Gunes <hgu...@maprtech.com>:
>
> > Paolo,
> >
> > I understand that your platform is windows. Where is your data residing
> > though? hdfs?
> >
> > Thanks.
> > -Hanifi
> >
> > On Fri, Feb 19, 2016 at 2:43 PM, Abdel Hakim Deneche <
> > adene...@maprtech.com>
> > wrote:
> >
> > > I think you are hitting the following issue:
> > >
> > >
> > >
> >
> http://stackoverflow.com/questions/10336293/splitting-filenames-using-system-file-separator-symbol
> > >
> > > Can you open a JIRA for this, shouldn't take long to fix (hopefully)
> > >
> > > On Fri, Feb 19, 2016 at 1:38 PM, Paolo Spanevello <
> paolosp...@gmail.com>
> > > wrote:
> > >
> > > > Dear all,
> > > >
> > > > this is the error the occour when i run the query:
> > > >
> > > > SELECT * FROM
> `Performance_Ride_Analytics`.`GC`.`./*/cache/rideDB.json`
> > > >
> > > > Could you support me?
> > > >
> > > > Best,
> > > > Paolo
> > > >
> > > >VALIDATION ERROR: Unexpected internal error near index 1
> > > > \
> > > >  ^
> > > >
> > > >
> > > >
> > > >   (org.apache.calcite.tools.ValidationException)
> > > > java.util.regex.PatternSyntaxException: Unexpected internal error
> near
> > > > index 1
> > > > \
> > > >  ^
> > > > org.apache.calcite.prepare.PlannerImpl.validate():189
> > > > org.apache.calcite.prepare.PlannerImpl.validateAndGetType():198
> > > >
> > > >
> > >
> >
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.validateNode():451
> > > >
> > > >
> > >
> >
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.validateAndConvert():198
> > > >
> > > >
> > >
> >
> org.apache.drill.exec.planner.sql.handlers.DefaultSq

Re: One single query for more files JSON

2016-02-19 Thread Abdel Hakim Deneche
:35 GMT+01:00 Jinfeng Ni <jinfengn...@gmail.com>:
>
> > Could you please turn on verbose error mode by running the following
> > [1] then re-run the query and post the error message,  or copy & post
> > the drillbit log when the query was executed?
> >
> > ALTER SESSION SET `exec.errors.verbose` = true;
> >
> > That will give us more information about the error you saw. From the
> > error message you posted, it's not clear whether the JSON data itself
> > is not well formated, or Drill hit an execution bug.
> >
> > Thanks,
> >
> >
> > [1] https://drill.apache.org/docs/troubleshooting/#enable-verbose-errors
> >
> >
> >
> > On Tue, Feb 16, 2016 at 2:45 AM, Paolino <paolosp...@gmail.com> wrote:
> > > Dear all,
> > > Someone could support me with this drill?
> > >
> > > Best regards,
> > > Paolo
> > >
> > > - Messaggio originale -
> > > Da: "Paolo Spanevello" <paolosp...@gmail.com>
> > > Inviato: ‎12/‎02/‎2016 21:15
> > > A: "user" <user@drill.apache.org>
> > > Oggetto: Re: One single query for more files JSON
> > >
> > > yes with windows
> > >
> > >
> > > 2016-02-12 17:11 GMT+01:00 Zelaine Fong <zf...@maprtech.com>:
> > >
> > > Are you running on Windows?  If so, perhaps this is DRILL-4305?
> > >
> > > -- Zelaine
> > >
> > > On Fri, Feb 12, 2016 at 8:00 AM, Paolo Spanevello <
> paolosp...@gmail.com>
> > >
> > > wrote:
> > >
> > >> The schema and the files are the same. This is the error:
> > >>
> > >> ERROR [HY000] [MapR][Drill] (1040) Drill failed to execute the query:
> > >> SELECT * FROM
> `Performance_Ride_Analytics`.`GC`.`./*/cache/rideDB.json`
> > >> [30027]Query execution error. Details:[
> > >> VALIDATION ERROR: Unexpected internal error near index 1
> > >> \
> > >>  ^
> > >>
> > >>
> > >> 2016-02-12 16:43 GMT+01:00 Abdel Hakim Deneche <adene...@maprtech.com
> >:
> > >>
> > >> > Of course, if the schema changes between files, this will most
> likely
> > >> cause
> > >> > the query to fail
> > >> >
> > >> > On Fri, Feb 12, 2016 at 7:42 AM, Abdel Hakim Deneche <
> > >> > adene...@maprtech.com>
> > >> > wrote:
> > >> >
> > >> > > Yes, it should work.
> > >> > >
> > >> > > On Fri, Feb 12, 2016 at 7:31 AM, Paolo Spanevello <
> > >> paolosp...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > >> Dear All,
> > >> > >>
> > >> > >> Could i've a single query for more json files ?
> > >> > >>
> > >> > >> Example:
> > >> > >>
> > >> > >>- /user/folder1/file1.json
> > >> > >>- /user/folder2/file2.json
> > >> > >>- /user/folder3/file3.json
> > >> > >>
> > >> > >> Query:
> > >> > >> SELECT * FROM /user/*/file*.json
> > >> > >>
> > >> > >> Thanks a lot for your support.
> > >> > >>
> > >> > >> Best,
> > >> > >> Paolo
> > >> > >>
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > >
> > >> > > Abdelhakim Deneche
> > >> > >
> > >> > > Software Engineer
> > >> > >
> > >> > >   <http://www.mapr.com/>
> > >> > >
> > >> > >
> > >> > > Now Available - Free Hadoop On-Demand Training
> > >> > > <
> > >> >
> > >>
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > >> > >
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> >
> > >> > Abdelhakim Deneche
> > >> >
> > >> > Software Engineer
> > >> >
> > >> >   <http://www.mapr.com/>
> > >> >
> > >> >
> > >> > Now Available - Free Hadoop On-Demand Training
> > >> > <
> > >> >
> > >>
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > >> > >
> > >> >
> > >>
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>




Re: Drill Doc Question: Multi Tenant Clusters

2016-02-15 Thread Abdel Hakim Deneche
Someone may want to confirm this, but I think Drill will properly set the
default value (num cores x .7), and it will be specific to every node, but
when you query the option from sys.options, it will show you the value on
the "foreman" node for that specific query.
Once you set it manually using ALTER it will be the same value for all
nodes.

I don't think there is a way, for now, to change this using drill-override,
you may want to create a JIRA for this.

Again, someone else may want to give more informed advice here, but
setting this option to 5 (6 cores x .7) will help limit how much CPU Drill
will use. Please be aware that this option only controls the "width"
of the queries, but you may still end up with more threads running
simultaneously in various stages of a query; for example, Drill can spawn up
to 16 threads when it's reading parquet metadata during planning. There is
ongoing work to improve Drill's resource management.
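
A sketch of how that could be applied, using the value suggested above (the
right number is a judgment call for the cluster, not something computed here):

-- check the current value first
SELECT * FROM sys.options WHERE name = 'planner.width.max_per_node';

-- apply the cap to every drillbit rather than a single session
ALTER SYSTEM SET `planner.width.max_per_node` = 5;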

On Mon, Feb 15, 2016 at 11:41 AM, John Omernik <j...@omernik.com> wrote:

> Drill did not automatically set that, it set it to 12, which is likely .7
> or close to it on a 16 core machine, but I have 7 nodes, with different
> cores, so is this setting per drill-bit or is it a cluster wide setting?
> Is it possible to set this in the drill-overide based on the node itself,
> or does drill handle that for us, and if I do a ALTER SESSION then it
> changes thing cluster wide?
>
> The reason I am asking is I am running this in Marathon, and assigning 6
> Cores to each Drill bit.  (this is a resource constrained cluster).  Since
> I am using CGROUPs, as I understand it,  if there is CPU contention, then
> cgroups will limit drill to 6 shares, otherwise it will allow drill to use
> more cores.
>
> So as it pertains to this setting, should I set it to the number of cores
> per node (as it's likely setting it now) or should use the number CPU
> shares I am setting... and if I am doing cores per node, how do I handle
> different sized nodes (16 core nodes vs 24 core nodes for example)
>
>
>
> On Mon, Feb 15, 2016 at 1:37 PM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > so yes, you are correct, you should set it to 1 x 32 x 0.7
> >
> > Btw, Drill should already have set this option to 32 x 0.7
> >
> > On Mon, Feb 15, 2016 at 11:36 AM, Abdel Hakim Deneche <
> > adene...@maprtech.com
> > > wrote:
> >
> > > Don't be, it took me quite some time to figure out this one either =P
> > >
> > > the "number of active drillbits" refers to the number of Drillbits
> > running
> > > on each node of the cluster. Generally, you have 1 active Drillbit per
> > node.
> > >
> > > On Mon, Feb 15, 2016 at 11:22 AM, John Omernik <j...@omernik.com>
> wrote:
> > >
> > >> I am really sorry for being dense here, but based on your comment
> then,
> > >> and
> > >> the docs then if you had sixteen 32 core machines, but only one drill
> > bit
> > >> running per node, you'd still use 1 (one drill bit per node) * 32 (the
> > >> number of cores) * 0.7 (the modifier in the docs) to get 23 as the
> > number
> > >> to set for planner.width_max_per_node  Not 16 * 32 * 0.7.  A reading
> of
> > >> the
> > >> docs is confusing (see below) you can read that as number of active
> > drill
> > >> bits, which on a sixteen node cluster, one per node would be 16 * 32
> > >> (cores
> > >> per node) * 0.7.  But I think you are saying that we should be taking
> 1
> > >> drill bit per node * 32 * 0.7 ... correct?
> > >>
> > >> Quote from the docs:
> > >> number of active drillbits (typically one per node) * number of cores
> > per
> > >> node * 0.7
> > >>
> > >> On Mon, Feb 15, 2016 at 11:15 AM, Abdel Hakim Deneche <
> > >> adene...@maprtech.com
> > >> > wrote:
> > >>
> > >> > No, it's the maximum number of threads each drillbit will be able to
> > >> spawn
> > >> > for every major fragment of a query.
> > >> >
> > >> > If you run a query on a cluster of 32 core machines, and the query
> > plan
> > >> > contains multiple major fragments, each major fragment will have "at
> > >> most"
> > >> > 32 x 0.7= 23 minor fragments (or threads) running in parallel on
> every
> > >> > drillbit. The "at most" is important here, as other factors limit
> how
> > >> many
> > >> > minor fragments c

Re: Drill Doc Question: Multi Tenant Clusters

2016-02-15 Thread Abdel Hakim Deneche
so yes, you are correct, you should set it to 1 x 32 x 0.7

Btw, Drill should already have set this option to 32 x 0.7

On Mon, Feb 15, 2016 at 11:36 AM, Abdel Hakim Deneche <adene...@maprtech.com
> wrote:

> Don't be, it took me quite some time to figure out this one either =P
>
> the "number of active drillbits" refers to the number of Drillbits running
> on each node of the cluster. Generally, you have 1 active Drillbit per node.
>
> On Mon, Feb 15, 2016 at 11:22 AM, John Omernik <j...@omernik.com> wrote:
>
>> I am really sorry for being dense here, but based on your comment then,
>> and
>> the docs then if you had sixteen 32 core machines, but only one drill bit
>> running per node, you'd still use 1 (one drill bit per node) * 32 (the
>> number of cores) * 0.7 (the modifier in the docs) to get 23 as the number
>> to set for planner.width_max_per_node  Not 16 * 32 * 0.7.  A reading of
>> the
>> docs is confusing (see below) you can read that as number of active drill
>> bits, which on a sixteen node cluster, one per node would be 16 * 32
>> (cores
>> per node) * 0.7.  But I think you are saying that we should be taking 1
>> drill bit per node * 32 * 0.7 ... correct?
>>
>> Quote from the docs:
>> number of active drillbits (typically one per node) * number of cores per
>> node * 0.7
>>
>> On Mon, Feb 15, 2016 at 11:15 AM, Abdel Hakim Deneche <
>> adene...@maprtech.com
>> > wrote:
>>
>> > No, it's the maximum number of threads each drillbit will be able to
>> spawn
>> > for every major fragment of a query.
>> >
>> > If you run a query on a cluster of 32 core machines, and the query plan
>> > contains multiple major fragments, each major fragment will have "at
>> most"
>> > 32 x 0.7= 23 minor fragments (or threads) running in parallel on every
>> > drillbit. The "at most" is important here, as other factors limit how
>> many
>> > minor fragments can run in parallel, for example nature and size of the
>> > data.
>> >
>> > On Mon, Feb 15, 2016 at 7:41 AM, John Omernik <j...@omernik.com> wrote:
>> >
>> > > *
>> > >
>> >
>> https://drill.apache.org/docs/configuring-resources-for-a-shared-drillbit/#configuring-query-queuing
>> > > <
>> > >
>> >
>> https://drill.apache.org/docs/configuring-resources-for-a-shared-drillbit/#configuring-query-queuing
>> > > >*
>> > >
>> > >
>> > > *On this page, on the setting planner.width.max_per_node it says the
>> > > below.  In the equation, of number of active drillbits * number of
>> cores
>> > > per node * 0.7,  is the number of active drillbits the number of drill
>> > bits
>> > > PER NODE (as this setting is per node) or is that the number of active
>> > > drill bits per cluster?  The example is unclear because it only shows
>> an
>> > > example on a single node cluster.  (Typically 1 per node doesn't
>> clarify
>> > > whether that number should be per node or per drill bit)*
>> > >
>> > > *Thanks!*
>> > >
>> > >
>> > >
>> > > The maximum width per node defines the maximum degree of parallelism
>> for
>> > > any fragment of a query, but the setting applies at the level of a
>> single
>> > > node in the cluster. The *default* maximum degree of parallelism per
>> node
>> > > is calculated as follows, with the theoretical maximum automatically
>> > scaled
>> > > back (and rounded down) so that only 70% of the actual available
>> capacity
>> > > is taken into account: number of active drillbits (typically one per
>> > node)
>> > > * number of cores per node * 0.7
>> > >
>> > > For example, on a single-node test system with 2 cores and
>> > hyper-threading
>> > >
>> > > enabled: 1 * 4 * 0.7 = 3
>> > >
>> >
>> >
>> >
>> > --
>> >
>> > Abdelhakim Deneche
>> >
>> > Software Engineer
>> >
>> >   <http://www.mapr.com/>
>> >
>> >
>> > Now Available - Free Hadoop On-Demand Training
>> > <
>> >
>> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
>> > >
>> >
>>
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
>
> Now Available - Free Hadoop On-Demand Training
> <http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available>
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>




Re: Drill Doc Question: Multi Tenant Clusters

2016-02-15 Thread Abdel Hakim Deneche
Don't be, it took me quite some time to figure out this one either =P

the "number of active drillbits" refers to the number of Drillbits running
on each node of the cluster. Generally, you have 1 active Drillbit per node.

On Mon, Feb 15, 2016 at 11:22 AM, John Omernik <j...@omernik.com> wrote:

> I am really sorry for being dense here, but based on your comment then, and
> the docs then if you had sixteen 32 core machines, but only one drill bit
> running per node, you'd still use 1 (one drill bit per node) * 32 (the
> number of cores) * 0.7 (the modifier in the docs) to get 23 as the number
> to set for planner.width_max_per_node  Not 16 * 32 * 0.7.  A reading of the
> docs is confusing (see below) you can read that as number of active drill
> bits, which on a sixteen node cluster, one per node would be 16 * 32 (cores
> per node) * 0.7.  But I think you are saying that we should be taking 1
> drill bit per node * 32 * 0.7 ... correct?
>
> Quote from the docs:
> number of active drillbits (typically one per node) * number of cores per
> node * 0.7
>
> On Mon, Feb 15, 2016 at 11:15 AM, Abdel Hakim Deneche <
> adene...@maprtech.com
> > wrote:
>
> > No, it's the maximum number of threads each drillbit will be able to
> spawn
> > for every major fragment of a query.
> >
> > If you run a query on a cluster of 32 core machines, and the query plan
> > contains multiple major fragments, each major fragment will have "at
> most"
> > 32 x 0.7= 23 minor fragments (or threads) running in parallel on every
> > drillbit. The "at most" is important here, as other factors limit how
> many
> > minor fragments can run in parallel, for example nature and size of the
> > data.
> >
> > On Mon, Feb 15, 2016 at 7:41 AM, John Omernik <j...@omernik.com> wrote:
> >
> > > *
> > >
> >
> https://drill.apache.org/docs/configuring-resources-for-a-shared-drillbit/#configuring-query-queuing
> > > <
> > >
> >
> https://drill.apache.org/docs/configuring-resources-for-a-shared-drillbit/#configuring-query-queuing
> > > >*
> > >
> > >
> > > *On this page, on the setting planner.width.max_per_node it says the
> > > below.  In the equation, of number of active drillbits * number of
> cores
> > > per node * 0.7,  is the number of active drillbits the number of drill
> > bits
> > > PER NODE (as this setting is per node) or is that the number of active
> > > drill bits per cluster?  The example is unclear because it only shows
> an
> > > example on a single node cluster.  (Typically 1 per node doesn't
> clarify
> > > whether that number should be per node or per drill bit)*
> > >
> > > *Thanks!*
> > >
> > >
> > >
> > > The maximum width per node defines the maximum degree of parallelism
> for
> > > any fragment of a query, but the setting applies at the level of a
> single
> > > node in the cluster. The *default* maximum degree of parallelism per
> node
> > > is calculated as follows, with the theoretical maximum automatically
> > scaled
> > > back (and rounded down) so that only 70% of the actual available
> capacity
> > > is taken into account: number of active drillbits (typically one per
> > node)
> > > * number of cores per node * 0.7
> > >
> > > For example, on a single-node test system with 2 cores and
> > hyper-threading
> > >
> > > enabled: 1 * 4 * 0.7 = 3
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>




Re: Drill Doc Question: Multi Tenant Clusters

2016-02-15 Thread Abdel Hakim Deneche
No, it's the maximum number of threads each drillbit will be able to spawn
for every major fragment of a query.

If you run a query on a cluster of 32 core machines, and the query plan
contains multiple major fragments, each major fragment will have "at most"
32 x 0.7= 23 minor fragments (or threads) running in parallel on every
drillbit. The "at most" is important here, as other factors limit how many
minor fragments can run in parallel, for example nature and size of the
data.

On Mon, Feb 15, 2016 at 7:41 AM, John Omernik  wrote:

> *
> https://drill.apache.org/docs/configuring-resources-for-a-shared-drillbit/#configuring-query-queuing
> <
> https://drill.apache.org/docs/configuring-resources-for-a-shared-drillbit/#configuring-query-queuing
> >*
>
>
> *On this page, on the setting planner.width.max_per_node it says the
> below.  In the equation, of number of active drillbits * number of cores
> per node * 0.7,  is the number of active drillbits the number of drill bits
> PER NODE (as this setting is per node) or is that the number of active
> drill bits per cluster?  The example is unclear because it only shows an
> example on a single node cluster.  (Typically 1 per node doesn't clarify
> whether that number should be per node or per drill bit)*
>
> *Thanks!*
>
>
>
> The maximum width per node defines the maximum degree of parallelism for
> any fragment of a query, but the setting applies at the level of a single
> node in the cluster. The *default* maximum degree of parallelism per node
> is calculated as follows, with the theoretical maximum automatically scaled
> back (and rounded down) so that only 70% of the actual available capacity
> is taken into account: number of active drillbits (typically one per node)
> * number of cores per node * 0.7
>
> For example, on a single-node test system with 2 cores and hyper-threading
>
> enabled: 1 * 4 * 0.7 = 3
>



-- 

Abdelhakim Deneche

Software Engineer

  





Re: One single query for more files JSON

2016-02-12 Thread Abdel Hakim Deneche
Of course, if the schema changes between files, this will most likely cause
the query to fail

On Fri, Feb 12, 2016 at 7:42 AM, Abdel Hakim Deneche <adene...@maprtech.com>
wrote:

> Yes, it should work.
>
> On Fri, Feb 12, 2016 at 7:31 AM, Paolo Spanevello <paolosp...@gmail.com>
> wrote:
>
>> Dear All,
>>
>> Could i've a single query for more json files ?
>>
>> Example:
>>
>>- /user/folder1/file1.json
>>- /user/folder2/file2.json
>>- /user/folder3/file3.json
>>
>> Query:
>> SELECT * FROM /user/*/file*.json
>>
>> Thanks a lot for your support.
>>
>> Best,
>> Paolo
>>
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
>
> Now Available - Free Hadoop On-Demand Training
> <http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available>
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>




Re: One single query for more files JSON

2016-02-12 Thread Abdel Hakim Deneche
Yes, it should work.

On Fri, Feb 12, 2016 at 7:31 AM, Paolo Spanevello 
wrote:

> Dear All,
>
> Could i've a single query for more json files ?
>
> Example:
>
>- /user/folder1/file1.json
>- /user/folder2/file2.json
>- /user/folder3/file3.json
>
> Query:
> SELECT * FROM /user/*/file*.json
>
> Thanks a lot for your support.
>
> Best,
> Paolo
>



-- 

Abdelhakim Deneche

Software Engineer

  





expected behavior when using wild cards in table name

2016-02-11 Thread Abdel Hakim Deneche
I have the following table tpch100/lineitem that contains 97 parquet files:

tpch100/lineitem/part-m-00000.parquet
tpch100/lineitem/part-m-00001.parquet
tpch100/lineitem/part-m-00002.parquet

...
tpch100/lineitem/part-m-00096.parquet

I can run the following queries:

SELECT COUNT(*) FROM `tpch100/lineit*`;
SELECT COUNT(*) FROM `tpch100/lineitem/part-m-0001*`;
SELECT COUNT(*) FROM `tpch100/lineitem/*`;

The third query will fail if the table has metadata (it has to do with the
.drill.parquet_metadata showing up at the top of the file system results)

My question is: should the 2nd and 3rd queries be allowed, if we are
querying a table folder that doesn't contain any sub folders  ?

-- 

Abdelhakim Deneche

Software Engineer

  





Re: Query Planning and Directory Pruning

2016-02-09 Thread Abdel Hakim Deneche
Hi John,

Sorry I didn't get back to you (I thought I did).

No, I don't need the plan, I just wanted to confirm what was taking most of
the time and you already confirmed it's the planning.

Can you open a JIRA for this ? this may be a known issue, but I'm not sure.

Thanks

On Tue, Feb 9, 2016 at 6:08 AM, John Omernik <j...@omernik.com> wrote:

> Abdel, do you still need the plans, as I said, if your table has any decent
> amount of directories and files, it looks like the planning is touching all
> the directories even though you are pruning.  I can post plans, however, I
> think in this case you'll find they are exactly the same, and the only
> difference is that the longer queries is planning much more because it has
> more files to read.
>
>
> On Thu, Feb 4, 2016 at 10:46 AM, John Omernik <j...@omernik.com> wrote:
>
> > I can package up both plans for you if you need them (let me know if you
> > still want them) but I can tell you the plans were EXACTLY the same,
> > however the data-sum table took 0.932 seconds to plan the query, and the
> > data table (the one with the all the extra data) took 11.379 seconds to
> > plan the query. Indicating to me the issue isn't in the plan that was
> > created, but the actual planning process. (Let me know if you disagree or
> > still need to see the plan, like I said, the actual plans were exactly
> the
> > same)
> >
> >
> > John.
> >
> >
> > On Thu, Feb 4, 2016 at 10:31 AM, Abdel Hakim Deneche <
> > adene...@maprtech.com> wrote:
> >
> >> Hey John, can you try an explain plan for both queries and see how much
> >> times it takes ?
> >>
> >> for example, for the first query you would run:
> >>
> >> *explain plan for* select count(1) from `data/2016-02-03`;
> >>
> >> It can also be helpful if you could share the query profiles for both
> >> queries.
> >>
> >> Thanks
> >>
> >> On Thu, Feb 4, 2016 at 8:15 AM, John Omernik <j...@omernik.com> wrote:
> >>
> >> > Hey all, I think am I seeing an issue related to
> >> > https://issues.apache.org/jira/browse/DRILL-3759 but I want to
> >> describe it
> >> > out here, see if it's really the case, and then determine what the
> >> blockers
> >> > may be to resolution.
> >> >
> >> > I am using the MapR Developer Release 1.4, and I have a directory with
> >> > subdirectories by data.
> >> >
> >> > data/2015-01-01
> >> > data/2015-01-02
> >> > data/2015-01-03
> >> >
> >> > These are stored as Parquet files.  At this point Each data averages
> >> about
> >> > 1 GB of data, and has roughly 75 parquet files in it.
> >> >
> >> > When I run
> >> >
> >> > select count(1) from `data/2016-02-03` it takes roughly 11 seconds.
> >> >
> >> > If I copy the 2016-02-03 directory to a new base (date-sum) and run
> >> >
> >> > select count(1) from `data_sum/2016-02-03` it runs in 0.874 seconds.
> >> >
> >> > Same data, same structure, only difference is the data_sum directory
> >> only
> >> > has a few directories, iand data has dates going back to Nov 2015.  It
> >> > seems like it is getting files name for all files in each directory
> >> prior
> >> > to pruning which seems to me to be adding a lot of latency to queries
> >> that
> >> > doesn't need to be there.  (thus I think I am seeing 3759) but I
> wanted
> >> to
> >> > confirm, and then I wanted to see how we can address this in that the
> >> > directory prune should be fast, and on large data sets its just going
> to
> >> > get worse and worse.
> >> >
> >> >
> >> >
> >> > John
> >> >
> >>
> >>
> >>
> >> --
> >>
> >> Abdelhakim Deneche
> >>
> >> Software Engineer
> >>
> >>   <http://www.mapr.com/>
> >>
> >>
> >> Now Available - Free Hadoop On-Demand Training
> >> <
> >>
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> >> >
> >>
> >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>




Re: Source for drill's calcite?

2016-02-09 Thread Abdel Hakim Deneche
You can find the r10 branch here:

https://github.com/mapr/incubator-calcite/tree/DrillCalcite1.4.0

On Tue, Feb 9, 2016 at 8:00 AM, Jason Altekruse 
wrote:

> I can't find the latest version either, but this is the r9 branch. I don't
> think any very major changes happened in the last update (it's likely just
> back-porting a fix from calcite master). So you can base your work on this
> branch and rebase it when someone points you to the updated branch.
>
> https://github.com/dremio/calcite/commits/1.4.0-drill-r9
>
> On Tue, Feb 9, 2016 at 7:46 AM, Oscar Morante  wrote:
>
> > I'm trying to add support for week and weekdays to `date_trunc` and
> > `date_part`.  It seems to be working fine right now except that I need to
> > patch calcite's TimeUnit so that the parser doesn't complain when using
> > `extract` directly.
> >
> > I first tried using the latest calcite but it's too different from the
> > version Drill is using, and `1.4.0-incubation` it doesn't work either.
> >
> > I've tried to look for the source but I can't seem to find
> > `1.4.0-drill-r10` anywhere.
> >
> > Any ideas?
> >
> >
> > PS: Here's the patch so far ->
> > https://gist.github.com/spacepluk/40df5a90ddee2efe1f4a
> >
> > --
> > Oscar Morante
> > "Self-education is, I firmly believe, the only kind of education there
> is."
> >  -- Isaac Asimov.
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  





Re: Dealing with files created in Windows

2016-02-08 Thread Abdel Hakim Deneche
is dos2unix an option ?
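
If reprocessing the files is off the table, a query-time workaround sketch is
the regexp_replace idea mentioned in the quoted mail below (the column index
and file path here are hypothetical):

SELECT regexp_replace(columns[2], '\r$', '') AS last_field
FROM dfs.`/data/windows_export.csv`;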

On Mon, Feb 8, 2016 at 9:56 AM, John Omernik  wrote:

> Are there any decent tricks for dealing with Windows based text files (that
> use /r/n as the line ending rather than just /n)
>
> Right now my last field has /r showing up, and I'd like to not have that
> there, I guess I could regex_replace it maybe? I was hoping for a
> performant way to handle (Without reprocessing either)
>
> John
>



-- 

Abdelhakim Deneche

Software Engineer

  





Re: Query Planning and Directory Pruning

2016-02-04 Thread Abdel Hakim Deneche
Hey John, can you try an explain plan for both queries and see how much
time it takes ?

for example, for the first query you would run:

*explain plan for* select count(1) from `data/2016-02-03`;

It can also be helpful if you could share the query profiles for both
queries.

Thanks

On Thu, Feb 4, 2016 at 8:15 AM, John Omernik  wrote:

> Hey all, I think am I seeing an issue related to
> https://issues.apache.org/jira/browse/DRILL-3759 but I want to describe it
> out here, see if it's really the case, and then determine what the blockers
> may be to resolution.
>
> I am using the MapR Developer Release 1.4, and I have a directory with
> subdirectories by data.
>
> data/2015-01-01
> data/2015-01-02
> data/2015-01-03
>
> These are stored as Parquet files.  At this point Each data averages about
> 1 GB of data, and has roughly 75 parquet files in it.
>
> When I run
>
> select count(1) from `data/2016-02-03` it takes roughly 11 seconds.
>
> If I copy the 2016-02-03 directory to a new base (date-sum) and run
>
> select count(1) from `data_sum/2016-02-03` it runs in 0.874 seconds.
>
> Same data, same structure, only difference is the data_sum directory only
> has a few directories, iand data has dates going back to Nov 2015.  It
> seems like it is getting files name for all files in each directory prior
> to pruning which seems to me to be adding a lot of latency to queries that
> doesn't need to be there.  (thus I think I am seeing 3759) but I wanted to
> confirm, and then I wanted to see how we can address this in that the
> directory prune should be fast, and on large data sets its just going to
> get worse and worse.
>
>
>
> John
>



-- 

Abdelhakim Deneche

Software Engineer

  





Re: UDF - BooleanHolder

2016-02-03 Thread Abdel Hakim Deneche
It's called BitHolder

On Wed, Feb 3, 2016 at 3:12 PM, Nicolas Paris  wrote:

> Hello,
>
> Hello,
>
> I would like to create a user defined function that would return a boolean
> value.
> Use case would be :
>
> SELECT * FROM x WHERE MyFunction();
>
> I haven't found any BooleanHolder in order to.
>
>   @Output
> BooleanHolder out;
>
> The only way I have is:
> SELECT * FROM x WHERE MyFunction() =1;
>
> Maybe I miss something
>
> Thanks
>



-- 

Abdelhakim Deneche

Software Engineer

  





Re: Avro reader - Possible regression in 1.5-SNAPSHOT

2016-02-02 Thread Abdel Hakim Deneche
Hi Stefán,

Can you open a JIRA for this, please ?

Thanks

On Tue, Feb 2, 2016 at 6:21 AM, Stefán Baxter 
wrote:

> Hi,
>
> I can confirm that this same query+avro-files work in 1.4 so this is
> probably a regression
>
> Regards,
>  -Stefan
>
> On Tue, Feb 2, 2016 at 1:59 PM, Stefán Baxter 
> wrote:
>
> > Hi,
> >
> > I'm getting this error on master/head using the Avro Reader:
> >
> > "what ever the mind of man can conceive and believe, drill can query"
> > 0: jdbc:drill:zk=local> select * from dfs.asa.`/`;
> > Exception in thread "drill-executor-2" java.lang.NoSuchMethodError:
> >
> org.apache.drill.exec.store.avro.AvroRecordReader.setColumns(Ljava/util/Collection;)V
> > at
> >
> org.apache.drill.exec.store.avro.AvroRecordReader.(AvroRecordReader.java:99)
> > at
> >
> org.apache.drill.exec.store.avro.AvroFormatPlugin.getRecordReader(AvroFormatPlugin.java:73)
> > at
> >
> org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin.getReaderBatch(EasyFormatPlugin.java:172)
> > at
> >
> org.apache.drill.exec.store.dfs.easy.EasyReaderBatchCreator.getBatch(EasyReaderBatchCreator.java:35)
> > at
> >
> org.apache.drill.exec.store.dfs.easy.EasyReaderBatchCreator.getBatch(EasyReaderBatchCreator.java:28)
> > at
> >
> org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:147)
> > at
> >
> org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:170)
> > at
> >
> org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:127)
> > at
> >
> org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:170)
> > at
> >
> org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:127)
> > at
> >
> org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:170)
> > at
> >
> org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:101)
> > at
> >
> org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:79)
> > at
> >
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:230)
> > at
> >
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > at java.lang.Thread.run(Thread.java:745)
> >
> > We have been using the Avro reader for a while and this looks like a
> > regression.
> >
> > I will verify that this is working in 1.4 and report.
> >
> > Regards,
> >  -Stefan
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  





Re: DRILL 1.4 - newline in strings not supported

2016-02-01 Thread Abdel Hakim Deneche
Then it's similar to DRILL-3178 indeed.
Unfortunately there is no way I can think of to read csv files in Drill
without replacing the new line characters.
As Ted mentioned, Drill expects one data row per line to allow easy
splitting of csv files.

On Mon, Feb 1, 2016 at 8:24 AM, Nicolas Paris <nipari...@gmail.com> wrote:

> Abdel,
>
> select * on my csv file fails as well
>
> Thanks
>
> 2016-02-01 17:16 GMT+01:00 Abdel Hakim Deneche <adene...@maprtech.com>:
>
> > When you run a select * on your csv file, does it succeed or fail ?
> >
> > On Mon, Feb 1, 2016 at 7:53 AM, Nicolas Paris <nipari...@gmail.com>
> wrote:
> >
> > > @Abdel,
> > >
> > > Yes problem is similar. By the way, the jira issue allready exists
> > isnt'it
> > > ?
> > >
> > >
> >
> https://www.google.fr/url?sa=t=j==s=web=2=rja=8=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q=4EM_xXq2QWd8kmC3LT2-Wg
> > > If not, I would be glad to add one. Just tell me why
> > >
> > > @Ted,
> > >
> > > If you have new lines in your files then the files becomes unsuitable
> for
> > > splitting.  This means that the only parallelism available in a ctas
> > > statement is multiple files
> > >
> > > ​Does it means newlines are incompatible with drill's distributed
> > calculus
> > > ?
> > >
> > > Do you have a fair number of files?​
> > > ​I have one 30GB csv file. I don't know how many parquet file it could
> > > create as process crashes because of newlines.
> > > I can imagine approx 5 parquet files 500 MB.
> > >
> > > Thanks,​
> > >
> > >
> > > 2016-02-01 16:41 GMT+01:00 Abdel Hakim Deneche <adene...@maprtech.com
> >:
> > >
> > > > Another user already reported some problems querying csv files with
> new
> > > > line characters:
> > > >
> > > >
> http://comments.gmane.org/gmane.comp.apache.incubator.drill.user/2350
> > > >
> > > > His particular problem was related to a bug in the LIKE function.
> > > > Unfortunately he never got around to fill a JIRA for his issue.
> > > >
> > > > Is your problem similar ? if yes, then can you please fill a JIRA ?
> > > >
> > > > On Mon, Feb 1, 2016 at 7:26 AM, Nicolas Paris <nipari...@gmail.com>
> > > wrote:
> > > >
> > > > > Hello Abdel,
> > > > >
> > > > > I am creating parquet file from those CSV files. (CREATE TABLE
> > syntax).
> > > > > Basically, I have a text column, with a maximum of 50k characters,
> > > > > containing newlines (the texts come from pdf extracted). I have
> > > > > multimilions tuples of texts. I am subseting texts containing some
> > > > patterns
> > > > > (LIKE '%foo%' or regex => sadly I haven't found mention about regex
> > in
> > > > > documentation (postgresql "~" operator equivalent))
> > > > > Usually I used postgresql or monetdb in order to mine the texts,
> but
> > I
> > > am
> > > > > benchmarking/studying apache drill too.
> > > > >
> > > > > Thanks,
> > > > >
> > > > >
> > > > > 2016-02-01 15:54 GMT+01:00 Abdel Hakim Deneche <
> > adene...@maprtech.com
> > > >:
> > > > >
> > > > > > Hey Nicolas,
> > > > > >
> > > > > > what kind of queries are you running on your csv file ?
> > > > > >
> > > > > > On Sun, Jan 31, 2016 at 12:14 PM, Nicolas Paris <
> > nipari...@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I am trying to import a csv containing large texts. They
> contains
> > > > > newline
> > > > > > > character "\n".
> > > > > > > Apache Drill conplains about that. There is a jira issue opened
> > on
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://www.google.fr/url?sa=t=j==s=web=2=rja=8=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMA

Re: DRILL 1.4 - newline in strings not supported

2016-02-01 Thread Abdel Hakim Deneche
When you run a select * on your csv file, does it succeed or fail ?

On Mon, Feb 1, 2016 at 7:53 AM, Nicolas Paris <nipari...@gmail.com> wrote:

> @Abdel,
>
> Yes problem is similar. By the way, the jira issue allready exists isnt'it
> ?
>
> https://www.google.fr/url?sa=t=j==s=web=2=rja=8=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q=4EM_xXq2QWd8kmC3LT2-Wg
> If not, I would be glad to add one. Just tell me why
>
> @Ted,
>
> If you have new lines in your files then the files becomes unsuitable for
> splitting.  This means that the only parallelism available in a ctas
> statement is multiple files
>
> ​Does it means newlines are incompatible with drill's distributed calculus
> ?
>
> Do you have a fair number of files?​
> ​I have one 30GB csv file. I don't know how many parquet file it could
> create as process crashes because of newlines.
> I can imagine approx 5 parquet files 500 MB.
>
> Thanks,​
>
>
> 2016-02-01 16:41 GMT+01:00 Abdel Hakim Deneche <adene...@maprtech.com>:
>
> > Another user already reported some problems querying csv files with new
> > line characters:
> >
> > http://comments.gmane.org/gmane.comp.apache.incubator.drill.user/2350
> >
> > His particular problem was related to a bug in the LIKE function.
> > Unfortunately he never got around to fill a JIRA for his issue.
> >
> > Is your problem similar ? if yes, then can you please fill a JIRA ?
> >
> > On Mon, Feb 1, 2016 at 7:26 AM, Nicolas Paris <nipari...@gmail.com>
> wrote:
> >
> > > Hello Abdel,
> > >
> > > I am creating parquet file from those CSV files. (CREATE TABLE syntax).
> > > Basically, I have a text column, with a maximum of 50k characters,
> > > containing newlines (the texts come from pdf extracted). I have
> > > multimilions tuples of texts. I am subseting texts containing some
> > patterns
> > > (LIKE '%foo%' or regex => sadly I haven't found mention about regex in
> > > documentation (postgresql "~" operator equivalent))
> > > Usually I used postgresql or monetdb in order to mine the texts, but I
> am
> > > benchmarking/studying apache drill too.
> > >
> > > Thanks,
> > >
> > >
> > > 2016-02-01 15:54 GMT+01:00 Abdel Hakim Deneche <adene...@maprtech.com
> >:
> > >
> > > > Hey Nicolas,
> > > >
> > > > what kind of queries are you running on your csv file ?
> > > >
> > > > On Sun, Jan 31, 2016 at 12:14 PM, Nicolas Paris <nipari...@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am trying to import a csv containing large texts. They contains
> > > newline
> > > > > character "\n".
> > > > > Apache Drill conplains about that. There is a jira issue opened on
> > > > >
> > > > >
> > > >
> > >
> >
> https://www.google.fr/url?sa=t=j==s=web=2=rja=8=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q=4EM_xXq2QWd8kmC3LT2-Wg
> > > > >
> > > > > Is there a workaround ? (different that removing \n from texts)
> > > > >
> > > > > Thanks by advance
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Abdelhakim Deneche
> > > >
> > > > Software Engineer
> > > >
> > > >   <http://www.mapr.com/>
> > > >
> > > >
> > > > Now Available - Free Hadoop On-Demand Training
> > > > <
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>




Re: DRILL 1.4 - newline in strings not supported

2016-02-01 Thread Abdel Hakim Deneche
Hey Nicolas,

what kind of queries are you running on your csv file ?

On Sun, Jan 31, 2016 at 12:14 PM, Nicolas Paris  wrote:

> Hello,
>
> I am trying to import a csv containing large texts. They contains newline
> character "\n".
> Apache Drill conplains about that. There is a jira issue opened on
>
> https://www.google.fr/url?sa=t=j==s=web=2=rja=8=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q=4EM_xXq2QWd8kmC3LT2-Wg
>
> Is there a workaround ? (different that removing \n from texts)
>
> Thanks by advance
>



-- 

Abdelhakim Deneche

Software Engineer

  





Re: DRILL 1.4 - newline in strings not supported

2016-02-01 Thread Abdel Hakim Deneche
Another user already reported some problems querying csv files with new
line characters:

http://comments.gmane.org/gmane.comp.apache.incubator.drill.user/2350

His particular problem was related to a bug in the LIKE function.
Unfortunately he never got around to filing a JIRA for his issue.

Is your problem similar ? If yes, can you please file a JIRA ?

On Mon, Feb 1, 2016 at 7:26 AM, Nicolas Paris <nipari...@gmail.com> wrote:

> Hello Abdel,
>
> I am creating parquet file from those CSV files. (CREATE TABLE syntax).
> Basically, I have a text column, with a maximum of 50k characters,
> containing newlines (the texts come from pdf extracted). I have
> multimilions tuples of texts. I am subseting texts containing some patterns
> (LIKE '%foo%' or regex => sadly I haven't found mention about regex in
> documentation (postgresql "~" operator equivalent))
> Usually I used postgresql or monetdb in order to mine the texts, but I am
> benchmarking/studying apache drill too.
>
> Thanks,
>
>
> 2016-02-01 15:54 GMT+01:00 Abdel Hakim Deneche <adene...@maprtech.com>:
>
> > Hey Nicolas,
> >
> > what kind of queries are you running on your csv file ?
> >
> > On Sun, Jan 31, 2016 at 12:14 PM, Nicolas Paris <nipari...@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I am trying to import a csv containing large texts. They contains
> newline
> > > character "\n".
> > > Apache Drill conplains about that. There is a jira issue opened on
> > >
> > >
> >
> https://www.google.fr/url?sa=t=j==s=web=2=rja=8=0ahUKEwjUscyr7tTKAhXBVhoKHf0CAjYQFggpMAE=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fdrill-dev%2F201505.mbox%2F%253CJIRA.12832322.1432356299000.15684.1432356317225%40Atlassian.JIRA%253E=AFQjCNHEwAdEpCBmS1QeuLhdfL8SIdTx6Q=4EM_xXq2QWd8kmC3LT2-Wg
> > >
> > > Is there a workaround ? (different that removing \n from texts)
> > >
> > > Thanks by advance
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> >
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>




Re: CTAS error with CSV data

2016-01-26 Thread Abdel Hakim Deneche
This definitely looks like a bug. Could you open a JIRA and share as much
information as possible about the structure of the CSV file and the number
of records?
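
For reference, one way to test whether the empty numeric fields mentioned
elsewhere in this thread are the trigger is to cast them explicitly so the
Parquet writer never sees an empty string. This is only a sketch: the column
positions, types and target table name are hypothetical, and it assumes the
dfs.tmp workspace is writable.

CREATE TABLE dfs.tmp.`customer_hourly` AS
SELECT columns[0] AS event_time,
       columns[1] AS key1,
       -- route empty strings to NULL before casting (hypothetical column index)
       CASE WHEN columns[4] = '' THEN CAST(NULL AS DOUBLE)
            ELSE CAST(columns[4] AS DOUBLE) END AS metric1
FROM dfs.`/csv/customer/hourly/customer_20151017.csv`;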

On Tue, Jan 26, 2016 at 7:38 PM, Matt <bsg...@gmail.com> wrote:

> The CTAS with fails with:
>
> ~~~
> Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: >=
> 0)
>
> Fragment 1:2
>
> [Error Id: 1807615e-4385-4f85-8402-5900aaa568e9 on es07:31010]
>
>   (java.lang.IllegalArgumentException) length: -260 (expected: >= 0)
> io.netty.buffer.AbstractByteBuf.checkIndex():1131
> io.netty.buffer.PooledUnsafeDirectByteBuf.nioBuffer():344
> io.netty.buffer.WrappedByteBuf.nioBuffer():727
> io.netty.buffer.UnsafeDirectLittleEndian.nioBuffer():26
> io.netty.buffer.DrillBuf.nioBuffer():356
>
> org.apache.drill.exec.store.ParquetOutputRecordWriter$VarCharParquetConverter.writeField():1842
> org.apache.drill.exec.store.EventBasedRecordWriter.write():62
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():106
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():415
> org.apache.hadoop.security.UserGroupInformation.doAs():1657
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run():615
> java.lang.Thread.run():745 (state=,code=0)
> ~~~
>
> And a simple SELECT * fails with:
>
> ~~~
> java.lang.IndexOutOfBoundsException: index: 547681, length: 1 (expected:
> range(0, 547681))
> at
> io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1134)
> at
> io.netty.buffer.PooledUnsafeDirectByteBuf.getBytes(PooledUnsafeDirectByteBuf.java:136)
> at io.netty.buffer.WrappedByteBuf.getBytes(WrappedByteBuf.java:289)
> at
> io.netty.buffer.UnsafeDirectLittleEndian.getBytes(UnsafeDirectLittleEndian.java:26)
> at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
> at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
> at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
> at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
> at
> org.apache.drill.exec.vector.VarCharVector$Accessor.get(VarCharVector.java:443)
> at
> org.apache.drill.exec.vector.accessor.VarCharAccessor.getBytes(VarCharAccessor.java:125)
> at
> org.apache.drill.exec.vector.accessor.VarCharAccessor.getString(VarCharAccessor.java:146)
> at
> org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:136)
> at
> org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:94)
> at
> org.apache.drill.exec.vector.accessor.BoundCheckingAccessor.getObject(BoundCheckingAccessor.java:148)
> at
> org.apache.drill.jdbc.impl.TypeConvertingSqlAccessor.getObject(TypeConvertingSqlAccessor.java:795)
> at
> org.apache.drill.jdbc.impl.AvaticaDrillSqlAccessor.getObject(AvaticaDrillSqlAccessor.java:179)
> at
> net.hydromatic.avatica.AvaticaResultSet.getObject(AvaticaResultSet.java:351)
> at
> org.apache.drill.jdbc.impl.DrillResultSetImpl.getObject(DrillResultSetImpl.java:420)
> at sqlline.Rows$Row.(Rows.java:157)
> at sqlline.IncrementalRows.hasNext(IncrementalRows.java:63)
> at
> sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
> at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
> at sqlline.SqlLine.print(SqlLine.java:1593)
> at sqlline.Commands.execute(Commands.java:852)
> at sqlline.Commands.sql(Commands.java:751)
> at sqlline.SqlLine.dispatch(SqlLine.java:746)
> at sqlline.SqlLine.begin(SqlLine.java:621)
> at sqlline.SqlLine.start(SqlLine.java:375)
> at sqlline.SqlLine.main(SqlLine.java:268)
> ~~~
>
> It also looks like if I run the SELECT from a bash shell as "sqlline -u
> ... -f test.sql 2>&1 > test.out" upon the error the sqlline session "locks
> up". No errors spool to the out file and the Java thread can only be
> terminated with a kill -9. It c

Re: CTAS error with CSV data

2016-01-26 Thread Abdel Hakim Deneche
It's an internal buffer index. Can you try enabling verbose errors and running
the query again? This should provide us with more details about the error.
You can enable verbose errors by running the following before the select *:

alter session set `exec.errors.verbose`=true;

thanks

On Tue, Jan 26, 2016 at 11:01 AM, Matt <bsg...@gmail.com> wrote:

> Putting the "select * from
> `/csv/customer/hourly/customer_20151017.csv`;" in a local .sql file,
> and executing it with sqlline > /dev/null (to avoid a ton of scrolling)
> results in:
>
> ~~~
> index: 418719, length: 2 (expected: range(0, 418719))
>Aborting command
> set because "force" is false and command failed: "select * from
> `/csv/customer/hourly/customer_20151017.csv`;"
> Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl
> ~~~
>
> Is that index a byte or line offset?
>
>
> On 26 Jan 2016, at 12:55, Abdel Hakim Deneche wrote:
>
> Does a select * on the same data also fail ?
>>
>> On Tue, Jan 26, 2016 at 9:44 AM, Matt <bsg...@gmail.com> wrote:
>>
>> Getting some errors when attempting to create Parquet files from CSV data,
>>> and trying to determine if it is due to the format of the source data.
>>>
>>> Its a fairly simple format of
>>> "datetime,key,key,key,numeric,numeric,numeric, ..." with 32 of those
>>> numeric columns in total.
>>>
>>> The source data does contain a lot missing values for the numeric
>>> columns,
>>> and those are represented by as consecutive delimiters:
>>> ""datetime,key,key,key,numeric,,..."
>>>
>>> Could this be causing the CTAS to fail with these types of errors? Or is
>>> there another cause to look for?
>>>
>>> ~~~
>>> Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: >=
>>> 0)
>>>
>>> Fragment 1:2
>>> ~~~
>>>
>>>
>>
>>
>> --
>>
>> Abdelhakim Deneche
>>
>> Software Engineer
>>
>> <http://www.mapr.com/>
>>
>>
>> Now Available - Free Hadoop On-Demand Training
>> <
>> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
>> >
>>
>


-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>




Re: Drill ODBC: format number on excel looks like Text

2016-01-23 Thread Abdel Hakim Deneche
Could this help ?

http://superuser.com/questions/385511/easy-way-to-one-off-import-data-with-different-decimal-separator-in-excel

The solution is a bit old though and newer versions of Excel may not offer
the option anymore.
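
If changing the Excel settings isn't practical, another workaround — only a sketch, and the path and format pattern here are assumptions — is to have Drill return the value pre-formatted as text so Excel imports it verbatim:

~~~
-- to_char() formats the number with a Java DecimalFormat-style pattern
SELECT TO_CHAR(t.`1s_critical_power`, '#0.0') AS critical_power_txt
FROM dfs.`/data/power.json` t;
~~~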

On Sat, Jan 23, 2016 at 3:11 PM, Paolo Spanevello 
wrote:

> Dear Ted,
> exactly this is the point.
>
> How could I fix this issue?
>
> Best,
> Paolo
>
> 2016-01-23 23:27 GMT+01:00 Ted Dunning :
>
> > Andries,
> >
> > But if you set Excel to use , as the decimal separator, then 925.
> could
> > be displayed as 925,
> >
> > Guessing by name, I suspect that Paolo is European and might have Excel
> set
> > this way.
> >
> >
> >
> > On Fri, Jan 22, 2016 at 9:26 AM, Andries Engelbrecht <
> > aengelbre...@maprtech.com> wrote:
> >
> > > What does the JSON data look like?
> > >
> > > I did a quick test with Excel and MS Query through ODBC to connect to
> > > Drill.
> > >
> > > Selecting data as either a string or numeric value.
> > >
> > > select * from (values('925.000',925.))
> > >
> > > The results returned is
> > > 925.000  for the string value and 925 for the numeric value to Excel.
> > >
> > >
> > > With Drill 1.4 you can use typeof() to see what data type it is being
> > > interpreted as.
> > >
> > > --Andries
> > >
> > >
> > >
> > > > On Jan 22, 2016, at 8:50 AM, Ted Dunning 
> > wrote:
> > > >
> > > > This sounds like it might be a problem with the decimal point separator.
> > Can
> > > > you say what decimal point character you normally use? It might also
> be
> > > > that this is set differently on the exel machine from the machine
> where
> > > > drill is running.
> > > >
> > > > I am presuming that the value that you want to see is 925.0
> > > >
> > > > Is that correct?
> > > >
> > > >
> > > > On Fri, Jan 22, 2016 at 6:33 AM, Paolo Spanevello <
> > paolosp...@gmail.com>
> > > > wrote:
> > > >
> > > >> Dear All,
> > > >> i'm drilling a JSON File with some fields with numbers with this
> > format
> > > :
> > > >>
> > > >> 1s_critical_power
> > > >> 925.0
> > > >>
> > > >>
> > > >> I'm using ODBC Driver to connect it on excel and the result aspect
> is
> > > >>
> > > >> 1s_critical_power
> > > >> 925,0
> > > >> Do you know the right way to have it?
> > > >>
> > > >> Best regards,
> > > >> Paolo
> > > >>
> > >
> > >
> >
>






Re: JDBC Driver - Possible regression

2016-01-20 Thread Abdel Hakim Deneche
Stefán,

Please reopen the JIRA and add a comment describing what you are seeing.

Thanks

On Wed, Jan 20, 2016 at 4:34 AM, Stefán Baxter 
wrote:

> Hi again,
>
> We have verified that the error exists on master:head (1.5-SNAPSHOT).
>
> Regards,
>  -Stefan
>
> On Wed, Jan 20, 2016 at 10:39 AM, Stefán Baxter  >
> wrote:
>
> > Hi,
> >
> > We are using the 1.5-SNAPSHOT version of the JDBC drilver (all) and we
> > seem to be getting this old thing:
> >
> > https://issues.apache.org/jira/browse/DRILL-2482
> >
> > We are either doing something wrong or this is a regression. Has
> > anyone else experienced not being able to get nested structures via the
> > latest JDBC driver?
> >
> > (I'm going to pull the lastest from master to be sure this has not been
> > solved)
> >
> > The error we get when accessing a field containing a sub-structure is :
> >
> > java.lang.NoClassDefFoundError: org/apache/hadoop/io/Text
> >
> > at
> >
> oadd.org.apache.drill.exec.util.JsonStringArrayList.<init>(JsonStringArrayList.java:35)
> > at
> >
> oadd.org.apache.drill.exec.vector.RepeatedVarCharVector$Accessor.getObject(RepeatedVarCharVector.java:293)
> > at
> >
> oadd.org.apache.drill.exec.vector.RepeatedVarCharVector$Accessor.getObject(RepeatedVarCharVector.java:290)
> > at
> >
> oadd.org.apache.drill.exec.vector.accessor.GenericAccessor.getObject(GenericAccessor.java:44)
> > at
> >
> oadd.org.apache.drill.exec.vector.accessor.BoundCheckingAccessor.getObject(BoundCheckingAccessor.java:148)
> > at
> >
> org.apache.drill.jdbc.impl.TypeConvertingSqlAccessor.getObject(TypeConvertingSqlAccessor.java:795)
> > at
> >
> org.apache.drill.jdbc.impl.AvaticaDrillSqlAccessor.getObject(AvaticaDrillSqlAccessor.java:179)
> > at
> >
> oadd.net.hydromatic.avatica.AvaticaResultSet.getObject(AvaticaResultSet.java:351)
> > at
> >
> org.apache.drill.jdbc.impl.DrillResultSetImpl.getObject(DrillResultSetImpl.java:420)
> >
> >
> > Regards,
> >  -Stefan
> >
> >
>






Re: Too many open files

2016-01-11 Thread Abdel Hakim Deneche
Hi Ian,

Can you open up a JIRA for this ? is it easy to reproduce ?

Thanks

On Mon, Jan 11, 2016 at 8:59 AM, Ian Maloney 
wrote:

> Hi,
>
> I've been running a lot of queries via jdbc/drill. I have four drillbits,
> but I could not get the zk jdbc URL to work so I used:
> jdbc:drill:drillbit=a-bits-hostname
>
> Now I get a SocketException for too many open files, even when accessing
> via cli. I imagine I could restart the bits, but for something intended for
> production, that doesn't seem like a viable solution. Any ideas on how to
> keep the (suspected) resource leak from happening?
>
> I'm closing ResultSet, Statement, and Connection, after each query.
>






Re: Issue in developing UDF

2016-01-11 Thread Abdel Hakim Deneche
t; > >> >
> > net.sourceforge.wurfl.core.GeneralWURFLEngine("/home/nirav/wurfl.xml");
> > >> > //String SUA =
> > >> >
> > >>
> >
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(userAgent.start,
> > >> > userAgent.end, userAgent.buffer);
> > >> >
> > >> >
> > >> > /*String[] capabilities = {
> > >> > "device_os",
> > >> > "device_os_version",
> > >> > "is_tablet",
> > >> > "is_wireless_device",
> > >> > "pointing_method",
> > >> > "preferred_markup",
> > >> > "resolution_height",
> > >> > "resolution_width",
> > >> > "ux_full_desktop",
> > >> > "xhtml_support_level",
> > >> > "is_smarttv",
> > >> > "can_assign_phone_number",
> > >> > "brand_name",
> > >> > "model_name",
> > >> > "marketing_name",
> > >> > "mobile_browser_version"
> > >> > };
> > >> > wurfl.setEngineTarget(EngineTarget.accuracy);
> > >> > wurfl.setCapabilityFilter(capabilities);
> > >> > Device device = wurfl.getDeviceForRequest(SUA);
> > >> > System.out.println("4-->"+new Date(
> > >> > System.currentTimeMillis()));
> > >> > System.out.println("Device OS: " +
> > >> > device.getCapability("device_os"));
> > >> > System.out.println("Device OS version: " +
> > >> > device.getCapability("device_os_version"));
> > >> > System.out.println("Brand name: " +
> > >> > device.getCapability("brand_name"));
> > >> > System.out.println("advertised_device_os_version: " +
> > >> > device.getCapability("advertised_device_os_version"));
> > >> > System.out.println("advertised_device_os: " +
> > >> > device.getCapability("advertised_device_os"));
> > >> > System.out.println("advertised_browser: " +
> > >> > device.getCapability("advertised_browser"));
> > >> >
> > >> >
> > >>
> >
> System.out.println("advertised_browser_version:"+device.getCapability("advertised_browser_version"));
> > >> > stringOutValue =
> > device.getCapability("device_os_version");
> > >> > */
> > >> > byte[] valueDecoded = stringOutValue.getBytes();
> > >> > outValue.buffer =
> > >> buffer.reallocIfNeeded(valueDecoded.length);
> > >> > outValue.start = 0;
> > >> > outValue.end = valueDecoded.length;
> > >> > buffer.setBytes(0, valueDecoded);
> > >> > } catch (Exception e) {
> > >> > // TODO Auto-generated catch block
> > >> > //stringOutValue = "null6";
> > >> > byte[] valueDecoded = e.getMessage().getBytes();
> > >> > outValue.buffer =
> > >> buffer.reallocIfNeeded(valueDecoded.length);
> > >> > outValue.start = 0;
> > >> > outValue.end = valueDecoded.length;
> > >> > buffer.setBytes(0, valueDecoded);
> > >> >
> > >> > }
> > >> >
> > >> > }
> > >> > }
> > >> > // select GetBrowserDtl('Mozilla/5.0 (X11; Linux x86_64)
> > >> AppleWebKit/537.36
> > >> > (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36',1) from
> > >> (values(1));
> > >> > //select GetBrowserDtl('Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
> > rv:31.0)
> > >> > Gecko/20100101 Firefox/31.0',1) from (values(1));
> > >> > //se

Re: Issue in developing UDF

2016-01-06 Thread Abdel Hakim Deneche
According to Drill documentation:

http://drill.apache.org/docs/adding-custom-functions-to-drill/

You need to copy both the class jar and the source jar of your UDF to
$DRILL_HOME/jars/3rdparty/

Did you do that?

On Tue, Jan 5, 2016 at 11:58 PM, Nirav Shah 
wrote:

> Hi ,
>
>
> I am trying to extract info from user agent using WURFL libraries.
>
> I am not sure what I am doing wrong here, but it's not working and not
> giving any errors either.
>
> I have put wurfl.xml in source folder and  wurfl.jar to
> /jar/3rdparty/
>
> *Code :*
>
> wurfl = new
> net.sourceforge.wurfl.core.GeneralWURFLEngine("wurfl.xml");
> String SUA =
>
> org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(userAgent.start,
> userAgent.end, userAgent.buffer);
>
>
> String[] capabilities = {
> "device_os",
> "device_os_version",
> "is_tablet",
> "is_wireless_device",
> "pointing_method",
> "preferred_markup",
> "resolution_height",
> "resolution_width",
> "ux_full_desktop",
> "xhtml_support_level",
> "is_smarttv",
> "can_assign_phone_number",
> "brand_name",
> "model_name",
> "marketing_name",
> "mobile_browser_version"
> };
> wurfl.setEngineTarget(EngineTarget.accuracy);
> wurfl.setCapabilityFilter(capabilities);
> Device device = wurfl.getDeviceForRequest(SUA);
> System.out.println("4-->"+new Date(
> System.currentTimeMillis()));
> System.out.println("Device OS: " +
> device.getCapability("device_os"));
> System.out.println("Device OS version: " +
> device.getCapability("device_os_version"));
> System.out.println("Brand name: " +
> device.getCapability("brand_name"));
> System.out.println("advertised_device_os_version: " +
> device.getCapability("advertised_device_os_version"));
> System.out.println("advertised_device_os: " +
> device.getCapability("advertised_device_os"));
> System.out.println("advertised_browser: " +
> device.getCapability("advertised_browser"));
>
>
> System.out.println("advertised_browser_version:"+device.getCapability("advertised_browser_version"));
> stringOutValue = device.getCapability("device_os_version");
>
>
> Regards,
> Nirav
>






Re: Drill Query Problem

2015-12-04 Thread Abdel Hakim Deneche
Hi Nirav,

can you give us more information to help reproduce this issue ?

thanks

On Fri, Dec 4, 2015 at 3:42 AM, Nirav Shah  wrote:

> Hello,
>
> I am getting below error while running big query.
>
> ===
>
> Error: SYSTEM ERROR: CompileException: File
> 'org.apache.drill.exec.compile.DrillJavaFileObject[ProjectorGen2825.java]',
> Line 5799, Column 17: ProjectorGen2825.java:5799: error: code too large
> public void doEval(int inIndex, int outIndex)
> ^ (compiler.err.limit.code)
>
>
> Regards,
> Nirav
>






Re: Infinite pending Bug ?

2015-11-13 Thread Abdel Hakim Deneche
Boris, this is definitely a bug, the query seems to be blocked in planning.
Please open a JIRA.

Thanks

On Fri, Nov 13, 2015 at 9:03 AM, Abhishek Girish <abhishek.gir...@gmail.com>
wrote:

> I tried this on latest master - looks like its stuck at planning stage
> (explain plan hangs).
>
> Any slight modification to the create View2 query in the "trim" section
> makes the select * query succeed (for example changing '' to ' '). I'm not
> sure if this will help narrow down the issue though.
>
> -Abhishek
>
> On Fri, Nov 13, 2015 at 8:55 AM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > Hello Boris,
> >
> > What version of Drill are you using ?
> >
> > Thanks
> >
> > On Fri, Nov 13, 2015 at 8:33 AM, Hsuan Yi Chu <hyi...@maprtech.com>
> wrote:
> >
> > > Do you know if it is stuck at planning?
> > >
> > > On Fri, Nov 13, 2015 at 8:03 AM, Boris Chmiel <
> > > boris.chm...@yahoo.com.invalid> wrote:
> > >
> > > > Hello everyone,
> > > > I reach an infinite pending on a quite simple set of queries with very
> > > > small files. Do you see a flaw in my queries or is it a bug ?
> > > > View :
> > > > create or replace view View1 AS (
> > > >   SELECT B1.columns[0] c0, B1.columns[1] c1
> > > >   FROM dfs.tmp.`TEST\B1.csv` B1
> > > >   LEFT OUTER JOIN dfs.tmp.`TEST\BK.csv` BK
> > > >   ON B1.columns[1] = BK.columns[0]
> > > >   WHERE BK.columns[0] is null AND trim(B1.columns[1]) <> '');
> > > >
> > > > create or replace view View2 AS (
> > > >   SELECT View1.c0, View1.c1
> > > >   FROM View1
> > > >   LEFT OUTER JOIN dfs.tmp.`TEST\BK.csv` BK
> > > >   ON View1.c1 = BK.columns[0]
> > > >   WHERE BK.columns[0] is null AND trim(View1.c1) <> '');
> > > >
> > > > Query : select * FROM dfs.tmp.View2
> > > > => Infinite Pending
> > > > data set : B1 :A;B;FC;AD;EE;F;C
> > > > BK:A;1B;2F;4
> > > >
> > >
> >
> >
> >
> >
>





Re: Infinite pending Bug ?

2015-11-13 Thread Abdel Hakim Deneche
Hello Boris,

What version of Drill are you using ?

Thanks

On Fri, Nov 13, 2015 at 8:33 AM, Hsuan Yi Chu  wrote:

> Do you know if it is stuck at planning?
>
> On Fri, Nov 13, 2015 at 8:03 AM, Boris Chmiel <
> boris.chm...@yahoo.com.invalid> wrote:
>
> > Hello everyone,
> > I reach an infinite pending on a quite simple set of queries with very
> > small files. Do you see a flaw in my queries or is it a bug ?
> > View :
> > create or replace view View1 AS (
> >   SELECT B1.columns[0] c0, B1.columns[1] c1
> >   FROM dfs.tmp.`TEST\B1.csv` B1
> >   LEFT OUTER JOIN dfs.tmp.`TEST\BK.csv` BK
> >   ON B1.columns[1] = BK.columns[0]
> >   WHERE BK.columns[0] is null AND trim(B1.columns[1]) <> '');
> >
> > create or replace view View2 AS (
> >   SELECT View1.c0, View1.c1
> >   FROM View1
> >   LEFT OUTER JOIN dfs.tmp.`TEST\BK.csv` BK
> >   ON View1.c1 = BK.columns[0]
> >   WHERE BK.columns[0] is null AND trim(View1.c1) <> '');
> >
> > Query : select * FROM dfs.tmp.View2
> > => Infinite Pending
> > data set : B1 :A;B;FC;AD;EE;F;C
> > BK:A;1B;2F;4
> >
>






Re: Help with Troubleshooting dense error message

2015-11-04 Thread Abdel Hakim Deneche
One last thing, what version of Drill do you have installed ?

On Wed, Nov 4, 2015 at 11:04 AM, John Omernik <j...@omernik.com> wrote:

> No I don't think so.  I am running Drill in Marathon on Mesos, so my
> startup settings are all very static. In addition, the only session
> variable I changed was the json as text option at the session level and
> I was setting it on both the pre drillbit reboot and the post drillbit
> reboot sessions (I need that to query the data).
>
> On Wed, Nov 4, 2015 at 12:46 PM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > This is strange indeed. The error message you reported earlier doesn't
> > suggest a memory leak issue but rather a bug when reading a specific set
> of
> > data.
> > Could it be that you changed some session options, and you forgot to set
> > them again after you restarted the drillbits ?
> >
> > Thanks
> >
> > On Wed, Nov 4, 2015 at 10:37 AM, John Omernik <j...@omernik.com> wrote:
> >
> > > So I pulled the (I was up to two) files that seemed to be causing this
> > > issue out, and loaded my data.  (see my other posts on how I did that
> > with
> > > loading into a folder prefixed by .)
> > >
> > > Anywho, my Drill cluster became unstable in general, and I was not able
> > to
> > > run any queries until I bounced my drill bits.
> > >
> > > I did that, got my process working again, and went to go try
> > > troubleshooting this problem again and everything appears to be working
> > > well now.  I am stumped.   Could a memory leak have caused that error
> > only
> > > on some files?  I am monitoring now to determine if the problem starts
> > > again, but that is REALLY strange to me. This seems out of character
> for
> > > Drill, both in my use of it, and in how it handles memory has been
> > > explained to me.  If I get the error again, I'll ensure I set that to
> > get a
> > > full stack trace.
> > >
> > > John
> > >
> > > On Wed, Nov 4, 2015 at 12:13 PM, Abdel Hakim Deneche <
> > > adene...@maprtech.com>
> > > wrote:
> > >
> > > > The error message "index: 9604, length: 4 (expected: range(0, 8192))"
> > > > suggests an error happened when Drill tried to access a memory buffer
> > > (most
> > > > likely while writing an int or float value)
> > > > This may be a bug actually exposed by that particular data record.
> > > >
> > > > You can try enabling verbose error logging before running the query
> > > again:
> > > >
> > > > set `exec.errors.verbose`=true;
> > > >
> > > > This should give us a nice stack trace about this error.
> > > >
> > > > Thanks
> > > >
> > > > On Wed, Nov 4, 2015 at 7:29 AM, John Omernik <j...@omernik.com>
> wrote:
> > > >
> > > > > There are multiple fields in that record, including two lists. Both
> > > lists
> > > > > have data in them (now I am runnning with json text mode because at
> > > times
> > > > > the first value is a JSON null, but in these cases, that should be
> > > turned
> > > > > to "null" as  string.  (If I am understanding things correctly) and
> > > > > shouldn't be causing a problem.
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Nov 4, 2015 at 9:21 AM, Hsuan Yi Chu <hyi...@maprtech.com>
> > > > wrote:
> > > > >
> > > > > > What is the data type for that record in line 2402? A list?
> > > > > >
> > > > > > Do you think it could be similar to this issue ?
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/DRILL-4006
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Nov 4, 2015 at 6:48 AM, John Omernik <j...@omernik.com>
> > > wrote:
> > > > > >
> > > > > > > Hey all,
> > > > > > >
> > > > > > > I am working with JSON that is on the whole fairly clean.  I am
> > > > trying
> > > > > to
> > > > > > > load into Parquet files, and the previous days worth of data
> > worked
> > > > > just
> > > > > > > fine, but todays data has something wrong with it and I Can't
> > > figure

Re: Help with Troubleshooting dense error message

2015-11-04 Thread Abdel Hakim Deneche
This is strange indeed. The error message you reported earlier doesn't
suggest a memory leak issue but rather a bug when reading a specific set of
data.
Could it be that you changed some session options, and you forgot to set
them again after you restarted the drillbits ?
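
One related note, since session options don't survive a drillbit restart or a new connection: they have to be re-issued per session (or promoted to system level). A minimal sketch, assuming the "json as text" option mentioned above is `store.json.all_text_mode`:

~~~
-- re-apply per-session settings after reconnecting
ALTER SESSION SET `store.json.all_text_mode` = true;
ALTER SESSION SET `exec.errors.verbose` = true;

-- or persist the setting for all sessions
ALTER SYSTEM SET `store.json.all_text_mode` = true;
~~~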

Thanks

On Wed, Nov 4, 2015 at 10:37 AM, John Omernik <j...@omernik.com> wrote:

> So I pulled the (I was up to two) files that seemed to be causing this
> issue out, and loaded my data.  (see my other posts on how I did that with
> loading into a folder prefixed by .)
>
> Anywho, my Drill cluster became unstable in general, and I was not able to
> run any queries until I bounced my drill bits.
>
> I did that, got my process working again, and went to go try
> troubleshooting this problem again and everything appears to be working
> well now.  I am stumped.   Could a memory leak have caused that error only
> on some files?  I am monitoring now to determine if the problem starts
> again, but that is REALLY strange to me. This seems out of character for
> Drill, both in my use of it, and in how its handling of memory has been
> explained to me.  If I get the error again, I'll ensure I set that to get a
> full stack trace.
>
> John
>
> On Wed, Nov 4, 2015 at 12:13 PM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > The error message "index: 9604, length: 4 (expected: range(0, 8192))"
> > suggests an error happened when Drill tried to access a memory buffer
> (most
> > likely while writing an int or float value)
> > This may be a bug actually exposed by that particular data record.
> >
> > You can try enabling verbose error logging before running the query
> again:
> >
> > set `exec.errors.verbose`=true;
> >
> > This should give us a nice stack trace about this error.
> >
> > Thanks
> >
> > On Wed, Nov 4, 2015 at 7:29 AM, John Omernik <j...@omernik.com> wrote:
> >
> > > There are multiple fields in that record, including two lists. Both
> lists
> > > have data in them (now I am runnning with json text mode because at
> times
> > > the first value is a JSON null, but in these cases, that should be
> turned
> > > to "null" as  string.  (If I am understanding things correctly) and
> > > shouldn't be causing a problem.
> > >
> > >
> > >
> > > On Wed, Nov 4, 2015 at 9:21 AM, Hsuan Yi Chu <hyi...@maprtech.com>
> > wrote:
> > >
> > > > What is the data type for that record in line 2402? A list?
> > > >
> > > > Do you think it could be similar to this issue ?
> > > >
> > > > https://issues.apache.org/jira/browse/DRILL-4006
> > > >
> > > >
> > > >
> > > > On Wed, Nov 4, 2015 at 6:48 AM, John Omernik <j...@omernik.com>
> wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > I am working with JSON that is on the whole fairly clean.  I am
> > trying
> > > to
> > > > > load into Parquet files, and the previous days worth of data worked
> > > just
> > > > > fine, but todays data has something wrong with it and I Can't
> figure
> > > out
> > > > > what it is. Unfortunately, I can't post the data, which I know
> makes
> > > this
> > > > > hard to troubleshoot for the community. Hopefully I can provide
> some
> > > info
> > > > > here, and get some pointers on where to look, and then report back
> on
> > > how
> > > > > we could potentially improve the error messages.
> > > > >
> > > > > The error is below.
> > > > >
> > > > >
> > > > > I am looking to figure out given the information reported where I'd
> > > look
> > > > to
> > > > > trouble shoot this. Obviously the file
> > > > 02ffc306e877_my_load_1446640931.json
> > > > > is where I am looking to start
> > > > >
> > > > > This file has 3000 lines (records of data, so it's somewhere in
> > > between.
> > > > >
> > > > > The index/length/expected range don't mean anything to me I could
> use
> > > > some
> > > > > help there, because I am not even sure what I am looking for.
> > > > >
> > > > > The record and/or Fragment... do those help me dig in?
> > > > >
> > > > > Since this is one record per line, I went to line 2402 but that
> > record
> > >

Re: Help with Troubleshooting dense error message

2015-11-04 Thread Abdel Hakim Deneche
The error message "index: 9604, length: 4 (expected: range(0, 8192))"
suggests an error happened when Drill tried to access a memory buffer (most
likely while writing an int or float value).
This may be a bug actually exposed by that particular data record.

You can try enabling verbose error logging before running the query again:

set `exec.errors.verbose`=true;

This should give us a nice stack trace about this error.

Thanks

On Wed, Nov 4, 2015 at 7:29 AM, John Omernik  wrote:

> There are multiple fields in that record, including two lists. Both lists
> have data in them (now I am runnning with json text mode because at times
> the first value is a JSON null, but in these cases, that should be turned
> to "null" as  string.  (If I am understanding things correctly) and
> shouldn't be causing a problem.
>
>
>
> On Wed, Nov 4, 2015 at 9:21 AM, Hsuan Yi Chu  wrote:
>
> > What is the data type for that record in line 2402? A list?
> >
> > Do you think it could be similar to this issue ?
> >
> > https://issues.apache.org/jira/browse/DRILL-4006
> >
> >
> >
> > On Wed, Nov 4, 2015 at 6:48 AM, John Omernik  wrote:
> >
> > > Hey all,
> > >
> > > I am working with JSON that is on the whole fairly clean.  I am trying
> to
> > > load into Parquet files, and the previous day's worth of data worked
> just
> > > fine, but today's data has something wrong with it and I can't figure
> out
> > > what it is. Unfortunately, I can't post the data, which I know makes
> this
> > > hard to troubleshoot for the community. Hopefully I can provide some
> info
> > > here, and get some pointers on where to look, and then report back on
> how
> > > we could potentially improve the error messages.
> > >
> > > The error is below.
> > >
> > >
> > > I am looking to figure out given the information reported where I'd
> look
> > to
> > > trouble shoot this. Obviously the file
> > 02ffc306e877_my_load_1446640931.json
> > > is where I am looking to start
> > >
> > > This file has 3000 lines (records) of data, so it's somewhere in
> between.
> > >
> > > The index/length/expected range don't mean anything to me I could use
> > some
> > > help there, because I am not even sure what I am looking for.
> > >
> > > The record and/or Fragment... do those help me dig in?
> > >
> > > Since this is one record per line, I went to line 2402 but that record
> > > looks completely normal to me, (like all the other ones) but since this
> > is
> > > dense text, I am obviously missing something, but is the record the
> line
> > > number?
> > >
> > > Any other pointers I can use to trouble shoot this?
> > >
> > > Thanks!
> > >
> > > Error:
> > >
> > >
> > > Caused by: org.apache.drill.common.exceptions.UserRemoteException:
> > > DATA_READ ERROR: Error parsing JSON - index: 9604, length: 4 (expected:
> > > range(0, 8192))
> > >
> > >
> > >
> > > File
> > >
> > >
> >
> /etl/dev/my-metadata/mysqspull/loads/2015-11-04/02ffc306e877_my_load_1446640931.json
> > >
> > > Record  2402
> > >
> > > Fragment 1:5
> > >
> >
>






Re: Drill Query Error

2015-10-29 Thread Abdel Hakim Deneche
Hi Sanjeev,

are you running a single query or multiple queries concurrently ?
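
One additional, hedged suggestion: if this is a Parquet dataset and you are on Drill 1.2 or later, generating the Parquet metadata cache often speeds up the planning phase that gathers block maps for many files. The table path below is hypothetical:

~~~
-- run once per table (and again after new files land)
REFRESH TABLE METADATA dfs.`/data/my_parquet_table`;
~~~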

Thanks

On Thu, Oct 29, 2015 at 9:24 AM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:

> It would be more helpful if you could answer questions like the
> following.
>
> What DFS are you trying to connect to? Is it local or remote? What does
> the storage plugin config looks like? etc.
>
>
> —Andries
>
> > On Oct 29, 2015, at 9:15 AM, Sanjeev Verma 
> wrote:
> >
> > it is 5 node drill cluster configured with 20GB direct memory.
> >
> > On Thu, Oct 29, 2015 at 8:39 PM, Jacques Nadeau 
> wrote:
> >
> >> This message is saying that Drill asked HDFS (or MFS) for the block
> >> locations of the files involved in the query. Drill used 16 threads for
> >> doing this work and waited a total of 108 seconds but the Namenode (or
> MFS)
> >> didn't respond. There were 116 separate files that a block map was
> >> requested for.
> >>
> >> As Andries said, we need more information on the situation you are
> running
> >> on. Is this HDFS? Is this S3? Which version of Drill and other systems?
> >>
> >> --
> >> Jacques Nadeau
> >> CTO and Co-Founder, Dremio
> >>
> >> On Thu, Oct 29, 2015 at 7:01 AM, Sanjeev Verma <
> sanjeev.verm...@gmail.com>
> >> wrote:
> >>
> >>> I am getting Drill exception while querying any clue?
> >>>
> >>> RESOURCE ERROR: Waited for 108750ms, but tasks for 'Get block maps' are
> >> not
> >>> complete. Total runnable size 116, parallelism 16.
> >>>
> >>
>
>





Re: Exception with CSV storage format : Repeated types are not supported

2015-10-27 Thread Abdel Hakim Deneche
Hey Chandan,

I assume 'parquetlogs' contains Parquet files, right ?
what is the schema of 'parquetlogs' ? does it contain repeated fields ?

thanks

On Tue, Oct 27, 2015 at 2:40 AM, chandan prakash 
wrote:

> Hi everyone,
> Can anyone help with how to write a CTAS query with CSV as the storage format
> without hitting the "Repeated types are not supported" exception?
> If I am using any other storage format like JSON or PARQUET, it's working
> fine, but I have to use CSV format.
> *Saw similar JIRA bug :  https://issues.apache.org/jira/browse/DRILL-1954
> *
> but its still in open state.
>
> I have a query like :
> alter session set `store.format`='csv';
> CREATE TABLE dfs.csvlogs.`jobId` AS SELECT * FROM
> dfs.parquetlogs.`parquetlogs` ;
>
> *Getting exception :*
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.csvlogs.`jobId` AS   SELECT * FROM
> dfs.parquetlogs.`parquetlogs` where dir0>'20150928';
> Error: SYSTEM ERROR: UnsupportedOperationException: Repeated types are not
> supported.
>
> Fragment 1:1
>
> [Error Id: 8da5886d-72f5-4801-891b-fcb5d0f1bbf0 on chandans-mbp:31010]
>
>   (java.lang.UnsupportedOperationException) Repeated types are not
> supported.
>
>
> org.apache.drill.exec.store.StringOutputRecordWriter$RepeatedVarCharStringFieldConverter.writeField():1560
> org.apache.drill.exec.store.EventBasedRecordWriter.write():61
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():106
> org.apache.drill.exec.record.AbstractRecordBatch.next():147
> org.apache.drill.exec.physical.impl.BaseRootExec.next():83
>
>
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
> org.apache.drill.exec.physical.impl.BaseRootExec.next():73
>
>
> --
> Chandan Prakash
>






Re: CTAS over empty file throws NPE

2015-10-22 Thread Abdel Hakim Deneche
Chandan actually found the JIRA:

https://issues.apache.org/jira/browse/DRILL-3539

On Thu, Oct 22, 2015 at 10:25 AM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:

> Hsuan
> Is there is a JIRA for this?
>
> On Thu, Oct 22, 2015 at 10:11 AM, Hsuan Yi Chu 
> wrote:
>
> > Hi,
> > This is a known issue. It is because there is no schema with an empty .tsv
> file.
> >
> > I think Daniel might be trying to address this issue.
> >
> > Thanks for bringing this up.
> >
> > On Wed, Oct 21, 2015 at 8:10 PM, chandan prakash <
> > chandanbaran...@gmail.com>
> > wrote:
> >
> > > Hi,
> > > I  have to run CTAS on a tsv file which might in some cases be empty .
> > > In those cases its giving NPE.
> > >
> > > java.sql.SQLException: SYSTEM ERROR: NullPointerException
> > >
> > > Fragment 0:0
> > >
> > > [Error Id: 4aa5a127-b2dd-41a0-ac49-fc2058e9564f on 192.168.0.104:31010
> ]
> > >
> > > at org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(
> > > DrillCursor.java:214)
> > >
> > > at org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(
> > > DrillCursor.java:257)
> > >
> > > saw *similar bug post :
> https://issues.apache.org/jira/browse/DRILL-3539
> > > *
> > >
> > > Can anyone help with the fix or some workaround ?
> > > Any lead will be appreciated .
> > >
> > > Thanks,
> > > Chandan
> > >
> > >
> > > --
> > > Chandan Prakash
> > >
> >
>






Re: Drill CTAS to single file

2015-10-21 Thread Abdel Hakim Deneche
Another way to do it is to let sqlline save the csv file for you. This way
you won't have to worry about Drill's parallelization, but you might need
to make slight changes to your storage plugin to properly read sqlline's
csv files.

For example, I have the following CTAS:

create table e as select * from cp.`employee.json` order by salary;

I create a script file, for example q.sql that contains the select query:

--- content of q.sql ---
select * from cp.`employee.json` order by salary;


I can then run this query through sqlline and force it to write the results to
a csv file:

bin/sqlline -u jdbc:drill:zk=local --outputformat=csv -f q.sql > q.csv


You can already query the file in Drill but you'll notice that it has a
header line and all strings are wrapped in single quotes. So I had to
change my "csv" config in the dfs storage plugin to the following:

"csv": {
  "type": "text",
  "extensions": [
"csv"
  ],
*  "quote": "'",*
*  "skipFirstLine": true,*
  "delimiter": ","
},

You can find more information about configuring the storage plugin here:
http://drill.apache.org/docs/text-files-csv-tsv-psv/
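
For completeness, the other approach discussed further down in this thread — forcing the final write onto a single stream with a global sort — looks roughly like this (a sketch; the table, path and column names are hypothetical):

~~~
alter session set `store.format`='csv';

-- the trailing ORDER BY merges everything into one sorted stream,
-- which typically results in a single writer and a single output file
create table dfs.tmp.`single_result` as
select * from dfs.`/data/source` order by event_time;
~~~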

Let me know if this helps

Thanks

On Wed, Oct 21, 2015 at 10:36 AM, Jason Altekruse 
wrote:

> For clarity, the only reason I said anything about a size limit on a CSV is
> that it is possible that Drill may stop writing one file and open up
> another in the same directory. We do this with parquet files, and I'm not
> sure if the behavior is the same or different for CSV files.
>
> Drill won't stop writing your data in the case where it hits this limit, it
> would just mean that the results would be split into several files. This
> should only happen with very large datasets (assuming you use the global
> sort at the end of your query as discussed above, other cases can produce
> small files if that fragment of the query happened to end up with a small
> amount of data, i.e. very few records were hashed to that bucket in an
> aggregate, etc.), and may not happen at all when we are writing CSV.
>
> On Wed, Oct 21, 2015 at 10:31 AM, Jason Altekruse <
> altekruseja...@gmail.com>
> wrote:
>
> > When you say that you are running a succession of queries, are these
> > queries that could be combined together using a UNION ALL statement? I
> > don't know if there is an upper bound on the size of a CSV that we will
> > generate, but if the reason Drill is writing multiple files is because of
> > parallelization (as Ramana was guessing), you can run an operation that
> > will remove the parallelization, like a global sort on the final result
> of
> > the query.
> >
> > The issue is that if the query is scanning several files/datasources to
> > begin with, or hits an operation that the engine decides to parallelize,
> > Drill will be handling multiple streams of data through the engine for
> the
> > one query. One such operation that could cause parallelization is an
> > aggregate (by hash partitioning the data, or by sorting and using ordered
> > > partitioning). Once the data has been split up, all subsequent operations
> > can be parallelized, so Drill will try to take advantage of that, even
> > through a write operation (to allow writing from several threads/machines
> > for disk throughput). Using a dataset wide sort will force the data down
> to
> > a single stream of output. Even if we split up the operation and do sorts
> > of subsets of the data, we must merge all of the separated sorted sets
> into
> > one complete sorted set of data. If this is the last operation before we
> > start writing, the write should happen on a single thread so that the
> order
> > is preserved in the file.
> >
> > On Wed, Oct 21, 2015 at 10:14 AM, Ramana I N  wrote:
> >
>> You may be able to by playing around with the system/session options
>> planner.width.max_per_query or planner.width.max_per_node.
>> Not sure if you would want to though.
> >>
> >> Any of those options will reduce the parallelism possible either during
> >> CTAS(writing) or when reading the files through drill.
> >>
> >>
> >>
> >> Regards
> >> Ramana
> >>
> >>
> >> On Wed, Oct 21, 2015 at 8:59 AM, Boris Chmiel <
> >> boris.chm...@yahoo.com.invalid> wrote:
> >>
> >> > Hi all,
> >> > Does anyone know if there is a native way to force drill to produce
> only
> >> > one file as a result of a CTAS ? In one of my specific use cases, I run a
> >> > succession of queries with Drill to produce several csv results with
> >> > CTAS. Many folders contain multiple files and I need to run a shell
> >> script
> >> > to cat / sed the result. Is it possible to avoid that ?
> >> > cheers, Boris
> >>
> >
> >
>




[ANNOUNCE] Release of Apache Drill 1.2.0

2015-10-17 Thread Abdel Hakim Deneche
It is my pleasure to announce the release of Apache Drill 1.2.0.

This release of Drill fixes many issues and introduces a number of
enhancements, including the following ones:

- Support for JDBC data sources, such as MySQL, through a new JDBC Storage
plugin
- Partition pruning improvements
- Five new SQL window functions
- HTTPS support for Web Console operations
- Parquet metadata caching to improve query performance on a large number
of files
- DROP TABLE command

The source and binary artifacts are available at [1]
Review a complete list of fixes and enhancements at [2]

Thanks to everyone in the community who contributed in this release.

[1] http://drill.apache.org/download/
[2]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332042=12313820




[VOTE] Release Apache Drill 1.2.0 RC1

2015-10-08 Thread Abdel Hakim Deneche
Hi,

I propose the second release candidate of Apache Drill, version 1.2.0.

Here is a list of all JIRAs that have been resolved in this release [1].

The tarball artifacts are hosted at [2] and the maven artifacts are hosted
at [3].

The vote will be open for the next 72 hours ending at 7AM Pacific, October
11, 2015.

[ ] +1
[ ] +0
[ ] -1

thanks,
Hakim

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332042=12313820
[2] http://people.apache.org/~adeneche/apache-drill-1.2.0-rc1/
[3] https://repository.apache.org/content/repositories/orgapachedrill-1007




Re: Parquet #Files and closing accountor error

2015-10-08 Thread Abdel Hakim Deneche
the MapR version of Drill 1.2 has been released. I just checked and the
packages are now available.

On Thu, Oct 8, 2015 at 8:48 AM, John Omernik <j...@omernik.com> wrote:

> MapR have a package yet? :) When I compiled Drill with the MapR Profile
> myself, I couldn't get MapR Tables working, so I reverted back to Drill 1.1
> as packaged by MapR.
>
>
>
> On Thu, Oct 8, 2015 at 10:42 AM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > We fixed a similar issue as part of Drill 1.2. Can you give it a try to
> see
> > if your problem is effectively resolved ?
> >
> > Thanks
> >
> > On Thu, Oct 8, 2015 at 8:33 AM, John Omernik <j...@omernik.com> wrote:
> >
> > > I am on the MapR Packaged version of 1.1.  Do you still need the
> > > sys.version?
> > >
> > > On Thu, Oct 8, 2015 at 10:13 AM, Abdel Hakim Deneche <
> > > adene...@maprtech.com>
> > > wrote:
> > >
> > > > Hey John,
> > > >
> > > > The error you are seeing is a memory leak. Drill's allocator found
> that
> > > > about 1MB of allocated memory wasn't released at the end of the
> > > fragment's
> > > > execution.
> > > >
> > > > What version of Drill are you using ? can you share the result of:
> > > >
> > > > select * from sys.version;
> > > >
> > > > Thanks
> > > >
> > > > On Thu, Oct 8, 2015 at 7:35 AM, John Omernik <j...@omernik.com>
> wrote:
> > > >
> > > > > I am trying to complete a test case on some data. I took a schema
> and
> > > > used
> > > > > log-synth (thanks Ted) to create fairly wide table.  (89
> columns).  I
> > > > then
> > > > > outputted my data as csv files, and created a drill view, so far so
> > > good.
> > > > >
> > > > > One of the columns is a "date" column, (-MM-DD) format and has
> > 1216
> > > > > unique values. To me this would be like a 4 ish years of daily
> > > > partitioned
> > > > > data in hive, so tried to created my data partiioning on that
> field.
> > > > >
> > > > > If I create a Parquet table based on that, eventually things hork
> on
> > me
> > > > and
> > > > > I get the error below.  If I don't use the PARTITION BY clause, it
> > > > creates
> > > > > the table just fine with 30 files.
> > > > >
> > > > > Looking in the folder it was supposed to create the PARTITIONED
> > table,
> > > it
> > > > > has over 20K files in there.  Is this expected? Would we expect
> > > > #Partitions
> > > > > * #Fragment files? Could this be what the error is trying to tell
> me?
> > >  I
> > > > > guess I am just lost on what the error means, and what I
> should/could
> > > > > expect on something like this.  Is this a bug or expected?
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Error:
> > > > >
> > > > > java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR:
> > > > > IllegalStateException: Failure while closing accountor.  Expected
> > > private
> > > > > and shared pools to be set to initial values.  However, one or more
> > > were
> > > > > not.  Stats are
> > > > >
> > > > > zone init allocated delta
> > > > >
> > > > > private 100 100 0
> > > > >
> > > > > shared 00 9997806954 1193046.
> > > > >
> > > > >
> > > > > Fragment 1:25
> > > > >
> > > > >
> > > > > [Error Id: cad06490-f93e-4744-a9ec-d27cd06bc0a1 on
> > > > > hadoopmapr1.mydata.com:31010]
> > > > >
> > > > > at sqlline.IncrementalRows.hasNext(IncrementalRows.java:73)
> > > > >
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
> > > > >
> > > > > at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
> > > > >
> > > > > at sqlline.SqlLine.print(SqlLine.java:1583)
> > > > >
> > > > > at sqlline.Commands.execute(Commands.java:852)
> > > > >
> > > > > at sqlline.Commands.sql(Commands.java:751)
> > > > >
> > > > > at sqlline.SqlLine.dispatch(SqlLine.java:738)
> > > > >
> > > > > at sqlline.SqlLine.begin(SqlLine.java:612)
> > > > >
> > > > > at sqlline.SqlLine.start(SqlLine.java:366)
> > > > >
> > > > > at sqlline.SqlLine.main(SqlLine.java:259)
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
>





Re: Parquet #Files and closing accountor error

2015-10-08 Thread Abdel Hakim Deneche
Hey John,

The error you are seeing is a memory leak. Drill's allocator found that
about 1MB of allocated memory wasn't released at the end of the fragment's
execution.

What version of Drill are you using ? can you share the result of:

select * from sys.version;

Thanks
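
On the separate question about the 20K+ files: with PARTITION BY, each writer fragment creates its own file(s) for every partition value it receives, so on the order of #partition-values × #writer-fragments files is not unusual. A hedged sketch — the view and column names are hypothetical — that funnels the write through a single sorted stream, which typically leaves closer to one file per date:

~~~
create table dfs.tmp.`wide_parquet`
partition by (`date`)
as
select * from dfs.tmp.`wide_view`
order by `date`;
~~~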

On Thu, Oct 8, 2015 at 7:35 AM, John Omernik  wrote:

> I am trying to complete a test case on some data. I took a schema and used
> log-synth (thanks Ted) to create a fairly wide table (89 columns).  I then
> outputted my data as csv files, and created a drill view, so far so good.
>
> One of the columns is a "date" column in (YYYY-MM-DD) format and has 1216
> unique values. To me this would be like 4-ish years of daily partitioned
> data in hive, so I tried to create my data partitioning on that field.
>
> If I create a Parquet table based on that, eventually things hork on me and
> I get the error below.  If I don't use the PARTITION BY clause, it creates
> the table just fine with 30 files.
>
> Looking in the folder it was supposed to create the PARTITIONED table, it
> has over 20K files in there.  Is this expected? Would we expect #Partitions
> * #Fragment files? Could this be what the error is trying to tell me?   I
> guess I am just lost on what the error means, and what I should/could
> expect on something like this.  Is this a bug or expected?
>
>
>
>
>
>
>
>
> Error:
>
> java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR:
> IllegalStateException: Failure while closing accountor.  Expected private
> and shared pools to be set to initial values.  However, one or more were
> not.  Stats are
>
> zone init allocated delta
>
> private 100 100 0
>
> shared 00 9997806954 1193046.
>
>
> Fragment 1:25
>
>
> [Error Id: cad06490-f93e-4744-a9ec-d27cd06bc0a1 on
> hadoopmapr1.mydata.com:31010]
>
> at sqlline.IncrementalRows.hasNext(IncrementalRows.java:73)
>
> at
>
> sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
>
> at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
>
> at sqlline.SqlLine.print(SqlLine.java:1583)
>
> at sqlline.Commands.execute(Commands.java:852)
>
> at sqlline.Commands.sql(Commands.java:751)
>
> at sqlline.SqlLine.dispatch(SqlLine.java:738)
>
> at sqlline.SqlLine.begin(SqlLine.java:612)
>
> at sqlline.SqlLine.start(SqlLine.java:366)
>
> at sqlline.SqlLine.main(SqlLine.java:259)
>






[VOTE] Release Apache Drill 1.2.0 RC2

2015-10-08 Thread Abdel Hakim Deneche
Hi all,

I'm enjoying the release management so much that I decided to propose a
third RC of Apache Drill 1.2.0

The tarball artifacts are hosted at [1] and the maven artifacts are hosted
at [2].

The vote will be open for the next 72 hours ending at 2PM Pacific, October
11, 2015.

[ ] +1
[ ] +0
[ ] -1

thanks,
Hakim

[1] http://people.apache.org/~adeneche/apache-drill-1.2.0-rc2/
[2] https://repository.apache.org/content/repositories/orgapachedrill-1008




Re: Using Drill JDBC V1.2 with V1.1 release

2015-09-30 Thread Abdel Hakim Deneche
Hi Nikunj,

What kind of dependency issues are you seeing ? DRILL-3589's purpose was to
improve Drill JDBC dependencies, but I think the JDBC driver didn't have any
"known" issues in 1.1.

Thanks

On Wed, Sep 30, 2015 at 12:02 PM, Daniel Barclay 
wrote:

> Nikunj Thakkar wrote:
>
>> ...
>> I'm using drill v1.1.0 in my current setup and facing dependency issues. I
>> came across https://issues.apache.org/jira/browse/DRILL-3589 which seems
>> to
>> be the root cause of my problem. Issue got resolved in v1.2. Can I use it
>> with 1.1 release? Has anyone tried this?
>>
>
> I think that it happens to work, but it's definitely not known to work, so
> you'll have to check.
>
> (I think I tried that combination incidentally as part of doing something
> else, so it wasn't a solid test, just connecting and a few basic queries).
>
> Daniel
>
> --
> Daniel Barclay
> MapR Technologies
>






Re: JDBC connection pool

2015-09-29 Thread Abdel Hakim Deneche
This is most likely a bug. I think it's similar to the following bug:

https://issues.apache.org/jira/browse/DRILL-3763

Please take a look at it and feel free to add more information to the
ticket.

Thanks

On Tue, Sep 29, 2015 at 2:20 AM, xia  wrote:

> Hi, everyone,
>
>
> I'm new to drill.
>
>
> I am running concurrent queries, each holding its own JDBC connection to a
> single drillbit node.
>
>
> As I found, if one of the connections is closed, it will affect other
> connections and lead to an unexpected channel-close exception.
>
>
> So does drill jdbc support connection pool?
>
>
> How am I supposed to manage concurrent connections without interfering
> with each other?
>
>
> Many thanks!
>
>
> Iris







Re: CTAS exception

2015-09-18 Thread Abdel Hakim Deneche
This kind of error usually happens when there is an unsupported schema
change in the json files, but you should be able to reproduce the error
with just a select statement. Can you share both queries you tried (the
failing CTAS and the successful SELECT *) ?

Thanks

On Fri, Sep 18, 2015 at 5:38 AM, Stefán Baxter 
wrote:

> Hi,
>
> I have some json files that I want to transform to parquet.
>
> We have been doing this without any issues but this time around I get this
> exception:
>
> Error: SYSTEM ERROR: IllegalStateException: Failure while reading vector.
> Expected vector class of org.apache.drill.exec.vector.NullableVarCharVector
> but was holding vector class
> org.apache.drill.exec.vector.complex.MapVector.
>
> Fragment 2:0
>
> [Error Id: 86501e92-5319-4540-9cf3-9a1aede4554b on localhost:31010]
> (state=,code=0)
>
> Adding verbosity adds no additional information regarding the
> source/row/column that triggers this reaction.
>
> I have successfully executed "select * from " and that has
> run perfectly. For me this means that the issue is CTAS related.
>
> I'm running a fresh build from master to execute this.
>
> Any ideas/pointers?
>
> Regards,
>   -Stefán
>






Re: CTAS exception

2015-09-18 Thread Abdel Hakim Deneche
This could be a bug. Please open a new JIRA and add as much information as
possible. Thanks

On Fri, Sep 18, 2015 at 10:07 AM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Hi,
>
> There are no schema changes.
>
> I can select * from the table just fine!
>
> Regards,
>  -Stefán
>
> On Fri, Sep 18, 2015 at 4:55 PM, Abhishek Girish <
> abhishek.gir...@gmail.com>
> wrote:
>
> > Are you sure there is no schema change occurring between records for a
> > column other than *occurred_at*? In your second query, since the only
> > column being read is *occurred_at*, you may not be hitting the issue. The
> > first query, being a select *, would read all columns and may hit this
> > schema change error.
> >
> >
> > On Fri, Sep 18, 2015 at 9:16 AM, Stefán Baxter <
> ste...@activitystream.com>
> > wrote:
> >
> > > This fails:
> > >
> > >- select * from dfs.asa.* where occurred_at < '2015-09-18' order by
> > >occurred_at;
> > >
> > > This, oddly enough, does not fail:
> > >
> > >- select occurred_at from dfs.* where occurred_at < '2015-09-18'
> order
> > >by occurred_at;
> > >
> > > -ste
> > >
> > > On Fri, Sep 18, 2015 at 4:08 PM, Stefán Baxter <
> > ste...@activitystream.com>
> > > wrote:
> > >
> > > >
> > > > The failing select query:
> > > >
> > > > select * from dfs.* where occurred_at < '2015-09-18' order by
> > > occurred_at;
> > > >
> > > > -ste
> > > >
> > > > On Fri, Sep 18, 2015 at 4:02 PM, Stefán Baxter <
> > > ste...@activitystream.com>
> > > > wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> Both statements select everything but the CTAS statement included a
> > date
> > > >> filter + date order.
> > > >>
> > > >> The date field is always the same (extended json format date ISO)
> and
> > is
> > > >> always present so I can say, with 100% certainty, that there is no
> > > schema
> > > >> change involved.
> > > >>
> > > >> The select statement fails as well when I add the condition +
> > ordering.
> > > >>
> > > >> Regards,
> > > >>  -Stefan
> > > >>
> > > >> On Fri, Sep 18, 2015 at 3:37 PM, Abdel Hakim Deneche <
> > > >> adene...@maprtech.com> wrote:
> > > >>
> > > >>> This kind of errors usually happens when there is an unsupported
> > schema
> > > >>> change in the json files, but you should be able to reproduce the
> > error
> > > >>> with just a select statement. Can you share both queries you tried
> > (the
> > > >>> failing CTAS and the successful SELECT *) ?
> > > >>>
> > > >>> Thanks
> > > >>>
> > > >>> On Fri, Sep 18, 2015 at 5:38 AM, Stefán Baxter <
> > > >>> ste...@activitystream.com>
> > > >>> wrote:
> > > >>>
> > > >>> > Hi,
> > > >>> >
> > > >>> > I have some json files that I want to transform to parquet.
> > > >>> >
> > > >>> > We have been doing this without any issues but this time around I
> > get
> > > >>> this
> > > >>> > exception:
> > > >>> >
> > > >>> > Error: SYSTEM ERROR: IllegalStateException: Failure while reading
> > > >>> vector.
> > > >>> > Expected vector class of
> > > >>> org.apache.drill.exec.vector.NullableVarCharVector
> > > >>> > but was holding vector class
> > > >>> > org.apache.drill.exec.vector.complex.MapVector.
> > > >>> >
> > > >>> > Fragment 2:0
> > > >>> >
> > > >>> > [Error Id: 86501e92-5319-4540-9cf3-9a1aede4554b on
> localhost:31010]
> > > >>> > (state=,code=0)
> > > >>> >
> > > >>> > Adding verbosity adds no additional information regarding the
> > > >>> > source/row/column that triggers this reaction.
> > > >>> >
> > > >>> > I have successfully executed "select * from " and
> > > that
> > > >>> has
> > > >>> > run perfectly. For me this means that the issue is CTAS related.
> > > >>> >
> > > >>> > I'm running a fresh build from master to execute this.
> > > >>> >
> > > >>> > Any ideas/pointers?
> > > >>> >
> > > >>> > Regards,
> > > >>> >   -Stefán
> > > >>> >
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>





Re: Resetting an option

2015-09-17 Thread Abdel Hakim Deneche
I am looking at the corresponding pull request:

https://github.com/apache/drill/pull/159

and I have a question I can't seem to find an answer to in this discussion:

Let's say a user changes an option A both at the SESSION and SYSTEM level.
What happens when the user issues "ALTER SYSTEM RESET A"? Does it reset A
only at the SYSTEM level but leave it changed at the SESSION level, or do
we want it to reset both SESSION and SYSTEM levels of A ?
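
For readers following the thread, the statements being discussed look like this (the option name is only an example):

~~~
ALTER SESSION SET `planner.enable_hashjoin` = false;  -- session-level override
ALTER SYSTEM SET `planner.enable_hashjoin` = false;   -- system-level value for all sessions
ALTER SESSION RESET `planner.enable_hashjoin`;        -- drop the session override
ALTER SYSTEM RESET `planner.enable_hashjoin`;         -- the case in question above
~~~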



On Mon, Aug 10, 2015 at 3:24 PM, Abhishek Girish  wrote:

> A session level *set* operation, by definition, should override the
> corresponding system level option for the duration of the session.
>
> Going by that, I think, a *reset* operation should default it back to the
> value held by the system level option. If a user (say an admin) has updated
> the corresponding system option, the reverted value would be a custom,
> non-Drill-default value. And if not, the reverted value would be the
> Drill-default value. This would make it simpler to manage.
>
>
> On Mon, Aug 10, 2015 at 2:51 PM, Jason Altekruse  >
> wrote:
>
> > I don't know if I missed something, but the Postgres docs seem to
> indicate
> > that there is no equivalent to the concept of a SYSTEM option that exists
> > in Drill, which can be set with a query. Options can be set at server
> > startup, either in a configuration file or with a command line parameter
> > [2]. Once the server is running, it appears that the closest to our ALTER
> > SYSTEM statement would be the feature to set options at a user or
> database
> > level [2].
> >
> > Here is an excerpt from the docs on the DEFAULT option value: [1] -
> DEFAULT
> > can be written to specify resetting the parameter to its default value
> > (that is, whatever value it would have had if no SET had been executed in
> > the current session).
> >
> > We should probably just try it out to confirm, but this statement leads
> me
> > to believe that the option will return to the value set in the startup
> > config file/parameter or what was set at the user/database level, not the
> > system default. This is in agreement with my intuition on the issue, the
> > whole idea behind nesting these configurations, from Drill default to
> > System and then to Session would seem to be trying to provide users with
> > the safest environment possible.
> >
> > Setting something at the system level should only be done if the
> > administrator is certain that the non-standard option is a helpful
> > modification for the majority of their users. Thus users can choose to
> > override it, but their escape hatch should bring them back to the option
> > values given by their administrator, not Drill defaults.
> >
> > [1] http://www.postgresql.org/docs/9.2/static/sql-set.html
> > [2] http://www.postgresql.org/docs/9.2/static/config-setting.html
> >
> > On Mon, Aug 10, 2015 at 2:25 PM, Sudheesh Katkam 
> > wrote:
> >
> > > Correction: currently any user can SET or RESET an option for session
> and
> > > system.
> > >
> > > > On Aug 10, 2015, at 2:20 PM, Sudheesh Katkam 
> > > wrote:
> > > >
> > > > Hey y‘all,
> > > >
> > > > Re DRILL-1065 , at
> > > system level (ALTER system RESET …), resetting an option would mean
> > > changing the value to the default provided by Drill. But, at session
> > level
> > > (ALTER session RESET …), would resetting an option mean:
> > > > (a) changing the value to the default provided by Drill? or,
> > > > (b) changing the value to the system value, that an admin could’ve
> > > changed?
> > > >
> > > > (b) would not allow non-admin users to know what the default is
> > > (easily). However, for a given option, (a) would allow a non-admin user
> > to
> > > know what the default is (by resetting) and what the system setting is
> > > (from sys.options). Opinions?
> > > >
> > > > Thank you,
> > > > Sudheesh
> > >
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Reg Connecting Apache Drill to Oracle DB

2015-09-15 Thread Abdel Hakim Deneche
Hi Siva,

Now that the JDBC storage plugin has been merged into master, you can just
get the latest version of Drill from github and build it yourself. Let me
know if you need more information about how to do it.
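
Once it is built and the plugin is enabled, an Oracle table can be queried
like any other Drill source. A rough sketch (the plugin name "oracle" and the
SCOTT.EMP table are just placeholders for whatever you configure):

  SELECT * FROM oracle.SCOTT.EMP LIMIT 10;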

Thanks

On Tue, Sep 15, 2015 at 7:12 AM, Tugdual Grall  wrote:

> Hello,
>
> You can use the JDBC Storage Plugin:
> https://github.com/apache/drill/tree/master/contrib/storage-jdbc
>
> This plugin is still under development and is not packaged in the Apache Drill 1.1 release.
>
> You can build it from sources and use it, but it is incomplete. You can
> look at the development tasks in the JIRA:
> https://issues.apache.org/jira/browse/DRILL-3180
>
> As you can see from this JIRA it should be available in Drill 1.2.0.
>
> Regards
> Tug
> @tgrall
>
> On Tue, Sep 15, 2015 at 4:02 PM, Kuppusamy, Sivaraman <
> sivaraman.kuppus...@altisource.com> wrote:
>
> > Hi Team,
> >
> > I have a scenario where we need to connect Apache Drill on a Linux
> > machine to an Oracle DB.
> > Please let me know if this is possible, and help me with the required steps &
> > instructions.
> >
> > Thanks,
> > Siva
> >
> >
> ***
> >
> > This email message and any attachments are intended solely for the use of
> > the addressee. If you are not the intended recipient, you are prohibited
> > from reading, disclosing, reproducing, distributing, disseminating or
> > otherwise using this transmission. If you have received this message in
> > error, please promptly notify the sender by reply email and immediately
> > delete this message from your system. This message and any attachments
> may
> > contain information that is confidential, privileged or exempt from
> > disclosure. Delivery of this message to any person other than the
> intended
> > recipient is not intended to waive any right or privilege. Message
> > transmission is not guaranteed to be secure or free of software viruses.
> >
> >
> ***
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: Querying parquet is giving Reading past RLE/BitPacking stream

2015-09-15 Thread Abdel Hakim Deneche
Hi Anas,

Can you please open a JIRA for this? It would be really helpful if you could
attach the parquet file to the JIRA.

Thanks

On Mon, Sep 14, 2015 at 9:21 PM, Anas Mesrah  wrote:

> Hi,
>
> I am trying to query large parquet files (on the local file system) that have
> optional string fields. The query is giving me the following error. Any
> clue what might be the cause?
>
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: Reading past RLE/BitPacking stream. Fragment 0:0
> [Error Id: 54f60733-8495-4542-9ea2-7e903cf1dadd
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Community Hangout happening now!

2015-09-15 Thread Abdel Hakim Deneche
Please join us in the weekly community hangout:

https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc

-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training



Re: BlockMissingException

2015-09-09 Thread Abdel Hakim Deneche
Hi Grant,

Do you see any other errors in the logs?

I don't think the WorkEventBus warning has anything to do with the issue.
It's a warning you can expect to see for failed/cancelled queries.

Thanks

On Wed, Sep 9, 2015 at 10:32 AM, Grant Overby (groverby)  wrote:

> I'm still getting this, but it seems to go away after a while. It happens
> with multiple blocks.
>
> I'm seeing the following lines in the logs. I suspect they're related:
>
>
> 2015-09-09 12:38:12,556 [WorkManager-1147] WARN
> o.a.d.exec.rpc.control.WorkEventBus - Fragment
> 2a0f9fbf-f85b-39a3-dce2-891d7f62b385:1:40 not found in the work bus.
>
> Any help or pointers would be greatly appreciated.
>
> [
> http://www.cisco.com/web/europe/images/email/signature/est2014/logo_06.png?ct=1398192119726
> ]
>
> Grant Overby
> Software Engineer
> Cisco.com
> grove...@cisco.com
> Mobile: 865 724 4910
>
>
>
>
>
>
> [http://www.cisco.com/assets/swa/img/thinkbeforeyouprint.gif] Think
> before you print.
>
> This email may contain confidential and privileged material for the sole
> use of the intended recipient. Any review, use, distribution or disclosure
> by others is strictly prohibited. If you are not the intended recipient (or
> authorized to receive for the recipient), please contact the sender by
> reply email and delete all copies of this message.
>
> Please click here<
> http://www.cisco.com/web/about/doing_business/legal/cri/index.html> for
> Company Registration Information.
>
>
>
>
>
> From: "Grant Overby (groverby)"
> Date: Tuesday, September 8, 2015 at 11:17 AM
> To: "user@drill.apache.org" <
> user@drill.apache.org>
> Subject: BlockMissingException
>
> Drill is throwing a block missing exception; however, hdfs seems healthy.
> Thoughts?
>
> From Drill's web ui after executing a query:
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> BlockMissingException: Could not obtain block:
> BP-1605794487-10.0.1.3-1435700184285:blk_1073828756_89484
> file=/warehouse2/completed/events/connection_events/1441707300/1441707312267-9-f98c71d9-dff0-4dad-8220-01b7a8a5e1d5.parquet
> Fragment 1:19 [Error Id: 9a0b442a-e5e1-44f9-9200-fff9f59b990a on
> twig03.twigs:31010]
>
> Retrieving the file from hdfs:
>
> root@twig03:~# hdfs dfs -get
> /warehouse2/completed/events/connection_events/1441707300/1441707312267-9-f98c71d9-dff0-4dad-8220-01b7a8a5e1d5.parquet
> /tmp/.
>
> root@twig03:~# ls /tmp/*.parquet
>
> /tmp/1441707312267-9-f98c71d9-dff0-4dad-8220-01b7a8a5e1d5.parquet
>
> hdfs report:
>
> root@twig03:~# hdfs dfsadmin -report
>
> Configured Capacity: 7856899358720 (7.15 TB)
>
> Present Capacity: 7856899358720 (7.15 TB)
>
> DFS Remaining: 4567228003960 (4.15 TB)
>
> DFS Used: 3289671354760 (2.99 TB)
>
> DFS Used%: 41.87%
>
> Under replicated blocks: 22108
>
> Blocks with corrupt replicas: 0
>
> Missing blocks: 0
>
>
> -
>
> Live datanodes (2):
>
>
> Name: 10.0.1.4:50010 (twig04.twigs)
>
> Hostname: twig04.twigs
>
> Decommission Status : Normal
>
> Configured Capacity: 3928449679360 (3.57 TB)
>
> DFS Used: 1644836539588 (1.50 TB)
>
> Non DFS Used: 0 (0 B)
>
> DFS Remaining: 2283613139772 (2.08 TB)
>
> DFS Used%: 41.87%
>
> DFS Remaining%: 58.13%
>
> Configured Cache Capacity: 0 (0 B)
>
> Cache Used: 0 (0 B)
>
> Cache Remaining: 0 (0 B)
>
> Cache Used%: 100.00%
>
> Cache Remaining%: 0.00%
>
> Xceivers: 3
>
> Last contact: Tue Sep 08 11:15:47 EDT 2015
>
>
>
> Name: 10.0.1.3:50010 (twig03.twigs)
>
> Hostname: twig03.twigs
>
> Decommission Status : Normal
>
> Configured Capacity: 3928449679360 (3.57 TB)
>
> DFS Used: 1644834815172 (1.50 TB)
>
> Non DFS Used: 0 (0 B)
>
> DFS Remaining: 2283614864188 (2.08 TB)
>
> DFS Used%: 41.87%
>
> DFS Remaining%: 58.13%
>
> Configured Cache Capacity: 0 (0 B)
>
> Cache Used: 0 (0 B)
>
> Cache Remaining: 0 (0 B)
>
> Cache Used%: 100.00%
>
> Cache Remaining%: 0.00%
>
> Xceivers: 2
>
> Last contact: Tue Sep 08 11:15:47 EDT 2015
>
>
>
> [
> http://www.cisco.com/web/europe/images/email/signature/est2014/logo_06.png?ct=1398192119726
> ]
>
> Grant Overby
> Software Engineer
> Cisco.com
> grove...@cisco.com
> Mobile: 865 724 4910
>
>
>
>
>
>
> [http://www.cisco.com/assets/swa/img/thinkbeforeyouprint.gif] Think
> before you print.
>
> This email may contain confidential and privileged material for the sole
> use of the intended recipient. Any review, use, distribution or disclosure
> by others is strictly prohibited. If you are not the intended recipient (or
> authorized to receive for the recipient), please contact the sender by
> reply email and delete all copies of this message.
>
> Please click here<
> http://www.cisco.com/web/about/doing_business/legal/cri/index.html> for
> Company Registration Information.
>
>
>
>
>


-- 

Abdelhakim Deneche


Re: what is the meaning of the 0 in '0:jdbc:drill'?

2015-09-08 Thread Abdel Hakim Deneche
Because Sqlline can open multiple jdbc connections, each connection
receives a unique id (starting from 0). "0" is just the id of the jdbc
connection you are using.
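
As a rough illustration (the connection URL here is just an example), if you
open a second connection from the same sqlline session, the prompt prefix
changes to the id of the new connection:

  0: jdbc:drill:zk=local> !connect jdbc:drill:zk=local
  ...
  1: jdbc:drill:zk=local>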

Thanks



On Sun, Sep 6, 2015 at 9:02 PM, 寻 <1813710...@qq.com> wrote:

> -- Original Message --
> From: "寻" <1813710...@qq.com>;
> Sent: Monday, September 7, 2015, 12:01 PM
> To: "user";
>
> Subject: what is the meaning of the 0 in '0:jdbc:drill'?
>
>
>
> Hello, Drill team:
> I am a newbie using Drill, and I am very glad to be able to get help from
> this mailing list.
> When I enter the Drill shell, whether in distributed mode or embedded
> mode, there is a '0:jdbc:drill' prompt. Could you please tell me what the 0
> in '0:jdbc:drill' means?
> Thanks very much
> Best regards
> Shayne
>



-- 

Abdelhakim Deneche

Software Engineer

  


Now Available - Free Hadoop On-Demand Training


