Re: Drill's Future

2020-05-19 Thread Herman Tan
Hi,

I appreciate the work that this community has put into Drill.
I am using it regularly and promoting its use.
I can help with testing and documentation.


Regards,
Herman Tan

On Sun, 17 May 2020, 9:56 pm Charles Givre,  wrote:

> Hey Bob,
> Great to hear from you and thanks for volunteering. Between you and me,
> I've been considering forking Drill, moving it to the OWASP foundation, and
> integrating a lot of security-related stuff that either hasn't been
> accepted or I haven't submitted. I suspect this would help increase Drill's
> user base and hence get better features etc.
>
> What do you think?
> -- C
>
>
>
> > On May 15, 2020, at 6:31 PM, Bob Rudis  wrote:
> >
> > I can carve out work-time to review PRs.
> >
> > -boB
> >
> > On Fri, May 15, 2020 at 11:25 AM Prabhakar Bhosaale
> >  wrote:
> >>
> >> Hi Charles,
> >> I have never contributed to any open source community before this. Also
> I
> >> am not an expert in Java but willing to contribute for sure. Please let me
> >> know in what capacity I can be helpful here.
> >>
> >> Also, I would like to know if there any efforts going on to identify
> >> primary support for Apache drill and what is its future in case there
> is no
> >> primary supporter? My organization is planning to build some tools using
> >> drill and this news will definitely impact the same. thx
> >>
> >> Regards
> >> Prabhakar
> >>
> >> On Fri, May 15, 2020 at 8:18 PM Charles Givre  wrote:
> >>
> >>> Dear Drill Users,
> >>> I hope everyone is weathering the COVID storm and staying safe.
> >>>
> >>> As you may or may not be aware, I found out this week that MapR's
> >>> successor, HPE has for all intents and purposes ended its support for
> >>> Drill.  Whilst they will be contributing minimal bug fixes, they will
> not
> >>> be adding any new functionality or new features.  This leaves the open
> >>> source Drill without a primary backer.  Under the Apache rules, to
> commit
> >>> code to the code base it must be reviewed and voted upon. While we have
> >>> committers, we do not currently have enough code reviewers to review
> new
> >>> commits.
> >>>
> >>> My question to you as a Drill user is whether or not Drill is offering
> >>> value to you and your organization.  If it is, and you would like to
> see
> >>> Drill development continue, then we need additional volunteers to step
> up
> >>> and review pull requests.
> >>>
> >>> If you're interested in reviewing code for Drill, please let me know.
> >>>
> >>> Thank you for your continued support and stay safe!
> >>> -- Charles
>
>


Re: Need help with Drill SQL for a json file

2020-05-17 Thread Herman Tan
Dear Drill,

My mistake, I missed "Response" in the first field.
Apologies for this careless mistake.

Regards,
Herman

On Mon, May 18, 2020 at 10:35 AM Herman Tan  wrote:

> Hi Drill community,
>
> I need help in making a SQL Query to extract from attached JSON file.
>
> How do I create a sql select to get Location/Address/Label and the
> Location/DisplayPosition/Latitude and Longitude fields in 3 columns?
>
> I have been trying all different ways but I have been getting NULLs.
>
> Your help will be appreciated.
>
> Regards,
> Herman Tan
>
> One record:
> -
> {
>   "MetaInfo":{"Timestamp":"2020-05-17T08:18:09.892+"},
>   "View":[{
>     "_type":"SearchResultsViewType",
>     "ViewId":0,
>     "Result":[{
>       "Relevance":0.8,
>       "MatchLevel":"street",
>       "MatchQuality":{
>         "Country":1.0,
>         "City":1.0,
>         "District":1.0,
>         "Street":[0.85],
>         "PostalCode":1.0
>       },
>       "Location":{
>         "LocationId":"NT_PSJYsiXVY9dOfsNAh-g6bC",
>         "LocationType":"point",
>         "DisplayPosition":{"Latitude":-8.14916,"Longitude":113.71817},
>         "NavigationPosition":[{"Latitude":-8.14916,"Longitude":113.71817}],
>         "MapView":{
>           "TopLeft":{"Latitude":-8.14823,"Longitude":113.71793},
>           "BottomRight":{"Latitude":-8.14943,"Longitude":113.71926}
>         },
>         "Address":{
>           "Label":"Jalan Perumahan Pesona Regency, Jember 68111, Indonesia",
>           "Country":"IDN",
>           "County":"Jawa Timur",
>           "City":"Jember",
>           "District":"Patrang",
>           "Subdistrict":"Patrang Kel.",
>           "Street":"Jalan Perumahan Pesona Regency",
>           "PostalCode":"68111",
>           "AdditionalData":[{"value":"Indonesia","key":"CountryName"},{"value":"Jawa Timur","key":"CountyName"}]
>         }
>       },
>       "AdditionalData":[]
>     }]
>   }]
> }
> -
>
>
>
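For anyone following along: with the missing top-level "Response" field accounted for, a query along these lines should produce the three columns. This is a sketch only; the dfs.tmp workspace and file name are assumptions from the attachment, and the exact nesting may need adjusting to your file.

```sql
-- Sketch: flatten the Result array, then pull out the nested map fields.
SELECT r.res.Location.Address.Label            AS label,
       r.res.Location.DisplayPosition.Latitude  AS latitude,
       r.res.Location.DisplayPosition.Longitude AS longitude
FROM (
  SELECT FLATTEN(t.Response.`View`[0].`Result`) AS res
  FROM dfs.tmp.`here_api_20200517_162435.json` t
) r;
```

The subquery is needed because FLATTEN produces one row per array element, after which each element can be addressed as an ordinary map.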


Need help with Drill SQL for a json file

2020-05-17 Thread Herman Tan
Hi Drill community,

I need help in making a SQL Query to extract from attached JSON file.

How do I create a sql select to get Location/Address/Label and the
Location/DisplayPosition/Latitude and Longitude fields in 3 columns?

I have been trying all different ways but I have been getting NULLs.

Your help will be appreciated.

Regards,
Herman Tan

One record:
-
{
  "MetaInfo":{"Timestamp":"2020-05-17T08:18:09.892+"},
  "View":[{
    "_type":"SearchResultsViewType",
    "ViewId":0,
    "Result":[{
      "Relevance":0.8,
      "MatchLevel":"street",
      "MatchQuality":{
        "Country":1.0,
        "City":1.0,
        "District":1.0,
        "Street":[0.85],
        "PostalCode":1.0
      },
      "Location":{
        "LocationId":"NT_PSJYsiXVY9dOfsNAh-g6bC",
        "LocationType":"point",
        "DisplayPosition":{"Latitude":-8.14916,"Longitude":113.71817},
        "NavigationPosition":[{"Latitude":-8.14916,"Longitude":113.71817}],
        "MapView":{
          "TopLeft":{"Latitude":-8.14823,"Longitude":113.71793},
          "BottomRight":{"Latitude":-8.14943,"Longitude":113.71926}
        },
        "Address":{
          "Label":"Jalan Perumahan Pesona Regency, Jember 68111, Indonesia",
          "Country":"IDN",
          "County":"Jawa Timur",
          "City":"Jember",
          "District":"Patrang",
          "Subdistrict":"Patrang Kel.",
          "Street":"Jalan Perumahan Pesona Regency",
          "PostalCode":"68111",
          "AdditionalData":[{"value":"Indonesia","key":"CountryName"},{"value":"Jawa Timur","key":"CountyName"}]
        }
      },
      "AdditionalData":[]
    }]
  }]
}
-


here_api_20200517_162435.json
Description: application/json


Re: JDBC datasource on Websphere server 8.5.5.9

2020-03-29 Thread Herman Tan
Sorry, please ignore.


On Mon, Mar 30, 2020 at 1:55 PM Herman Tan  wrote:

> Igor,
>
> You are a drill user?
> That's great!
> Our paths have crossed again.
>
> Herman
>
> On Tue, Mar 24, 2020 at 9:28 PM Igor Guzenko 
> wrote:
>
>> Hello Prabhakar,
>>
>> Seems like there is a similar question on
>>
>> https://stackoverflow.com/questions/1677722/websphere-application-server-data-source
>> and
>> easiest path is to uncheck enable2Phase somewhere in WebSphere driver
>> settings.
>> Or harder way is that you can try to make a small project just to wrap
>> Drill driver with some of JDBC connection pools like HikariCP.
>>
>> Thanks,
>> Igor
>>
>> On Tue, Mar 24, 2020 at 2:49 PM Charles Givre  wrote:
>>
>> > Hi Prabhakar,
>> > Thanks for your interest in Drill.  Can you share your config info as
>> well
>> > as the versions of Drill and JDBC Driver that you are using?
>> > Thanks,
>> > -- C
>> >
>> >
>> > > On Mar 24, 2020, at 7:07 AM, Prabhakar Bhosaale <
>> bhosale@gmail.com>
>> > wrote:
>> > >
>> > > Hi Team,
>> > >
>> > > we are trying to connect to apache drill from websphere 8.5.5.9.  We
> created the Data provider and data source as per the standard process of
>> > WAS.  But when we try to test the connection, it gives following error.
>> > >
>> > > "Test connection operation failed for data source retrievalds on
>> server
>> > ARCHIVE_SERVER at node ARCHIVALPROFILENode1 with the following
>> exception:
>> > java.lang.Exception: DSRA8101E: DataSource class cannot be used as
>> > one-phase: ClassCastException: org.apache.drill.jdbc.Driver incompatible
>> > with javax.sql.ConnectionPoolDataSource   "
>> > >
>> > > We are using SDK version 1.8
>> > > Attaching the JVM log also for your reference. thx
>> > >
>> > > Any pointers or any documentation in this regards would be
>> appreciated.
>> > Please help. thx
>> > >
>> > > Regards
>> > > Prabhakar
>> > > 
>> >
>> >
>>
>


Re: JDBC datasource on Websphere server 8.5.5.9

2020-03-29 Thread Herman Tan
Igor,

You are a drill user?
That's great!
Our paths have crossed again.

Herman

On Tue, Mar 24, 2020 at 9:28 PM Igor Guzenko 
wrote:

> Hello Prabhakar,
>
> Seems like there is a similar question on
>
> https://stackoverflow.com/questions/1677722/websphere-application-server-data-source
> and
> easiest path is to uncheck enable2Phase somewhere in WebSphere driver
> settings.
> Or harder way is that you can try to make a small project just to wrap
> Drill driver with some of JDBC connection pools like HikariCP.
>
> Thanks,
> Igor
>
> On Tue, Mar 24, 2020 at 2:49 PM Charles Givre  wrote:
>
> > Hi Prabhakar,
> > Thanks for your interest in Drill.  Can you share your config info as
> well
> > as the versions of Drill and JDBC Driver that you are using?
> > Thanks,
> > -- C
> >
> >
> > > On Mar 24, 2020, at 7:07 AM, Prabhakar Bhosaale  >
> > wrote:
> > >
> > > Hi Team,
> > >
> > > we are trying to connect to apache drill from websphere 8.5.5.9.  We
> > created the Data provider and data source as per the standard process of
> > WAS.  But when we try to test the connection, it gives following error.
> > >
> > > "Test connection operation failed for data source retrievalds on server
> > ARCHIVE_SERVER at node ARCHIVALPROFILENode1 with the following exception:
> > java.lang.Exception: DSRA8101E: DataSource class cannot be used as
> > one-phase: ClassCastException: org.apache.drill.jdbc.Driver incompatible
> > with javax.sql.ConnectionPoolDataSource   "
> > >
> > > We are using SDK version 1.8
> > > Attaching the JVM log also for your reference. thx
> > >
> > > Any pointers or any documentation in this regards would be appreciated.
> > Please help. thx
> > >
> > > Regards
> > > Prabhakar
> > > 
> >
> >
>


Re: Caused by: org.apache.drill.exec.exception.DrillbitStartupException: Drillbit is disallowed to bind to loopback address in distributed mode.

2018-12-08 Thread Herman Tan
On Sun, 9 Dec 2018, 2:39 AM Arina Yelchiyeva wrote:

> Yes, this option helps, but it is not recommended for production, only for
> testing purposes.
>
> Kind regards,
> Arina
>
> > On Dec 8, 2018, at 8:08 PM, Alaa Salmo  wrote:
> >
> > I found the solution. When we run drill & Zookeeper on the same
> machine(we
> > run drill under distributed mode), we need to add
> > allow_loopback_address_binding under drill.exec in Drill configuration
> file
> > (drill-override.conf). We need to set  allow_loopback_address_binding to
> > true.
> >
> >
> >
> > On Sat, Dec 8, 2018 at 12:22 PM Arina Yelchiyeva <
> arina.yelchiy...@gmail.com>
> > wrote:
> >
> >> Please refer to this answer on the stackoverflow:
> >>
> >>
> >>
> https://stackoverflow.com/questions/40506221/how-to-start-drillbit-locally-in-distributed-mode
> >>
> >>> On Dec 8, 2018, at 2:10 AM, Alaa Salmo  wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I am trying to install Hadoop/Zookeeper with Apache Drill (distributed
> >>> mode). I am facing the following error
> >>> "org.apache.drill.exec.exception.DrillbitStartupException:
> >>> Drillbit is disallowed to bind to loopback address in distributed
> >> mode.". I
> >>> tried to google, and the answer was to change drill-override.conf. I am still
> >>> facing the same problem. Has anybody had the same issue? Does Drill work in
> >> the
> >>> local machine with Zookeeper?
> >>>
> >>> drill.exec: {
> >>> cluster-id: "drillbits1",
> >>> zk.connect: "mgelbana-machine:2181"
> >>> }
> >>>
> >>> OR
> >>>
> >>> drill.exec: {
> >>> cluster-id: "drillbits1",
> >>> zk.connect: "127.0.0.1:2181"
> >>> }
> >>>
> >>>
> >>>
>  On Sat, Dec 1, 2018 at 9:09 AM Alaa Salmo 
> wrote:
> 
>  Hi,
> 
>  I am installing Drill on my virtual machine (Centos). I am facing one
>  issue. I tried to google but I could not solve it
> 
>  bin/drillbit.sh start
>  cat /usr/local/apache-drill-1.14.0/log/drillbit.out
> 
> 
> >>
> ***
>  Exception in thread "main"
>  org.apache.drill.exec.exception.DrillbitStartupException: Failure
> during
>  initial startup of Drillbit.
>    at
> org.apache.drill.exec.server.Drillbit.start(Drillbit.java:428)
>    at
> org.apache.drill.exec.server.Drillbit.start(Drillbit.java:394)
>    at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:390)
>  Caused by: org.apache.drill.exec.exception.DrillbitStartupException:
>  Drillbit is disallowed to bind to loopback address in distributed
> mode.
>    at
> 
> org.apache.drill.exec.service.ServiceEngine.start(ServiceEngine.java:97)
>    at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:191)
>    at
> org.apache.drill.exec.server.Drillbit.start(Drillbit.java:424)
> 
> 
> >>
> ***
> 
>  - I tried to change the hostname but I am still facing the same issue
>  - I tried to change the Java 8 JDK from OpenJDK to Oracle but I am still
> facing
>  the same issue
> 
>  Please can you provide me the right answer?
> 
>  Regards,
>  Alaa
> 
> 
> >>
> >>
>
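For reference, the workaround Alaa describes amounts to a drill-override.conf along these lines. This is a sketch; the cluster-id and zk.connect values are the poster's own, and as Arina notes, the loopback option is intended for testing only, not production:

```
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "127.0.0.1:2181",
  allow_loopback_address_binding: true
}
```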


Re: ERROR is reading parquet data after create table

2018-11-11 Thread Herman Tan
Hi,

I am running into this problem again.
The solution is to define a view that is a "union all" of individual select
queries of single csv files.
Can I request that the hardcoded timeout values be parametrized please?

Herman

On Tue, Oct 2, 2018 at 7:29 PM Herman Tan  wrote:

> Hi,
>
> I found the hard coded parameter in the source code
>
>
> https://github.com/apache/drill/blob/8edeb49873d1a1710cfe28e0b49364d07eb1aef4/exec/java-exec/src/main/java/org/apache/drill/exec/store/TimedCallable.java
>
> LINE 52   : private static long TIMEOUT_PER_RUNNABLE_IN_MSECS = 15000;
> LINE 210 :timeout = TIMEOUT_PER_RUNNABLE_IN_MSECS * ((tasks.size() -
> 1)/parallelism + 1);
>
> parallelism param is also hardcoded
>
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/metadata/Metadata.java
> LINE 343: 16
>
> In my case: task size is 29, 15000 * ((29-1)/16 + 1) = 15000 * 2 = 30000.
> 16 runnables execute round robin on 29 tasks.  Each runnable is given 30000
> ms to time out.
>
> This is the error message:
> Waited for 30000 ms, but only 11 tasks for 'Fetch parquet metadata' are
> complete. Total number of tasks 29, parallelism 16.
>
> TimedCallable.java:
> LINE 248:final String errMsg = String.format("Waited for %d ms, but
> only %d tasks for '%s' are complete." +
> " Total number of tasks %d, parallelism %d.", timeout,
> futureMapper.count, activity, tasks.size(), parallelism);
>
> Shouldn't these be parameterized in "options" based on the infrastructure?
>
> Regards,
> Herman
>
>
>
>
> On Tue, Oct 2, 2018 at 6:47 PM Herman Tan  wrote:
>
>> Hi,
>>
>> I have restarted drill and run the script again.
>>
>> select * from dfs.tmp.`load_pos_sales_detail_tbl`;
>> -- SQL Error: RESOURCE ERROR: Waited for 30000 ms, but only 11 tasks for
>> 'Fetch parquet metadata' are complete. Total number of tasks 29,
>> parallelism 16.
>>
>> The 29 tasks is related to the 29 parquet files in the folder.
>> To check if any of the parquet files has an error, I ran the following
>> SQL on each parquet file in the folder.  ALL PASSED. (SQL Below).
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_0_0.parquet`;
>>
>> So it seems that for this table drill can only get the metadata for 11
>> parquet files before it times out.
>> The time-out is a calculation and it varies from size of table.
>> I checked the source code but I cannot find the calculation of the
>> timeout of "30000 ms".
>> When I am lucky, drill can resolve the metadata for 29 files in 30000 ms
>> and it passes.
>>
>> I plan to use drill for production but this bothers me that there is a
>> limit on the number of parquet files and the timeout parameter cannot be
>> tuned.
>>
>> Does anyone have any ideas?
>>
>> Regards,
>> Herman
>> --  SQL BELOW -
>>
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_0_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_10_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_10_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_11_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_11_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_12_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_12_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_13_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_13_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_14_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_15_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_15_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_16_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_16_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_1_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_2_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_3_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_4_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_4_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_5_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_5_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_6_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_6_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_7_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_7_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_8_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_8_1.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_9_0.parquet`;
>> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_9_1.parquet`;
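The workaround mentioned at the top of this message (a view that is a "union all" of per-file selects) might look like the following sketch. The view name is illustrative; one SELECT is needed per file:

```sql
CREATE OR REPLACE VIEW dfs.tmp.`sales_detail_vw` AS
SELECT * FROM dfs.tmp.`load_pos_sales_detail_tbl/1_0_0.parquet`
UNION ALL
SELECT * FROM dfs.tmp.`load_pos_sales_detail_tbl/1_1_0.parquet`
UNION ALL
SELECT * FROM dfs.tmp.`load_pos_sales_detail_tbl/1_2_0.parquet`;
```

Because each branch reads a single file, the per-directory metadata fetch that triggers the timeout is avoided, at the cost of maintaining the view by hand.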

Re: ERROR is reading parquet data after create table

2018-10-11 Thread Herman Tan
I am running in single instance embedded mode under windows.

On Thu, 11 Oct 2018, 3:26 PM Divya Gehlot,  wrote:

> Hi ,
> Can somebody clarify on the number of tasks, What I understood from Herman
> is if you have 29 parquet files than Drill actually creates 29 tasks .
>
> Herman , Can I know you are running drill on embedded mode or distributed
> mode.
>
> I am running Drill in production for multiple sources and I do have many(
> like 2 years worth of data)  parquet files and never encountered this issue
> .
> Yeah at times when I have parquet in multi directory hierarchy or parquet
> files are small I do get either time out or query is too slow .
>
> Thoughts please?
>
> Thanks,
> Divya
>
> On Tue, 2 Oct 2018 at 18:48, Herman Tan  wrote:
>
> > Hi,
> >
> > I have restarted drill and run the script again.
> >
> > select * from dfs.tmp.`load_pos_sales_detail_tbl`;
> > -- SQL Error: RESOURCE ERROR: Waited for 30000 ms, but only 11 tasks for
> > 'Fetch parquet metadata' are complete. Total number of tasks 29,
> > parallelism 16.
> >
> > The 29 tasks is related to the 29 parquet files in the folder.
> > To check if any of the parquet files has an error, I ran the following
> SQL
> > on each parquet file in the folder.  ALL PASSED. (SQL Below).
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_0_0.parquet`;
> >
> > So it seems that for this table drill can only get the metadata for 11
> > parquet files before it times out.
> > The time-out is a calculation and it varies from size of table.
> > I checked the source code but I cannot find the calculation of the
> timeout
> > of "30000 ms".
> > When I am lucky, drill can resolve the metadata for 29 files in 30000 ms
> > and it passes.
> >
> > I plan to use drill for production but this bothers me that there is a
> > limit on the number of parquet files and the timeout parameter cannot be
> > tuned.
> >
> > Does anyone have any ideas?
> >
> > Regards,
> > Herman
> > --  SQL BELOW -
> >
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_0_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_10_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_10_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_11_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_11_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_12_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_12_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_13_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_13_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_14_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_15_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_15_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_16_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_16_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_1_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_2_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_3_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_4_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_4_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_5_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_5_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_6_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_6_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_7_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_7_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_8_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_8_1.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_9_0.parquet`;
> > select * from dfs.tmp.`load_pos_sales_detail_tbl/1_9_1.parquet`;
> >
> >
> > On Tue, Oct 2, 2018 at 4:44 PM Herman Tan  wrote:
> >
> > > Hi Divya and everyone,
> > >
> > > The problem has disappeared.
> > > Drill was not restarted.
> > > This appears to be intermittent.
> > > Before I submitted the error report, I ran the script several times and
> > it
> > > failed all the time.
> > > Today I ran it again and it succeeded.
> > > I will restart and test again.
> > >
> > > Regards,

Re: to_date() string to date conversion ERROR

2018-10-03 Thread Herman Tan
By the way,
I was looking for how to set the user.timezone option.
https://mapr.com/support/s/article/Understanding-Drill-s-timestamp-and-timezone?language=en_US
It says to set the sys.boot option user.timezone to UTC, but that option
does not exist anymore.

https://mapr.com/support/s/article/MapR-DB-JSON-application-Error-Illegal-instant-due-to-time-zone-offset-transition?language=en_US
I followed the instructions here.
Add -Duser.timezone=UTC to DRILL_JAVA_OPTS in sqlline.bat
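Concretely, the tweak described above is a one-line change in sqlline.bat. This is a sketch; the variable may already be set elsewhere in your install, in which case append the flag there instead:

```
rem Force the JVM timezone to UTC for the embedded Drill shell
set DRILL_JAVA_OPTS=%DRILL_JAVA_OPTS% -Duser.timezone=UTC
```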


Regards,
Herman



On Wed, Oct 3, 2018 at 11:34 PM Herman Tan  wrote:

> Hi,
>
> I ran SQL
> SELECT TIMEOFDAY() FROM (VALUES(1));
> and I got:
> 2018-10-03 15:27:16.586 Asia/Singapore
>
> I added -Duser.timezone=UTC in DRILL_JAVA_OPTS  in the sqlline.bat
> Restarted drill
>
> I ran SQL again
> SELECT TIMEOFDAY() FROM (VALUES(1));
> and I got:
> 2018-10-03 15:27:16.586 UTC
>
> I ran SQL that failed before:
> select to_date('1982/01/01 00:00:00.0','yyyy/MM/dd
> HH:mm:ss.S') -- fail
> from (values(1))
> and it passes!
>
> Thank you for the help.
>
> Herman
>
>
> On Wed, Oct 3, 2018 at 5:35 PM Vova Vysotskyi  wrote:
>
>> Hello Herman,
>>
>> I tried to reproduce this error, but all queries passed on my machine.
>> Could you please add more details about your env? Which version of Drill
>> is
>> used, which timezone is set?
>> Is it reproduced with UTC timezone?
>>
>> Kind regards,
>> Volodymyr Vysotskyi
>>
>>
>> On Mon, Oct 1, 2018 at 10:58 AM Herman Tan  wrote:
>>
>> > Hi,
>> >
>> > I have a very puzzling error.
>> > Try the following SQL statements.
>> >
>> > What is the problem with '1982/01/01 00:01:00.0'?
>> > Error message: Illegal instant due to time zone offset transition
>> >
>> > select to_date('1981/12/31 00:00:00.0','yyyy/MM/dd
>> > HH:mm:ss.S') -- pass
>> > from (values(1))
>> >
>> > select to_date('1981/12/31 11:59:59.0','yyyy/MM/dd
>> > HH:mm:ss.S') -- pass
>> > from (values(1))
>> >
>> > select to_date('1982/01/01 00:00:00.0','yyyy/MM/dd
>> > HH:mm:ss.S') -- fail
>> > from (values(1))
>> >
>> > select to_date('1982/01/01 00:00:01.0','yyyy/MM/dd
>> > HH:mm:ss.S') -- fail
>> > from (values(1))
>> >
>> > select to_date('1982/01/01 00:01:00.0','yyyy/MM/dd
>> > HH:mm:ss.S') -- fail
>> > from (values(1))
>> >
>> > select to_date('1982/01/01 01:00:00.0','yyyy/MM/dd
>> > HH:mm:ss.S') -- pass
>> > from (values(1))
>> >
>> > select to_date('1982/01/02 00:00:00.0','yyyy/MM/dd
>> > HH:mm:ss.S') -- pass
>> > from (values(1))
>> >
>> > select to_date('1983/01/01 00:00:00.0','yyyy/MM/dd
>> > HH:mm:ss.S') -- pass
>> > from (values(1))
>> >
>> > Herman
>> >
>>
>


Re: to_date() string to date conversion ERROR

2018-10-03 Thread Herman Tan
Hi,

I ran SQL
SELECT TIMEOFDAY() FROM (VALUES(1));
and I got:
2018-10-03 15:27:16.586 Asia/Singapore

I added -Duser.timezone=UTC in DRILL_JAVA_OPTS  in the sqlline.bat
Restarted drill

I ran SQL again
SELECT TIMEOFDAY() FROM (VALUES(1));
and I got:
2018-10-03 15:27:16.586 UTC

I ran SQL that failed before:
select to_date('1982/01/01 00:00:00.0','yyyy/MM/dd
HH:mm:ss.S') -- fail
from (values(1))
and it passes!

Thank you for the help.

Herman


On Wed, Oct 3, 2018 at 5:35 PM Vova Vysotskyi  wrote:

> Hello Herman,
>
> I tried to reproduce this error, but all queries passed on my machine.
> Could you please add more details about your env? Which version of Drill is
> used, which timezone is set?
> Is it reproduced with UTC timezone?
>
> Kind regards,
> Volodymyr Vysotskyi
>
>
> On Mon, Oct 1, 2018 at 10:58 AM Herman Tan  wrote:
>
> > Hi,
> >
> > I have a very puzzling error.
> > Try the following SQL statements.
> >
> > What is the problem with '1982/01/01 00:01:00.0'?
> > Error message: Illegal instant due to time zone offset transition
> >
> > select to_date('1981/12/31 00:00:00.0','yyyy/MM/dd
> > HH:mm:ss.S') -- pass
> > from (values(1))
> >
> > select to_date('1981/12/31 11:59:59.0','yyyy/MM/dd
> > HH:mm:ss.S') -- pass
> > from (values(1))
> >
> > select to_date('1982/01/01 00:00:00.0','yyyy/MM/dd
> > HH:mm:ss.S') -- fail
> > from (values(1))
> >
> > select to_date('1982/01/01 00:00:01.0','yyyy/MM/dd
> > HH:mm:ss.S') -- fail
> > from (values(1))
> >
> > select to_date('1982/01/01 00:01:00.0','yyyy/MM/dd
> > HH:mm:ss.S') -- fail
> > from (values(1))
> >
> > select to_date('1982/01/01 01:00:00.0','yyyy/MM/dd
> > HH:mm:ss.S') -- pass
> > from (values(1))
> >
> > select to_date('1982/01/02 00:00:00.0','yyyy/MM/dd
> > HH:mm:ss.S') -- pass
> > from (values(1))
> >
> > select to_date('1983/01/01 00:00:00.0','yyyy/MM/dd
> > HH:mm:ss.S') -- pass
> > from (values(1))
> >
> > Herman
> >
>
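As background on why exactly those instants fail: Singapore moved from UTC+07:30 to UTC+08:00 at midnight on 1982-01-01, so local times in the first half hour of that day never existed, and "Illegal instant due to time zone offset transition" is the wording Joda-Time uses for such a gap. A small java.time sketch (illustrative only, not Drill code) shows the gap:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

public class SingaporeGap {
    public static void main(String[] args) {
        ZoneId sg = ZoneId.of("Asia/Singapore");
        // Local times in [00:00, 00:30) on 1982-01-01 fall inside the
        // +07:30 -> +08:00 offset gap and never existed as wall-clock times,
        // so the list of valid offsets is empty for them.
        System.out.println(sg.getRules()
                .getValidOffsets(LocalDateTime.of(1982, 1, 1, 0, 0)));  // []
        System.out.println(sg.getRules()
                .getValidOffsets(LocalDateTime.of(1982, 1, 1, 1, 0)));  // [+08:00]
    }
}
```

Running with -Duser.timezone=UTC sidesteps the issue because UTC has no such transitions.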


Re: ERROR is reading parquet data after create table

2018-10-02 Thread Herman Tan
Hi,

I found the hard coded parameter in the source code

https://github.com/apache/drill/blob/8edeb49873d1a1710cfe28e0b49364d07eb1aef4/exec/java-exec/src/main/java/org/apache/drill/exec/store/TimedCallable.java

LINE 52   : private static long TIMEOUT_PER_RUNNABLE_IN_MSECS = 15000;
LINE 210 :timeout = TIMEOUT_PER_RUNNABLE_IN_MSECS * ((tasks.size() -
1)/parallelism + 1);

parallelism param is also hardcoded
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/metadata/Metadata.java
LINE 343: 16

In my case: task size is 29, 15000 * ((29-1)/16 + 1) = 15000 * 2 = 30000.
16 runnables execute round robin on 29 tasks.  Each runnable is given 30000
ms to time out.

This is the error message:
Waited for 30000 ms, but only 11 tasks for 'Fetch parquet metadata' are
complete. Total number of tasks 29, parallelism 16.

TimedCallable.java:
LINE 248:final String errMsg = String.format("Waited for %d ms, but
only %d tasks for '%s' are complete." +
" Total number of tasks %d, parallelism %d.", timeout,
futureMapper.count, activity, tasks.size(), parallelism);

Shouldn't these be parameterized in "options" based on the infrastructure?
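To make the arithmetic concrete, here is a standalone sketch of the timeout formula quoted above. It is not Drill code; the constant and the formula are copied from the snippets, while the class and method names are mine. Note the integer division:

```java
public class MetadataTimeout {
    // Hard-coded value quoted from TimedCallable.java above
    static final long TIMEOUT_PER_RUNNABLE_IN_MSECS = 15000;

    static long timeoutMs(int tasks, int parallelism) {
        // Integer division: (29 - 1) / 16 == 1, so the multiplier is 2
        return TIMEOUT_PER_RUNNABLE_IN_MSECS * ((tasks - 1) / parallelism + 1);
    }

    public static void main(String[] args) {
        // 29 tasks at parallelism 16 -> 30000 ms total budget
        System.out.println(timeoutMs(29, 16)); // prints 30000
    }
}
```

The budget grows only in whole 15000 ms steps each time the task count crosses a multiple of the parallelism, which is why a folder with a few more files can suddenly hit the limit.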

Regards,
Herman




On Tue, Oct 2, 2018 at 6:47 PM Herman Tan  wrote:

> Hi,
>
> I have restarted drill and run the script again.
>
> select * from dfs.tmp.`load_pos_sales_detail_tbl`;
> -- SQL Error: RESOURCE ERROR: Waited for 30000 ms, but only 11 tasks for
> 'Fetch parquet metadata' are complete. Total number of tasks 29,
> parallelism 16.
>
> The 29 tasks is related to the 29 parquet files in the folder.
> To check if any of the parquet files has an error, I ran the following SQL
> on each parquet file in the folder.  ALL PASSED. (SQL Below).
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_0_0.parquet`;
>
> So it seems that for this table drill can only get the metadata for 11
> parquet files before it times out.
> The time-out is a calculation and it varies from size of table.
> I checked the source code but I cannot find the calculation of the timeout
> of "30000 ms".
> When I am lucky, drill can resolve the metadata for 29 files in 30000 ms
> and it passes.
>
> I plan to use drill for production but this bothers me that there is a
> limit on the number of parquet files and the timeout parameter cannot be
> tuned.
>
> Does anyone have any ideas?
>
> Regards,
> Herman
> --  SQL BELOW -
>
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_0_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_10_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_10_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_11_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_11_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_12_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_12_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_13_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_13_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_14_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_15_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_15_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_16_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_16_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_1_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_2_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_3_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_4_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_4_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_5_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_5_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_6_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_6_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_7_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_7_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_8_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_8_1.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_9_0.parquet`;
> select * from dfs.tmp.`load_pos_sales_detail_tbl/1_9_1.parquet`;
>
>
> On Tue, Oct 2, 2018 at 4:44 PM Herman Tan  wrote:
>
>> Hi Divya and everyone,
>>
>> The problem has disappeared.
>> Drill was not restarted.
>> This appears to be intermittent.
>> Before I submitted the error report, I ran the script several times and
>> it failed all the time.
>> Today I ran it again and it succeeded.

Re: ERROR is reading parquet data after create table

2018-10-02 Thread Herman Tan
Hi,

I have restarted drill and run the script again.

select * from dfs.tmp.`load_pos_sales_detail_tbl`;
-- SQL Error: RESOURCE ERROR: Waited for 30000 ms, but only 11 tasks for
'Fetch parquet metadata' are complete. Total number of tasks 29,
parallelism 16.

The 29 tasks is related to the 29 parquet files in the folder.
To check if any of the parquet files has an error, I ran the following SQL
on each parquet file in the folder.  ALL PASSED. (SQL Below).
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_0_0.parquet`;

So it seems that for this table drill can only get the metadata for 11
parquet files before it times out.
The time-out is a calculation and it varies from size of table.
I checked the source code but I cannot find the calculation of the timeout
of "30000 ms".
When I am lucky, drill can resolve the metadata for 29 files in 30000 ms
and it passes.

I plan to use Drill in production, but it bothers me that there is a limit on
the number of parquet files and that the timeout parameter cannot be tuned.
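The error message describes a bounded wait over a pool of metadata-fetch tasks. A rough Python sketch of that behavior (illustrative only; this is not Drill's actual implementation, and the stand-in workload is a sleep):

```python
import concurrent.futures
import time

def wait_for_metadata(paths, parallelism=16, timeout_s=5.0):
    """Submit one 'fetch parquet metadata' task per file and wait up to
    timeout_s, mirroring the bounded wait Drill's error message reports."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallelism) as pool:
        # Stand-in workload: each task would read one parquet footer.
        futures = [pool.submit(time.sleep, 0.001) for _ in paths]
        # Snapshot of completed tasks at the moment the wait expires.
        done, not_done = concurrent.futures.wait(futures, timeout=timeout_s)
    if not_done:
        raise TimeoutError(
            f"Waited for {timeout_s * 1000:.0f} ms, but only {len(done)} tasks "
            f"for 'Fetch parquet metadata' are complete. Total number of tasks "
            f"{len(futures)}, parallelism {parallelism}.")
    return len(done)

files = [f"1_{i}_0.parquet" for i in range(29)]
print(wait_for_metadata(files))  # 29 when every footer is read within the timeout
```

A shorter `timeout_s`, or slower per-file work, reproduces the reported failure shape: a fixed wait expiring with some tasks still pending.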

Does anyone have any ideas?

Regards,
Herman
--  SQL BELOW -

select * from dfs.tmp.`load_pos_sales_detail_tbl/1_0_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_10_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_10_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_11_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_11_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_12_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_12_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_13_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_13_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_14_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_15_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_15_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_16_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_16_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_1_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_2_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_3_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_4_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_4_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_5_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_5_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_6_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_6_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_7_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_7_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_8_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_8_1.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_9_0.parquet`;
select * from dfs.tmp.`load_pos_sales_detail_tbl/1_9_1.parquet`;
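The per-file checks above can be generated instead of typed by hand; a small sketch (the folder path is illustrative and must point at wherever the dfs.tmp workspace keeps the table on disk):

```python
from pathlib import Path

def validation_queries(table_dir):
    """One SELECT per parquet file, for checking the files one at a time."""
    return [
        f"select * from dfs.tmp.`load_pos_sales_detail_tbl/{p.name}`;"
        for p in sorted(Path(table_dir).glob("*.parquet"))
    ]

# Illustrative location of the CTAS output folder.
for q in validation_queries("/tmp/load_pos_sales_detail_tbl"):
    print(q)
```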


On Tue, Oct 2, 2018 at 4:44 PM Herman Tan  wrote:

> Hi Divya and everyone,
>
> The problem has disappeared.
> Drill was not restarted.
> This appears to be intermittent.
> Before I submitted the error report, I ran the script several times and it
> failed all the time.
> Today I ran it again and it succeeded.
> I will restart and test again.
>
> Regards,
> Herman
>
>
>
> On Thu, Sep 27, 2018 at 11:50 AM Divya Gehlot 
> wrote:
>
>> Hi Herman,
>> Just to ensure that your parquet file format is not corrupted, can you
>> please query a single folder, like just 2001, or some of the files underneath
>> it, instead of querying the whole data set at once?
>>
>> Thanks,
>> Divya
>>
>> On Wed, 26 Sep 2018 at 15:35, Herman Tan  wrote:
>>
>> > Hi Kunal,
>> >
>> > 
>> > That said, could you provide some details about the parquet data you've
>> > created, like the schema, parquet version and the tool used to generate.
>> > Usually, the schema (and meta) provides most of these details for any
>> > parquet file.
>> > 
>> >
>> > 1. The schema is under dfs.tmp, the queries to generate are all
>> documented
>> > below.
>> > 2. I don't know how to find the parquet version of the data file.
>> > 3. The tool used to generate the parquet files is Apache Drill.  The CTAS
>> > is detailed below.
>> >
>> > Regards,
>> > Herman
>> > 
>> >
>> > *This is the Text data*
>> >
>> > This is the folder layout of the files.
>> > Total # of lines: about 50 million rows.
>> > --
>> > show files from dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_
>> > 20180825`
>> > ;
>> > show files

Re: ERROR is reading parquet data after create table

2018-10-02 Thread Herman Tan
Hi Divya and everyone,

The problem has disappeared.
Drill was not restarted.
This appears to be intermittent.
Before I submitted the error report, I ran the script several times and it
failed all the time.
Today I ran it again and it succeeded.
I will restart and test again.

Regards,
Herman



On Thu, Sep 27, 2018 at 11:50 AM Divya Gehlot 
wrote:

> Hi Herman,
> Just to ensure that your parquet file format is not corrupted, can you
> please query a single folder, like just 2001, or some of the files underneath
> it, instead of querying the whole data set at once?
>
> Thanks,
> Divya
>
> On Wed, 26 Sep 2018 at 15:35, Herman Tan  wrote:
>
> > Hi Kunal,
> >
> > 
> > That said, could you provide some details about the parquet data you've
> > created, like the schema, parquet version and the tool used to generate.
> > Usually, the schema (and meta) provides most of these details for any
> > parquet file.
> > 
> >
> > 1. The schema is under dfs.tmp, the queries to generate are all
> documented
> > below.
> > 2. I don't know how to find the parquet version of the data file.
> > 3. The tool used to generate the parquet files is Apache Drill.  The CTAS
> > is detailed below.
> >
> > Regards,
> > Herman
> > 
> >
> > *This is the Text data*
> >
> > This is the folder layout of the files.
> > Total # of lines: about 50 million rows.
> > --
> > show files from dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_
> > 20180825`
> > ;
> > show files from dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_
> > 20180825\2011`
> > ;
> > -
> > sales_pos_detail
> >   \pos_details_20180825
> > \2007
> > \2008
> > \2009
> > \2010
> > \2011
> >   \pos_details_0.csv
> >   \pos_details_1.csv
> >   \pos_details_2.csv
> >   \pos_details_3.csv
> >   \pos_details_4.csv
> >   \pos_details_5.csv
> >   \pos_details_6.csv
> >   \pos_details_7.csv
> >   \pos_details_8.csv
> > \2012
> > \2013
> > \2014
> > \2015
> > \2016
> > \2017
> > \2018
> > \others
> > -
> >
> > *This is the view with the metadata defined:*
> >
> > create or replace view dfs.tmp.load_pos_sales_detail as
> > SELECT
> > -- dimension keys
> >  cast(dim_date_key as int) dim_date_key
> > ,cast(dim_site_key as int) dim_site_key
> > ,cast(dim_pos_header_key as bigint) dim_pos_header_key
> > ,cast(dim_pos_cashier_key as int) dim_pos_cashier_key
> > ,cast(dim_card_number_key as int) dim_card_number_key
> > ,cast(dim_hour_minute_key as int) dim_hour_minute_key
> > ,cast(dim_pos_clerk_key as int) dim_pos_clerk_key
> > ,cast(dim_product_key as int) dim_product_key
> > ,cast(dim_pos_employee_purchase_key as int) dim_pos_employee_purchase_key
> > ,cast(dim_pos_terminal_key as int) dim_pos_terminal_key
> > ,cast(dim_campaign_key as int) dim_campaign_key
> > ,cast(dim_promo_key as int) dim_promo_key
> > ,cast( case when dim_site_lfl_key = '' then 0 else dim_site_lfl_key end
> as
> > int) dim_site_lfl_key
> > -- derived from keys
> > ,dim_date_str
> > ,`year` as `trx_year`
> > -- Measures
> > ,Product_Sales_Qty
> > ,Product_Sales_Price
> > ,Product_Cost_Price
> > ,Product_Cost_Amt
> > ,Product_Sales_Gross_Amt
> > ,Product_Sales_Promo_Disc_Amt
> > ,Product_Sales_Add_Promo_Disc_Amt
> > ,Product_Sales_Total_Promo_Disc_Amt
> > ,Product_Sales_Retail_Promo_Amt
> > ,Product_Sales_Retail_Amt
> > ,Product_Sales_VAT_Amt
> > ,Product_Sales_Product_Margin_Amt
> > ,Product_Sales_Initial_Margin_Amt
> > from dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_20180825`
> > ;
> >
> >
> > *This is the CTAS that generates the parquet from the view above:*
> >
> > drop table if exists dfs.tmp.load_pos_sales_detail_tbl
> > ;
> >
> > create table dfs.tmp.load_pos_sales_detail_tbl AS
> > SELECT
> > -- dimension keys
> >  dim_date_key
> > ,dim_site_key
> > ,dim_pos_header_key
> > ,dim_pos_cashier_key
> > ,dim_card_number_key
> > ,dim_hour_minute_key
> > ,dim_pos_clerk_key
> > ,dim_product_key
> > ,dim_pos_employee_purchase_key
> > ,dim_pos_terminal_key
> > ,dim_campaign_key
> > ,dim_promo_key
> > ,dim_site_lfl_key
> > -- derived from keys
> > ,dim_date_str
> > ,`trx_year`
> > -- Measures
> > ,Product_Sales_Qty Sales_Qty
> > ,Product_Sales_Price Sales_Price
> 

to_date() string to date conversion ERROR

2018-10-01 Thread Herman Tan
Hi,

I have a very puzzling error.
Try the following SQL statements.

What is the problem with '1982/01/01 00:01:00.0'?
Error message: Illegal instant due to time zone offset transition

select to_date('1981/12/31 00:00:00.0','yyyy/MM/dd HH:mm:ss.S') -- pass
from (values(1))

select to_date('1981/12/31 11:59:59.0','yyyy/MM/dd HH:mm:ss.S') -- pass
from (values(1))

select to_date('1982/01/01 00:00:00.0','yyyy/MM/dd HH:mm:ss.S') -- fail
from (values(1))

select to_date('1982/01/01 00:00:01.0','yyyy/MM/dd HH:mm:ss.S') -- fail
from (values(1))

select to_date('1982/01/01 00:01:00.0','yyyy/MM/dd HH:mm:ss.S') -- fail
from (values(1))

select to_date('1982/01/01 01:00:00.0','yyyy/MM/dd HH:mm:ss.S') -- pass
from (values(1))

select to_date('1982/01/02 00:00:00.0','yyyy/MM/dd HH:mm:ss.S') -- pass
from (values(1))

select to_date('1983/01/01 00:00:00.0','yyyy/MM/dd HH:mm:ss.S') -- pass
from (values(1))
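The pass/fail pattern points at a local-time gap at midnight on 1982-01-01. Singapore moved from UTC+7:30 to UTC+8:00 at that instant, so wall-clock times from 00:00:00 to 00:29:59 that day never existed; "Illegal instant due to time zone offset transition" is the error Joda-Time (which Drill uses) raises for such gap times. A small Python sketch of the gap check (the zone is an assumption about the Drillbit's default time zone):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

TZ = ZoneInfo("Asia/Singapore")  # assumed Drillbit default time zone

def is_nonexistent(naive: datetime) -> bool:
    """True if this wall-clock time falls in an offset-transition gap:
    round-tripping through UTC lands on a different wall time."""
    local = naive.replace(tzinfo=TZ)
    roundtrip = local.astimezone(ZoneInfo("UTC")).astimezone(TZ)
    return roundtrip.replace(tzinfo=None) != naive

print(is_nonexistent(datetime(1981, 12, 31, 23, 59)))  # False: exists
print(is_nonexistent(datetime(1982, 1, 1, 0, 1)))      # True: inside the 30-minute gap
print(is_nonexistent(datetime(1982, 1, 1, 1, 0)))      # False: after the gap
```

If this is the cause, a common workaround is to run the Drillbit JVM in a gap-free zone, e.g. with -Duser.timezone=UTC.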

Herman


Re: ERROR is reading parquet data after create table

2018-09-26 Thread Herman Tan
Hi Kunal,


That said, could you provide some details about the parquet data you've
created, like the schema, parquet version and the tool used to generate.
Usually, the schema (and meta) provides most of these details for any
parquet file.


1. The schema is under dfs.tmp, the queries to generate are all documented
below.
2. I don't know how to find the parquet version of the data file.
3. The tool used to generate the parquet files is Apache Drill.  The CTAS is
detailed below.

Regards,
Herman


*This is the Text data*

This is the folder layout of the files.
Total # of lines: about 50 million rows.
--
show files from dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_
20180825`
;
show files from dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_
20180825\2011`
;
-
sales_pos_detail
  \pos_details_20180825
\2007
\2008
\2009
\2010
\2011
  \pos_details_0.csv
  \pos_details_1.csv
  \pos_details_2.csv
  \pos_details_3.csv
  \pos_details_4.csv
  \pos_details_5.csv
  \pos_details_6.csv
  \pos_details_7.csv
  \pos_details_8.csv
\2012
\2013
\2014
\2015
\2016
\2017
\2018
\others
-

*This is the view with the metadata defined:*

create or replace view dfs.tmp.load_pos_sales_detail as
SELECT
-- dimension keys
 cast(dim_date_key as int) dim_date_key
,cast(dim_site_key as int) dim_site_key
,cast(dim_pos_header_key as bigint) dim_pos_header_key
,cast(dim_pos_cashier_key as int) dim_pos_cashier_key
,cast(dim_card_number_key as int) dim_card_number_key
,cast(dim_hour_minute_key as int) dim_hour_minute_key
,cast(dim_pos_clerk_key as int) dim_pos_clerk_key
,cast(dim_product_key as int) dim_product_key
,cast(dim_pos_employee_purchase_key as int) dim_pos_employee_purchase_key
,cast(dim_pos_terminal_key as int) dim_pos_terminal_key
,cast(dim_campaign_key as int) dim_campaign_key
,cast(dim_promo_key as int) dim_promo_key
,cast( case when dim_site_lfl_key = '' then 0 else dim_site_lfl_key end as
int) dim_site_lfl_key
-- derived from keys
,dim_date_str
,`year` as `trx_year`
-- Measures
,Product_Sales_Qty
,Product_Sales_Price
,Product_Cost_Price
,Product_Cost_Amt
,Product_Sales_Gross_Amt
,Product_Sales_Promo_Disc_Amt
,Product_Sales_Add_Promo_Disc_Amt
,Product_Sales_Total_Promo_Disc_Amt
,Product_Sales_Retail_Promo_Amt
,Product_Sales_Retail_Amt
,Product_Sales_VAT_Amt
,Product_Sales_Product_Margin_Amt
,Product_Sales_Initial_Margin_Amt
from dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_20180825`
;


*This is the CTAS that generates the parquet from the view above:*

drop table if exists dfs.tmp.load_pos_sales_detail_tbl
;

create table dfs.tmp.load_pos_sales_detail_tbl AS
SELECT
-- dimension keys
 dim_date_key
,dim_site_key
,dim_pos_header_key
,dim_pos_cashier_key
,dim_card_number_key
,dim_hour_minute_key
,dim_pos_clerk_key
,dim_product_key
,dim_pos_employee_purchase_key
,dim_pos_terminal_key
,dim_campaign_key
,dim_promo_key
,dim_site_lfl_key
-- derived from keys
,dim_date_str
,`trx_year`
-- Measures
,Product_Sales_Qty Sales_Qty
,Product_Sales_Price Sales_Price
,Product_Cost_Price Cost_Price
,Product_Cost_Amt Cost_Amt
,Product_Sales_Gross_Amt Sales_Gross_Amt
,Product_Sales_Promo_Disc_Amt Sales_Promo_Disc_Amt
,Product_Sales_Add_Promo_Disc_Amt Add_Promo_Disc_Amt
,Product_Sales_Total_Promo_Disc_Amt Total_Promo_Disc_Amt
,Product_Sales_Retail_Promo_Amt Retail_Promo_Amt
,Product_Sales_Retail_Amt Retail_Amt
,Product_Sales_VAT_Amt VAT_Amt
,Product_Sales_Product_Margin_Amt Product_Margin_Amt
,Product_Sales_Initial_Margin_Amt Initial_Margin_Amt
from dfs.tmp.load_pos_sales_detail
;


*This is the select query that generated the error:*

select *
from dfs.tmp.load_pos_sales_detail_tbl
;

- ERROR 

SQL Error: RESOURCE ERROR: Waited for 3 ms, but only 10 tasks for
'Fetch parquet metadata' are complete. Total number of tasks 29,
parallelism 16.


On Mon, Sep 24, 2018 at 9:08 AM, Kunal Khatua  wrote:

> Hi Herman
>
> Assuming that you're doing analytics on your data, the parquet format is the
> way to go.
>
> That said, could you provide some details about the parquet data you've
> created, like the schema, parquet version and the tool used to generate.
> Usually, the schema (and meta) provides most of these details for any
> parquet file.
>
> It'll be useful to know whether there is a pattern to the failures, since
> that might point to corruption occurring.
>
> Kunal
>
>
> On 9/22/2018 11:49:36 PM, Herman Tan  wrote:
> Hi Karthik,
>
> Thank you for pointing me to the mail archive in May 2018.
> That is exactly the same problem I am facing.
>
> I thought of using Drill as an ETL where I load the warehouse parquet
> tables from text source files.
> Then I query the parquet tables.
It works on some parquet tables, but I am having problems with large ones that
consist of several files (I think).
> Still investigating.

Re: ERROR is reading parquet data after create table

2018-09-23 Thread Herman Tan
Hi Karthik,

Thank you for pointing me to the mail archive in May 2018.
That is exactly the same problem I am facing.

I thought of using Drill as an ETL where I load the warehouse parquet
tables from text source files.
Then I query the parquet tables.
It works on some parquet tables, but I am having problems with large ones that
consist of several files (I think).
Still investigating.
Anyone in the community have other experience?
Should I work with all text files instead of parquet?


Herman


On Fri, Sep 21, 2018 at 2:15 AM, Karthikeyan Manivannan <
kmanivan...@mapr.com> wrote:

> Hi Herman,
>
> I am not sure what the exact problem here is but can you check to see if
> you are not hitting the problem described here:
>
> http://mail-archives.apache.org/mod_mbox/drill-user/201805.mbox/%
> 3ccacwrgnexlxop2vcyugsa4gwd1jgs8f+rcpzq8rhuatfw5fm...@mail.gmail.com%3E
>
> Thanks
>
> Karthik
>
> On Wed, Sep 19, 2018 at 7:02 PM Herman Tan  wrote:
>
> > Hi,
> >
> > I encountered the following error.
> > The steps I did are as follows:
> > 1. Create a view to fix the data types of fields with cast
> > 2. Create the table (parquet) using the view
> > 3. Query select * from the table (querying a single field also does not work)
> >
> > The error:
> > SQL Error: RESOURCE ERROR: Waited for 3 ms, but only 10 tasks for
> > 'Fetch parquet metadata' are complete. Total number of tasks 29,
> > parallelism 16.
> >
> > When I re-run this, the number of tasks will vary.
> >
> > What could be the problem?
> >
> > Regards,
> > Herman Tan
> >
> > More info below:
> >
> > This is the folder layout of the files.
> > Total # of lines: 50 million.
> > --
> > show files from
> > dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_20180825`
> > ;
> > show files from
> > dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_20180825\2011`
> > ;
> > -
> > sales_pos_detail
> >   \pos_details_20180825
> > \2007
> > \2008
> > \2009
> > \2010
> > \2011
> >   \pos_details_0.csv
> >   \pos_details_1.csv
> >   \pos_details_2.csv
> >   \pos_details_3.csv
> >   \pos_details_4.csv
> >   \pos_details_5.csv
> >   \pos_details_6.csv
> >   \pos_details_7.csv
> >   \pos_details_8.csv
> > \2012
> > \2013
> > \2014
> > \2015
> > \2016
> > \2017
> > \2018
> > \others
> > -
> >
> > create or replace view dfs.tmp.load_pos_sales_detail as
> > SELECT
> > -- dimension keys
> >  cast(dim_date_key as int) dim_date_key
> > ,cast(dim_site_key as int) dim_site_key
> > ,cast(dim_pos_header_key as bigint) dim_pos_header_key
> > ,cast(dim_pos_cashier_key as int) dim_pos_cashier_key
> > ,cast(dim_card_number_key as int) dim_card_number_key
> > ,cast(dim_hour_minute_key as int) dim_hour_minute_key
> > ,cast(dim_pos_clerk_key as int) dim_pos_clerk_key
> > ,cast(dim_product_key as int) dim_product_key
> > ,cast(dim_pos_employee_purchase_key as int)
> dim_pos_employee_purchase_key
> > ,cast(dim_pos_terminal_key as int) dim_pos_terminal_key
> > ,cast(dim_campaign_key as int) dim_campaign_key
> > ,cast(dim_promo_key as int) dim_promo_key
> > ,cast( case when dim_site_lfl_key = '' then 0 else dim_site_lfl_key end
> as
> > int) dim_site_lfl_key
> > -- derived from keys
> > ,dim_date_str
> > ,`year` as `trx_year`
> > -- Measures
> > ,Product_Sales_Qty
> > ,Product_Sales_Price
> > ,Product_Cost_Price
> > ,Product_Cost_Amt
> > ,Product_Sales_Gross_Amt
> > ,Product_Sales_Promo_Disc_Amt
> > ,Product_Sales_Add_Promo_Disc_Amt
> > ,Product_Sales_Total_Promo_Disc_Amt
> > ,Product_Sales_Retail_Promo_Amt
> > ,Product_Sales_Retail_Amt
> > ,Product_Sales_VAT_Amt
> > ,Product_Sales_Product_Margin_Amt
> > ,Product_Sales_Initial_Margin_Amt
> > from dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_20180825`
> > ;
> >
> > drop table if exists dfs.tmp.load_pos_sales_detail_tbl
> > ;
> >
> > create table dfs.tmp.load_pos_sales_detail_tbl AS
> > SELECT
> > -- dimension keys
> >  dim_date_key
> > ,dim_site_key
> > ,dim_pos_header_key
> > ,dim_pos_cashier_key
> > ,dim_card_number_key
> > ,dim_hour_minute_key
> > ,dim_pos_clerk_key
> > ,dim_product_key
> > ,dim_pos_employee_purchase_key
> > ,dim_pos_terminal_key
> > ,dim_campaign_key
> > ,dim_promo_key
> > ,dim_site_lfl_key
> > -- derived from keys

ERROR is reading parquet data after create table

2018-09-19 Thread Herman Tan
Hi,

I encountered the following error.
The steps I did are as follows:
1. Create a view to fix the data types of fields with cast
2. Create the table (parquet) using the view
3. Query select * from the table (querying a single field also does not work)

The error:
SQL Error: RESOURCE ERROR: Waited for 3 ms, but only 10 tasks for
'Fetch parquet metadata' are complete. Total number of tasks 29,
parallelism 16.

When I re-run this, the number of tasks will vary.

What could be the problem?

Regards,
Herman Tan

More info below:

This is the folder layout of the files.
Total # of lines: 50 million.
--
show files from
dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_20180825`
;
show files from
dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_20180825\2011`
;
-
sales_pos_detail
  \pos_details_20180825
\2007
\2008
\2009
\2010
\2011
  \pos_details_0.csv
  \pos_details_1.csv
  \pos_details_2.csv
  \pos_details_3.csv
  \pos_details_4.csv
  \pos_details_5.csv
  \pos_details_6.csv
  \pos_details_7.csv
  \pos_details_8.csv
\2012
\2013
\2014
\2015
\2016
\2017
\2018
\others
-

create or replace view dfs.tmp.load_pos_sales_detail as
SELECT
-- dimension keys
 cast(dim_date_key as int) dim_date_key
,cast(dim_site_key as int) dim_site_key
,cast(dim_pos_header_key as bigint) dim_pos_header_key
,cast(dim_pos_cashier_key as int) dim_pos_cashier_key
,cast(dim_card_number_key as int) dim_card_number_key
,cast(dim_hour_minute_key as int) dim_hour_minute_key
,cast(dim_pos_clerk_key as int) dim_pos_clerk_key
,cast(dim_product_key as int) dim_product_key
,cast(dim_pos_employee_purchase_key as int) dim_pos_employee_purchase_key
,cast(dim_pos_terminal_key as int) dim_pos_terminal_key
,cast(dim_campaign_key as int) dim_campaign_key
,cast(dim_promo_key as int) dim_promo_key
,cast( case when dim_site_lfl_key = '' then 0 else dim_site_lfl_key end as
int) dim_site_lfl_key
-- derived from keys
,dim_date_str
,`year` as `trx_year`
-- Measures
,Product_Sales_Qty
,Product_Sales_Price
,Product_Cost_Price
,Product_Cost_Amt
,Product_Sales_Gross_Amt
,Product_Sales_Promo_Disc_Amt
,Product_Sales_Add_Promo_Disc_Amt
,Product_Sales_Total_Promo_Disc_Amt
,Product_Sales_Retail_Promo_Amt
,Product_Sales_Retail_Amt
,Product_Sales_VAT_Amt
,Product_Sales_Product_Margin_Amt
,Product_Sales_Initial_Margin_Amt
from dfs.`D:\retail_sandbox\pos\sales_pos_detail\pos_details_20180825`
;

drop table if exists dfs.tmp.load_pos_sales_detail_tbl
;

create table dfs.tmp.load_pos_sales_detail_tbl AS
SELECT
-- dimension keys
 dim_date_key
,dim_site_key
,dim_pos_header_key
,dim_pos_cashier_key
,dim_card_number_key
,dim_hour_minute_key
,dim_pos_clerk_key
,dim_product_key
,dim_pos_employee_purchase_key
,dim_pos_terminal_key
,dim_campaign_key
,dim_promo_key
,dim_site_lfl_key
-- derived from keys
,dim_date_str
,`trx_year`
-- Measures
,Product_Sales_Qty Sales_Qty
,Product_Sales_Price Sales_Price
,Product_Cost_Price Cost_Price
,Product_Cost_Amt Cost_Amt
,Product_Sales_Gross_Amt Sales_Gross_Amt
,Product_Sales_Promo_Disc_Amt Sales_Promo_Disc_Amt
,Product_Sales_Add_Promo_Disc_Amt Add_Promo_Disc_Amt
,Product_Sales_Total_Promo_Disc_Amt Total_Promo_Disc_Amt
,Product_Sales_Retail_Promo_Amt Retail_Promo_Amt
,Product_Sales_Retail_Amt Retail_Amt
,Product_Sales_VAT_Amt VAT_Amt
,Product_Sales_Product_Margin_Amt Product_Margin_Amt
,Product_Sales_Initial_Margin_Amt Initial_Margin_Amt
from dfs.tmp.load_pos_sales_detail
;

select *
from dfs.tmp.load_pos_sales_detail_tbl
;

- ERROR 

SQL Error: RESOURCE ERROR: Waited for 3 ms, but only 10 tasks for
'Fetch parquet metadata' are complete. Total number of tasks 29,
parallelism 16.


[Error Id: 3b079174-f5d0-4313-8097-25a0b3070854 on
IORA-G9KY9P2.stf.nus.edu.sg:31010]
  RESOURCE ERROR: Waited for 3 ms, but only 10 tasks for 'Fetch parquet
metadata' are complete. Total number of tasks 29, parallelism 16.


[Error Id: 3b079174-f5d0-4313-8097-25a0b3070854 on
IORA-G9KY9P2.stf.nus.edu.sg:31010]
RESOURCE ERROR: Waited for 3 ms, but only 10 tasks for 'Fetch
parquet metadata' are complete. Total number of tasks 29, parallelism 16.


[Error Id: 3b079174-f5d0-4313-8097-25a0b3070854 on
IORA-G9KY9P2.stf.nus.edu.sg:31010]
RESOURCE ERROR: Waited for 3 ms, but only 10 tasks for 'Fetch
parquet metadata' are complete. Total number of tasks 29, parallelism 16.


[Error Id: 3b079174-f5d0-4313-8097-25a0b3070854 on
IORA-G9KY9P2.stf.nus.edu.sg:31010]


From the Drill log:

2018-09-20 08:58:12,035 [245d0f5a-ae5f-bfa2-ff04-40f7bdd1c2bf:foreman]
INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
245d0f5a-ae5f-bfa2-ff04-40f7bdd1c2bf: select *
from dfs.tmp.load_pos_sales_detail_tbl

2018-09-20 08:58:53,068 [245d0f5a-ae5f-bfa2-ff04-40f7bdd1c2bf:foreman]
ERROR o.a.d.e.s.parquet.metadata.Metadata - Waited for 3 ms, but only
10 tasks for 'Fetch parquet metadata' are complete. Total number of tasks
29, parallelism 16.