Re: append data to already existing table saved in parquet format

2017-07-26 Thread Divya Gehlot
Hi Paul,
Let me try your approach of CTAS and saving to a partition directory
structure.

Thanks for the suggestion.

Thanks,
Divya

On 27 July 2017 at 11:57, Paul Rogers  wrote:

> Hi All,
>
> Saurabh, you are right. But, since Parquet does not allow appending to
> existing files, we have to do the logical equivalent which is to create a
> new Parquet file. For it to be part of the same “table” it must be part of
> an existing partition structure as Divya described.
>
> The trick here is to choose a proper time grain. Too small and you end up
> with a very large number of files, and performance will suffer. (Once a
> second, for example, is too frequent.) Too large and people don’t get
> near-real-time results. But with the hourly grain Divya is using, the
> number of files per directory will not be too large, and each file might be
> of a reasonable size.
>
> Using Kafka to batch data would be a fine idea.
>
> Of course, this is still not as good as Saurabh's former project, Druid,
> which builds aggregated cubes on the fly and has a lambda architecture to
> allow querying both immediate and historical data. Still, Divya’s design
> can work fine for some use cases when latency is not an issue and data
> volume is reasonable.
>
> It would help if Drill had INSERT INTO support. But, I wonder, can it be
> made to work with Drill today? Perhaps the query can simply include the
> proper target directory in the CTAS statement. That is, data for 2017-07-25
> 02:00 would go into “2017/07/25/0200.parquet”, say. That is, do-it-yourself
> partitioning. I hope Drill won’t care how the Parquet files got into the
> directories, only that the directories have the expected structure. (Is
> this accurate? Haven’t tried it myself…)
>
> With single-threaded, hourly updates, there is no worry about the name
> collisions and other tricky issues that INSERT INTO will have to solve.
>
> Divya, have you tried this solution?
>
> Thanks,
>
> - Paul
>
> > On Jul 26, 2017, at 7:32 PM, Saurabh Mahapatra <
> saurabhmahapatr...@gmail.com> wrote:
> >
> > But append-only means you are adding event records to a table (forget the
> > layout for a while). That means you have to write to the end of the table.
> > If the writes are too many, you have to batch them and then convert them
> > into a columnar format.
> >
> > This to me sounds like a Kafka workflow where you keep ingesting event
> > data, then batch process it (or stream process it). Writing or appending to
> > a columnar store when your data is in a row-like format does not sound
> > efficient at all. I have not seen such a design in systems that actually
> > work. I know there are query engines that try to do that, but their use is
> > limited. You cannot scale.
> >
> > I always think of Parquet or a columnar data store as the repository of
> historical data that came from the OLTP world. You do not want to touch it
> once you created it. You want to have a strategy where you batch the recent
> data, create the historical data and move on.
> >
> > My 2 cents.
> >
> > Saurabh
> >
> > On Jul 26, 2017, at 6:58 PM, Divya Gehlot 
> wrote:
> >
> > Yes Paul, I am looking for the insert-into-partition feature.
> > That way we just have to create the file for that particular partition
> > when new data comes in, or update it if required.
> > Otherwise, every time data comes in we have to run the view and recreate
> > the Parquet files for the whole data set, which is very time consuming,
> > especially when the data is being visualized in a real-time dashboard.
> >
> > Thanks,
> > Divya
> >
> >> On 27 July 2017 at 08:40, Paul Rogers  wrote:
> >>
> >> Hi Divya,
> >>
> >> Seems that you are asking for an “INSERT INTO” feature (DRILL-3534). The
> >> idea would be to create new Parquet files into an existing partition
> >> structure. That feature has not yet been started. So, the workarounds
> >> provided might help you for now.
> >>
> >> - Paul
> >>
> >>> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <
> >> saurabhmahapatr...@gmail.com> wrote:
> >>>
> >>> Does Drill provide that kind of functionality? Theoretically yes. CTAS
> >>> should work. But your cluster has to be sized. But I would never put
> >>> something in such a pipeline without adequate testing. And I would
> always
> >>> consider a lambda architecture to ensure that if this path were to fail
> >>> (with Drill or any other combination of tools), you can recover from
> the
> >>> failure. Each failure that you have puts you behind. If you have
> several
> >>> failures, you will be backlogged and need a mechanism to catch up.
> >>>
> >>> For data growth, you would need to go back to the source of the data
> and
> >>> estimate the row cardinality. If this is coming from an OLTP system,
> then
> >> it
> >>> is related to volume of transactions in the business process. If you do
> >> not
> >>> understand that load, your system will eventually start failing in the
> >>> future with Drill or 

Re: regex replace in string

2017-07-26 Thread Divya Gehlot
Hi,
I have already set the plugin configuration to extractHeader: true,
and I followed the link below:
https://drill.apache.org/docs/lesson-2-run-queries-with-ansi-sql/

SELECT REGEXP_REPLACE(CAST(`Column1` AS VARCHAR(100)), '[,".]', '') AS
`Col1` FROM
dfs.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`

Just extracting a column which has special characters, including the delimiter
as one of the special characters, gives me an empty result set.

Am I missing something ?

Appreciate the help.

Thanks,
Divya

On 27 July 2017 at 12:23, Paul Rogers  wrote:

> Hi Divya,
>
> I presume that “sample_data.csv” is your file? The default CSV
> configuration reads files without headers and puts all columns into a
> single array called “columns”. Do a SELECT * and you’ll see an array that
> contains your data:
>
> [“Fred”, “Flintstone”]
>
> So, the correct query would be:
>
> SELECT REGEXP_REPLACE(CAST(columns[0] AS VARCHAR(100)), '[,".]', '') FROM
> dfs.`installedsoftwares/ApacheDrill/apache-drill-1.10.
> 0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`
>
> I notice the regex is messing with brackets. Are you trying to convert the
> array form shown above to a cleaner list? Won’t work: those brackets are
> not in the data; they are the textual sugar added to show the array when
> printing.
>
> Maybe what you want is:
>
> SELECT columns[0] as `a`, columns[1] as `b` …
>
> Or, if your file actually contains headers, use a table function (or
> storage plugin config) to specify to use the headings to create individual
> columns. See the example at [1] under “Using the Formats Attributes as
> Table Function Parameters”.
>
> - Paul
>
> [1] https://drill.apache.org/docs/plugin-configuration-basics/
>
> > On Jul 26, 2017, at 8:22 PM, Divya Gehlot 
> wrote:
> >
> > Another thing I observed is that when I run the below query
> > SELECT  REGEXP_REPLACE('"This, col7 data yes."', '[,".]', '') FROM
> > (VALUES(1))
> > EXPR$0
> > This col7 data yes
> >
> >
> > But when I run the same against the CSV file, it gives me an empty result set:
> > SELECT REGEXP_REPLACE(CAST(`Column1` AS VARCHAR(100)), '[,".]', '') FROM
> > dfs.`installedsoftwares/ApacheDrill/apache-drill-1.10.
> 0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`
> >
> > EXPR$0
> >
> >
> > P.S. As advised, I ran all the above queries on Drill 1.11
> >
> > Appreciate the help .
> >
> > Thanks,
> > Divya
> >
> > On 27 July 2017 at 09:54, Divya Gehlot  wrote:
> >
> >> Hi,
> >> Please find attached the sample_data.csv file
> >> Pasting the content of the CSV file below, in case the attachment doesn't
> >> reach:
> >>
> >>> Column1,Column2,Column3,Column4,Column5
> >>> colonedata1,coltwodata1,-35.924476,138.5987123,
> >>> colonedata2,coltwodata2,-27.4372536,153.0304583,137
> >>> colonedata3,coltwodata3,-35.2793885,149.1233503,134
> >>> colonedata4,coltwodata4,-33.8724176,151.2067579,
> >>> colonedata5,coltwodata5,,,
> >>> "This, col6 data",coltwodata6,-33.869732,151.203,351
> >>> "This, col7 data yes.",coltwodata7,1.2845045,103.8482739,80
> >>> Chifley,coltwodata5,,,
> >>
> >>
> >> Error :
> >>
> >>> Query Failed: An Error Occurred
> >>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> >>> IllegalArgumentException: reallocation size must be non-negative
> Fragment
> >>> 0:0
> >>
> >>
> >>
> >> Thanks all for the help.
> >>
> >> Thanks ,
> >> Divya
> >>
> >>
> >> On 26 July 2017 at 23:31, Paul Rogers  wrote:
> >>
> >>> Hi Divya,
> >>>
> >>> We found a couple of issues in CSV files that would lead to the kind of
> >>> errors you encountered. These issues will be fixed in the upcoming
> Drill
> >>> 1.11 release.
> >>>
> >>> Sharing a sample CSV file will let us check the issue. Even better,
> >>> voting is open for the 1.11 release. Please go ahead and download it
> and
> >>> try your file with that release. Let us know if you still have a
> problem.
> >>>
> >>> Thanks,
> >>>
> >>> - Paul
> >>>
>  On Jul 26, 2017, at 6:14 AM, Khurram Faraaz  wrote:
> 
 Can you please share your CSV file, the SQL query, and the version of
 Drill that you are on, so someone can take a look and try to reproduce
 the error that you are seeing?
> 
> 
>  Thanks,
> 
>  Khurram
> 
>  
>  From: Divya Gehlot 
>  Sent: Wednesday, July 26, 2017 3:18:08 PM
>  To: user@drill.apache.org
>  Subject: regex replace in string
> 
>  Hi,
>  I have a CSV file where  column values are
>  "This is the column,one "
>  "This is column , two"
>  column3
>  column4
> 
>  When I try to regex_replace it throws error
> 
>  org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> > IllegalArgumentException: reallocation size must be non-negative
> >>> Fragment

Re: regex replace in string

2017-07-26 Thread Paul Rogers
Hi Divya,

I presume that “sample_data.csv” is your file? The default CSV configuration 
reads files without headers and puts all columns into a single array called 
“columns”. Do a SELECT * and you’ll see an array that contains
your data:

[“Fred”, “Flintstone”]

So, the correct query would be:

SELECT REGEXP_REPLACE(CAST(columns[0] AS VARCHAR(100)), '[,".]', '') FROM
dfs.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`

I notice the regex is messing with brackets. Are you trying to convert the 
array form shown above to a cleaner list? Won’t work: those brackets are not in 
the data; they are the textual sugar added to show the array when printing.

Maybe what you want is:

SELECT columns[0] as `a`, columns[1] as `b` …

Or, if your file actually contains headers, use a table function (or storage 
plugin config) to specify to use the headings to create individual columns. See 
the example at [1] under “Using the Formats Attributes as Table Function 
Parameters”.
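
For example, a table function call along those lines might look like this (an
untested sketch; the path is abbreviated, and the column names assume the
header row in your file):

SELECT Column1, Column2
FROM table(dfs.`sample-data/sample_data.csv`(
  type => 'text', fieldDelimiter => ',', extractHeader => true));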

- Paul

[1] https://drill.apache.org/docs/plugin-configuration-basics/

> On Jul 26, 2017, at 8:22 PM, Divya Gehlot  wrote:
> 
> Another thing I observed is that when I run the below query
> SELECT  REGEXP_REPLACE('"This, col7 data yes."', '[,".]', '') FROM
> (VALUES(1))
> EXPR$0
> This col7 data yes
> 
> 
> But when I run the same against the CSV file, it gives me an empty result set:
> SELECT REGEXP_REPLACE(CAST(`Column1` AS VARCHAR(100)), '[,".]', '') FROM
> dfs.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`
> 
> EXPR$0
> 
> 
> P.S. As advised, I ran all the above queries on Drill 1.11
> 
> Appreciate the help .
> 
> Thanks,
> Divya
> 
> On 27 July 2017 at 09:54, Divya Gehlot  wrote:
> 
>> Hi,
>> Please find attached the sample_data.csv file
>> Pasting the content of the CSV file below, in case the attachment doesn't
>> reach:
>> 
>>> Column1,Column2,Column3,Column4,Column5
>>> colonedata1,coltwodata1,-35.924476,138.5987123,
>>> colonedata2,coltwodata2,-27.4372536,153.0304583,137
>>> colonedata3,coltwodata3,-35.2793885,149.1233503,134
>>> colonedata4,coltwodata4,-33.8724176,151.2067579,
>>> colonedata5,coltwodata5,,,
>>> "This, col6 data",coltwodata6,-33.869732,151.203,351
>>> "This, col7 data yes.",coltwodata7,1.2845045,103.8482739,80
>>> Chifley,coltwodata5,,,
>> 
>> 
>> Error :
>> 
>>> Query Failed: An Error Occurred
>>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>>> IllegalArgumentException: reallocation size must be non-negative Fragment
>>> 0:0
>> 
>> 
>> 
>> Thanks all for the help.
>> 
>> Thanks ,
>> Divya
>> 
>> 
>> On 26 July 2017 at 23:31, Paul Rogers  wrote:
>> 
>>> Hi Divya,
>>> 
>>> We found a couple of issues in CSV files that would lead to the kind of
>>> errors you encountered. These issues will be fixed in the upcoming Drill
>>> 1.11 release.
>>> 
>>> Sharing a sample CSV file will let us check the issue. Even better,
>>> voting is open for the 1.11 release. Please go ahead and download it and
>>> try your file with that release. Let us know if you still have a problem.
>>> 
>>> Thanks,
>>> 
>>> - Paul
>>> 
 On Jul 26, 2017, at 6:14 AM, Khurram Faraaz  wrote:
 
 Can you please share your CSV file, the SQL query, and the version of
>>> Drill that you are on, so someone can take a look and try to reproduce the
>>> error that you are seeing?
 
 
 Thanks,
 
 Khurram
 
 
 From: Divya Gehlot 
 Sent: Wednesday, July 26, 2017 3:18:08 PM
 To: user@drill.apache.org
 Subject: regex replace in string
 
 Hi,
 I have a CSV file where  column values are
 "This is the column,one "
 "This is column , two"
 column3
 column4
 
 When I try to regex_replace it throws error
 
 org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: reallocation size must be non-negative
>>> Fragment
> 0:0
 
 
 How can I read the above columns as one string, like "This is the column
 one"?
 Appreciate the help.
 
 Thanks,
 Divyab
>>> 
>>> 
>> 



Re: append data to already existing table saved in parquet format

2017-07-26 Thread Paul Rogers
Hi All,

Saurabh, you are right. But, since Parquet does not allow appending to existing 
files, we have to do the logical equivalent which is to create a new Parquet 
file. For it to be part of the same “table” it must be part of an existing 
partition structure as Divya described.

The trick here is to choose a proper time grain. Too small and you end up with
a very large number of files, and performance will suffer. (Once a second, for
example, is too frequent.) Too large and people don’t get near-real-time
results. But with the hourly grain Divya is using, the number of files per
directory will not be too large, and each file might be of a reasonable size.

Using Kafka to batch data would be a fine idea.

Of course, this is still not as good as Saurabh's former project, Druid, which 
builds aggregated cubes on the fly and has a lambda architecture to allow 
querying both immediate and historical data. Still, Divya’s design can work 
fine for some use cases when latency is not an issue and data volume is 
reasonable.

It would help if Drill had INSERT INTO support. But, I wonder, can it be made 
to work with Drill today? Perhaps the query can simply include the proper 
target directory in the CTAS statement. That is, data for 2017-07-25 02:00
would go into “2017/07/25/0200.parquet”, say. That is, do-it-yourself
partitioning. I hope Drill won’t care how the Parquet files got into the 
directories, only that the directories have the expected structure. (Is this 
accurate? Haven’t tried it myself…)
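
To make that concrete, here is a minimal untested sketch of such an hourly job
(the dfs.tmp workspace, the hourly_view source, and the column names are
placeholders, not something from this thread):

-- Write one hour of data into its own subdirectory of the existing
-- `mytable` directory tree; names and the time filter are illustrative.
CREATE TABLE dfs.tmp.`mytable/2017/07/25/0200` AS
SELECT col1, col2
FROM dfs.tmp.`hourly_view`
WHERE event_time >= TIMESTAMP '2017-07-25 02:00:00'
  AND event_time <  TIMESTAMP '2017-07-25 03:00:00';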

With single-threaded, hourly updates, there is no worry about the name 
collisions and other tricky issues that INSERT INTO will have to solve.

Divya, have you tried this solution?

Thanks,

- Paul

> On Jul 26, 2017, at 7:32 PM, Saurabh Mahapatra  
> wrote:
> 
> But append-only means you are adding event records to a table (forget the
> layout for a while). That means you have to write to the end of the table. If
> the writes are too many, you have to batch them and then convert them into a
> columnar format.
> 
> This to me sounds like a Kafka workflow where you keep ingesting event
> data, then batch process it (or stream process it). Writing or appending to
> a columnar store when your data is in a row-like format does not sound
> efficient at all. I have not seen such a design in systems that actually
> work. I know there are query engines that try to do that, but their use is
> limited. You cannot scale.
> 
> I always think of Parquet or a columnar data store as the repository of 
> historical data that came from the OLTP world. You do not want to touch it 
> once you created it. You want to have a strategy where you batch the recent 
> data, create the historical data and move on. 
> 
> My 2 cents.
> 
> Saurabh
> 
> On Jul 26, 2017, at 6:58 PM, Divya Gehlot  wrote:
> 
> Yes Paul, I am looking for the insert-into-partition feature.
> That way we just have to create the file for that particular partition
> when new data comes in, or update it if required.
> Otherwise, every time data comes in we have to run the view and recreate the
> Parquet files for the whole data set, which is very time consuming,
> especially when the data is being visualized in a real-time dashboard.
> 
> Thanks,
> Divya
> 
>> On 27 July 2017 at 08:40, Paul Rogers  wrote:
>> 
>> Hi Divya,
>> 
>> Seems that you are asking for an “INSERT INTO” feature (DRILL-3534). The
>> idea would be to create new Parquet files into an existing partition
>> structure. That feature has not yet been started. So, the workarounds
>> provided might help you for now.
>> 
>> - Paul
>> 
>>> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <
>> saurabhmahapatr...@gmail.com> wrote:
>>> 
>>> Does Drill provide that kind of functionality? Theoretically yes. CTAS
>>> should work. But your cluster has to be sized. But I would never put
>>> something in such a pipeline without adequate testing. And I would always
>>> consider a lambda architecture to ensure that if this path were to fail
>>> (with Drill or any other combination of tools), you can recover from the
>>> failure. Each failure that you have puts you behind. If you have several
>>> failures, you will be backlogged and need a mechanism to catch up.
>>> 
>>> For data growth, you would need to go back to the source of the data and
>>> estimate the row cardinality. If this is coming from an OLTP system, then
>> it
>>> is related to volume of transactions in the business process. If you do
>> not
>>> understand that load, your system will eventually start failing in the
>>> future with Drill or otherwise.
>>> 
>>> Sizing and testing. Just do it.
>>> 
>>> Thanks,
>>> Saurabh
>>> 
>>> 
>>> 
>>> On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot 
>>> wrote:
>>> 
 The data size is not big for every hour, but the data size will grow with
 the time; say if I have data for 2 years and data is coming on 

Re: regex replace in string

2017-07-26 Thread Divya Gehlot
Another thing I observed is that when I run the below query
SELECT  REGEXP_REPLACE('"This, col7 data yes."', '[,".]', '') FROM
(VALUES(1))
EXPR$0
This col7 data yes


But when I run the same against the CSV file, it gives me an empty result set:
SELECT REGEXP_REPLACE(CAST(`Column1` AS VARCHAR(100)), '[,".]', '') FROM
dfs.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`

EXPR$0


P.S. As advised, I ran all the above queries on Drill 1.11

Appreciate the help .

Thanks,
Divya

On 27 July 2017 at 09:54, Divya Gehlot  wrote:

> Hi,
> Please find attached the sample_data.csv file
> Pasting the content of the CSV file below, in case the attachment doesn't
> reach:
>
>> Column1,Column2,Column3,Column4,Column5
>> colonedata1,coltwodata1,-35.924476,138.5987123,
>> colonedata2,coltwodata2,-27.4372536,153.0304583,137
>> colonedata3,coltwodata3,-35.2793885,149.1233503,134
>> colonedata4,coltwodata4,-33.8724176,151.2067579,
>> colonedata5,coltwodata5,,,
>> "This, col6 data",coltwodata6,-33.869732,151.203,351
>> "This, col7 data yes.",coltwodata7,1.2845045,103.8482739,80
>> Chifley,coltwodata5,,,
>
>
> Error :
>
>> Query Failed: An Error Occurred
>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>> IllegalArgumentException: reallocation size must be non-negative Fragment
>> 0:0
>
>
>
> Thanks all for the help.
>
> Thanks ,
> Divya
>
>
> On 26 July 2017 at 23:31, Paul Rogers  wrote:
>
>> Hi Divya,
>>
>> We found a couple of issues in CSV files that would lead to the kind of
>> errors you encountered. These issues will be fixed in the upcoming Drill
>> 1.11 release.
>>
>> Sharing a sample CSV file will let us check the issue. Even better,
>> voting is open for the 1.11 release. Please go ahead and download it and
>> try your file with that release. Let us know if you still have a problem.
>>
>> Thanks,
>>
>> - Paul
>>
>> > On Jul 26, 2017, at 6:14 AM, Khurram Faraaz  wrote:
>> >
>> > Can you please share your CSV file, the SQL query, and the version of
>> > Drill that you are on, so someone can take a look and try to reproduce the
>> > error that you are seeing?
>> >
>> >
>> > Thanks,
>> >
>> > Khurram
>> >
>> > 
>> > From: Divya Gehlot 
>> > Sent: Wednesday, July 26, 2017 3:18:08 PM
>> > To: user@drill.apache.org
>> > Subject: regex replace in string
>> >
>> > Hi,
>> > I have a CSV file where  column values are
>> > "This is the column,one "
>> > "This is column , two"
>> > column3
>> > column4
>> >
>> > When I try to regex_replace it throws error
>> >
>> > org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>> >> IllegalArgumentException: reallocation size must be non-negative
>> Fragment
>> >> 0:0
>> >
>> >
>> > How can I read the above columns as one string, like "This is the column
>> > one"?
>> > Appreciate the help.
>> >
>> > Thanks,
>> > Divyab
>>
>>
>


Re: append data to already existing table saved in parquet format

2017-07-26 Thread Saurabh Mahapatra
But append-only means you are adding event records to a table (forget the layout
for a while). That means you have to write to the end of the table. If the writes
are too many, you have to batch them and then convert them into a columnar
format.

This to me sounds like a Kafka workflow where you keep ingesting event data,
then batch process it (or stream process it). Writing or appending to a
columnar store when your data is in a row-like format does not sound efficient
at all. I have not seen such a design in systems that actually work. I know
there are query engines that try to do that, but their use is limited. You
cannot scale.

I always think of Parquet or a columnar data store as the repository of 
historical data that came from the OLTP world. You do not want to touch it once 
you created it. You want to have a strategy where you batch the recent data, 
create the historical data and move on. 

My 2 cents.

Saurabh

On Jul 26, 2017, at 6:58 PM, Divya Gehlot  wrote:

Yes Paul, I am looking for the insert-into-partition feature.
That way we just have to create the file for that particular partition
when new data comes in, or update it if required.
Otherwise, every time data comes in we have to run the view and recreate the
Parquet files for the whole data set, which is very time consuming, especially
when the data is being visualized in a real-time dashboard.

Thanks,
Divya

> On 27 July 2017 at 08:40, Paul Rogers  wrote:
> 
> Hi Divya,
> 
> Seems that you are asking for an “INSERT INTO” feature (DRILL-3534). The
> idea would be to create new Parquet files into an existing partition
> structure. That feature has not yet been started. So, the workarounds
> provided might help you for now.
> 
> - Paul
> 
>> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <
> saurabhmahapatr...@gmail.com> wrote:
>> 
>> Does Drill provide that kind of functionality? Theoretically yes. CTAS
>> should work. But your cluster has to be sized. But I would never put
>> something in such a pipeline without adequate testing. And I would always
>> consider a lambda architecture to ensure that if this path were to fail
>> (with Drill or any other combination of tools), you can recover from the
>> failure. Each failure that you have puts you behind. If you have several
>> failures, you will be backlogged and need a mechanism to catch up.
>> 
>> For data growth, you would need to go back to the source of the data and
 estimate the row cardinality. If this is coming from an OLTP system, then
> it
>> is related to volume of transactions in the business process. If you do
> not
>> understand that load, your system will eventually start failing in the
>> future with Drill or otherwise.
>> 
>> Sizing and testing. Just do it.
>> 
>> Thanks,
>> Saurabh
>> 
>> 
>> 
>> On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot 
>> wrote:
>> 
>>> The data size is not big for every hour, but the data size will grow with
>>> time; say I have data for 2 years and data is coming in on an hourly basis,
>>> and recreating the Parquet table every time is not a feasible solution.
>>> Likewise, in Hive you create the partition and insert the data into that
>>> partition accordingly.
>>> I was looking for that kind of solution.
>>> Does Drill provide that kind of functionality?
>>> 
>>> Thanks,
>>> Divya
>>> 
>>> 
>>> On 26 July 2017 at 15:04, Saurabh Mahapatra <
> saurabhmahapatr...@gmail.com>
>>> wrote:
>>> 
 I always recommend against using CTAS as a shortcut for an ETL-type
> large
 workload. You will need to size your Drill cluster accordingly.
> Consider
 using Hive or Spark instead.
 
 What are the source file formats? For every hour, what is the size and
> the
 number of rows for that data? Are you doing any aggregations? And what
> is
 the lag between the streaming data and data available for analytics
> that
 you are willing to tolerate?
 
 On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <
 challapallira...@gmail.com> wrote:
 
> I am not aware of any clean way to do this. However if your data is
> partitioned based on directories, then you can use the below hack
> which
> leverages temporary tables [1]. Essentially, you backup your partition
 to a
> temp table, then override it by taking the union of new partition data
 and
> existing partition data. This way we are not over-writing the entire
 table.
> 
> create temporary table mytable_2017 (col1, col2) as select col1, col2,
> ... from mytable where dir0 = "2017";
> drop table `mytable/2017`;
> create table `mytable/2017` as
>   select col1, col2, ... from new_partition_data
>   union
>   select col1, col2, ... from mytable_2017;
> drop table mytable_2017;
> 
> Caveat : Temporary tables get dropped automatically if the session
> ends
 or
> the drillbit crashes. In the above sequence, if 

Re: append data to already existing table saved in parquet format

2017-07-26 Thread Divya Gehlot
Yes Paul, I am looking for the insert-into-partition feature.
That way we just have to create the file for that particular partition
when new data comes in, or update it if required.
Otherwise, every time data comes in we have to run the view and recreate the
Parquet files for the whole data set, which is very time consuming, especially
when the data is being visualized in a real-time dashboard.

Thanks,
Divya

On 27 July 2017 at 08:40, Paul Rogers  wrote:

> Hi Divya,
>
> Seems that you are asking for an “INSERT INTO” feature (DRILL-3534). The
> idea would be to create new Parquet files into an existing partition
> structure. That feature has not yet been started. So, the workarounds
> provided might help you for now.
>
> - Paul
>
> > On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <
> saurabhmahapatr...@gmail.com> wrote:
> >
> > Does Drill provide that kind of functionality? Theoretically yes. CTAS
> > should work. But your cluster has to be sized. But I would never put
> > something in such a pipeline without adequate testing. And I would always
> > consider a lambda architecture to ensure that if this path were to fail
> > (with Drill or any other combination of tools), you can recover from the
> > failure. Each failure that you have puts you behind. If you have several
> > failures, you will be backlogged and need a mechanism to catch up.
> >
> > For data growth, you would need to go back to the source of the data and
> > estimate the row cardinality. If this is coming from an OLTP system, then
> it
> > is related to volume of transactions in the business process. If you do
> not
> > understand that load, your system will eventually start failing in the
> > future with Drill or otherwise.
> >
> > Sizing and testing. Just do it.
> >
> > Thanks,
> > Saurabh
> >
> >
> >
> > On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot 
> > wrote:
> >
> >> The data size is not big for every hour, but the data size will grow with
> >> time; say I have data for 2 years and data is coming in on an hourly basis,
> >> and recreating the Parquet table every time is not a feasible solution.
> >> Likewise, in Hive you create the partition and insert the data into that
> >> partition accordingly.
> >> I was looking for that kind of solution.
> >> Does Drill provide that kind of functionality?
> >>
> >> Thanks,
> >> Divya
> >>
> >>
> >> On 26 July 2017 at 15:04, Saurabh Mahapatra <
> saurabhmahapatr...@gmail.com>
> >> wrote:
> >>
> >>> I always recommend against using CTAS as a shortcut for an ETL-type
> large
> >>> workload. You will need to size your Drill cluster accordingly.
> Consider
> >>> using Hive or Spark instead.
> >>>
> >>> What are the source file formats? For every hour, what is the size and
> the
> >>> number of rows for that data? Are you doing any aggregations? And what
> is
> >>> the lag between the streaming data and data available for analytics
> that
> >>> you are willing to tolerate?
> >>>
> >>> On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <
> >>> challapallira...@gmail.com> wrote:
> >>>
>  I am not aware of any clean way to do this. However if your data is
>  partitioned based on directories, then you can use the below hack
> which
>  leverages temporary tables [1]. Essentially, you backup your partition
> >>> to a
>  temp table, then override it by taking the union of new partition data
> >>> and
>  existing partition data. This way we are not over-writing the entire
> >>> table.
> 
>  create temporary table mytable_2017 (col1, col2) as select col1, col2,
>  ... from mytable where dir0 = "2017";
>  drop table `mytable/2017`;
>  create table `mytable/2017` as
>    select col1, col2, ... from new_partition_data
>    union
>    select col1, col2, ... from mytable_2017;
>  drop table mytable_2017;
> 
>  Caveat : Temporary tables get dropped automatically if the session
> ends
> >>> or
>  the drillbit crashes. In the above sequence, if the connection gets
> >>> dropped
>  (there are known issues causing this) between the client and drillbit
> >>> after
>  executing the "DROP" statement, then your partition data is lost
> >>> forever.
>  And since drill doesn't support transactions, the mentioned approach
> is
>  dangerous.
> 
>  [1] https://drill.apache.org/docs/create-temporary-table-as-cttas/
> 
> 
>  On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot <
> divya.htco...@gmail.com
> 
>  wrote:
> 
> > Hi,
> > I am new to Apache Drill.
> > As I have data coming in every hour, when I searched I couldn't find the
> > insert into partition command in Apache Drill.
> > How can we insert data into a particular partition without rewriting the
> > whole data set?
> >
> >
> > Appreciate the help.
> > Thanks,
> > Divya
> >
> 
> >>>
> >>
> >>
>
>


Re: regex replace in string

2017-07-26 Thread Divya Gehlot
Hi,
Please find attached the sample_data.csv file
Pasting the content of the CSV file below, in case the attachment doesn't
reach:

> Column1,Column2,Column3,Column4,Column5
> colonedata1,coltwodata1,-35.924476,138.5987123,
> colonedata2,coltwodata2,-27.4372536,153.0304583,137
> colonedata3,coltwodata3,-35.2793885,149.1233503,134
> colonedata4,coltwodata4,-33.8724176,151.2067579,
> colonedata5,coltwodata5,,,
> "This, col6 data",coltwodata6,-33.869732,151.203,351
> "This, col7 data yes.",coltwodata7,1.2845045,103.8482739,80
> Chifley,coltwodata5,,,


Error :

> Query Failed: An Error Occurred
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: reallocation size must be non-negative Fragment
> 0:0



Thanks all for the help.

Thanks ,
Divya


On 26 July 2017 at 23:31, Paul Rogers  wrote:

> Hi Divya,
>
> We found a couple of issues in CSV files that would lead to the kind of
> errors you encountered. These issues will be fixed in the upcoming Drill
> 1.11 release.
>
> Sharing a sample CSV file will let us check the issue. Even better, voting
> is open for the 1.11 release. Please go ahead and download it and try your
> file with that release. Let us know if you still have a problem.
>
> Thanks,
>
> - Paul
>
> > On Jul 26, 2017, at 6:14 AM, Khurram Faraaz  wrote:
> >
> > Can you please share your CSV file, the SQL query, and the version of
> > Drill that you are on, so someone can take a look and try to reproduce the
> > error that you are seeing?
> >
> >
> > Thanks,
> >
> > Khurram
> >
> > 
> > From: Divya Gehlot 
> > Sent: Wednesday, July 26, 2017 3:18:08 PM
> > To: user@drill.apache.org
> > Subject: regex replace in string
> >
> > Hi,
> > I have a CSV file where  column values are
> > "This is the column,one "
> > "This is column , two"
> > column3
> > column4
> >
> > When I try to regex_replace it throws error
> >
> > org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> >> IllegalArgumentException: reallocation size must be non-negative
> Fragment
> >> 0:0
> >
> >
> > How can I read the above columns as one string, like "This is the column
> > one"?
> > Appreciate the help.
> >
> > Thanks,
> > Divyab
>
>


Re: append data to already existing table saved in parquet format

2017-07-26 Thread Paul Rogers
Hi Divya,

Seems that you are asking for an “INSERT INTO” feature (DRILL-3534). The idea 
would be to create new Parquet files into an existing partition structure. That 
feature has not yet been started. So, the workarounds provided might help you 
for now.

- Paul

> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra  
> wrote:
> 
> Does Drill provide that kind of functionality? Theoretically yes. CTAS
> should work. But your cluster has to be sized. But I would never put
> something in such a pipeline without adequate testing. And I would always
> consider a lambda architecture to ensure that if this path were to fail
> (with Drill or any other combination of tools), you can recover from the
> failure. Each failure that you have puts you behind. If you have several
> failures, you will be backlogged and need a mechanism to catch up.
> 
> For data growth, you would need to go back to the source of the data and
> estimate the row cardinality. If this is coming from an OLTP system, then it
> is related to volume of transactions in the business process. If you do not
> understand that load, your system will eventually start failing in the
> future with Drill or otherwise.
> 
> Sizing and testing. Just do it.
> 
> Thanks,
> Saurabh
> 
> 
> 
> On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot 
> wrote:
> 
>> The data size is not big for every hour, but the data size will grow with
>> time; say I have data for 2 years and data is coming in on an hourly basis,
>> and recreating the Parquet table every time is not a feasible solution.
>> Likewise, in Hive you create the partition and insert the data into that
>> partition accordingly.
>> I was looking for that kind of solution.
>> Does Drill provide that kind of functionality?
>> 
>> Thanks,
>> Divya
>> 
>> 
>> On 26 July 2017 at 15:04, Saurabh Mahapatra 
>> wrote:
>> 
>>> I always recommend against using CTAS as a shortcut for an ETL-type large
>>> workload. You will need to size your Drill cluster accordingly. Consider
>>> using Hive or Spark instead.
>>> 
>>> What are the source file formats? For every hour, what is the size and the
>>> number of rows for that data? Are you doing any aggregations? And what is
>>> the lag between the streaming data and data available for analytics that
>>> you are willing to tolerate?
>>> 
>>> On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <
>>> challapallira...@gmail.com> wrote:
>>> 
 I am not aware of any clean way to do this. However if your data is
 partitioned based on directories, then you can use the below hack which
 leverages temporary tables [1]. Essentially, you backup your partition
>>> to a
 temp table, then override it by taking the union of new partition data
>>> and
 existing partition data. This way we are not over-writing the entire
>>> table.
 
 create temporary table mytable_2017 (col1, col2) as select col1, col2,
 ... from mytable where dir0 = "2017";
 drop table `mytable/2017`;
 create table `mytable/2017` as
   select col1, col2, ... from new_partition_data
   union
   select col1, col2, ... from mytable_2017;
 drop table mytable_2017;
 
 Caveat : Temporary tables get dropped automatically if the session ends
>>> or
 the drillbit crashes. In the above sequence, if the connection gets
>>> dropped
 (there are known issues causing this) between the client and drillbit
>>> after
 executing the "DROP" statement, then your partition data is lost
>>> forever.
 And since drill doesn't support transactions, the mentioned approach is
 dangerous.
 
 [1] https://drill.apache.org/docs/create-temporary-table-as-cttas/
 
 
 On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot wrote:
> Hi,
> I am new to Apache Drill.
> As I have data coming in every hour, when I searched I couldn't find the
> insert into partition command in Apache Drill.
> How can we insert data into a particular partition without rewriting the
> whole data set?
> 
> 
> Appreciate the help.
> Thanks,
> Divya
> 
 
>>> 
>> 
>> 



Re: 1.11.0 RC question

2017-07-26 Thread Bob Rudis
Oh I'm an idiot. I'll add the pcap format after dinner and try again.

Thx for the quick and helpful response!

-boB
On Wed, Jul 26, 2017 at 18:03 Parth Chandra  wrote:

> You might have to add the pcap format in the dfs storage plugin config [1]
>
> Something like this :
>
> "formats": {
> "csv": {
>   "type": "text",
>   "extensions": [
> "csv"
>   ],
>   "delimiter": ","
> },
> "parquet": {
>   "type": "parquet"
> },
> "json": {
>   "type": "json",
>   "extensions": [
> "json"
>   ]
> },
> "abc": {
>   "type": "json",
>   "extensions": [
> "abc"
>   ]
> },
> "pcap": {
>   "type": "pcap",
>   "extensions": [
> "pcap"
>   ]
> }
>   }
>
>  Then you can specify "pcap" as the default format for any workspace.
> Configuring that is also described in [1].
>
> Parth
>
> [1] https://drill.apache.org/docs/plugin-configuration-basics/
>
> On Wed, Jul 26, 2017 at 1:36 PM, Jinfeng Ni  wrote:
>
> > Hi Bob,
> >
> > Is DRILL-5432 the one you are talking about? I saw it's merged and should
> > have been put in the release candidate.
> >
> > What type of error did you see when you tried to query a PCAP? Also, it
> may
> > help to provide the commit id of your build, by running the following query:
> >
> > SELECT * from sys.version;
> >
> >
> > https://issues.apache.org/jira/browse/DRILL-5432
> >
> >
> > On Wed, Jul 26, 2017 at 1:03 PM, Bob Rudis  wrote:
> >
> > > I wasn't sure if this belonged on the dev list or not but I was
> > > peeking around the JIRA for 1.11.0 RC and noticed that it _looked_
> > > like PCAP support is/was going to be in 1.11.0 but when I did a quick
> > > d/l and test of the RC (early yesterday) and tried to query a PCAP it
> > > did not work.
> > >
> > > I'm wondering if I just grabbed a too-early RC and shld try again or
> > > if PCAP support will miss the 1.11.0 release. (I might have misread a
> > > tweet from Ted that seemed to suggest it might not make it for
> > > 1.11.0).
> > >
> > > If it's the latter, will that mean the mapr github pcap drill example
> > > shld work as an interim substitute until 1.12.0 (NOTE: I haven't tried
> > > that yet)?
> > >
> > > If PCAP support had not previously, actually made the cut for 1.11.0
> > > RC can I make a last-minute req to have it be included? :-)
> > >
> > > thx for the hard work by the dev team. I ended up scanning through all
> > > the JIRAs and that's _alot_ of work. it's definitely appreciated.
> > >
> > > thx,
> > >
> > > -Bob
> > >
> >
>


Re: 1.11.0 RC question

2017-07-26 Thread Parth Chandra
You might have to add the pcap format in the dfs storage plugin config [1]

Something like this :

"formats": {
"csv": {
  "type": "text",
  "extensions": [
"csv"
  ],
  "delimiter": ","
},
"parquet": {
  "type": "parquet"
},
"json": {
  "type": "json",
  "extensions": [
"json"
  ]
},
"abc": {
  "type": "json",
  "extensions": [
"abc"
  ]
},
"pcap": {
  "type": "pcap",
  "extensions": [
"pcap"
  ]
}
  }

 Then you can specify "pcap" as the default format for any workspace.
Configuring that is also described in [1].
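
Once the format entry is in place, a query along these lines should work (a
sketch; the file path is hypothetical):

SELECT * FROM dfs.`/path/to/capture.pcap` LIMIT 10;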

Parth

[1] https://drill.apache.org/docs/plugin-configuration-basics/

On Wed, Jul 26, 2017 at 1:36 PM, Jinfeng Ni  wrote:

> Hi Bob,
>
> Is DRILL-5432 the one you are talking about? I saw it's merged and should
> have been put in the release candidate.
>
> What type of error did you see when you tried to query a PCAP? Also, it may
> help to provide the commit id of your build, by running the following query:
>
> SELECT * from sys.version;
>
>
> https://issues.apache.org/jira/browse/DRILL-5432
>
>
> On Wed, Jul 26, 2017 at 1:03 PM, Bob Rudis  wrote:
>
> > I wasn't sure if this belonged on the dev list or not but I was
> > peeking around the JIRA for 1.11.0 RC and noticed that it _looked_
> > like PCAP support is/was going to be in 1.11.0 but when I did a quick
> > d/l and test of the RC (early yesterday) and tried to query a PCAP it
> > did not work.
> >
> > I'm wondering if I just grabbed a too-early RC and shld try again or
> > if PCAP support will miss the 1.11.0 release. (I might have misread a
> > tweet from Ted that seemed to suggest it might not make it for
> > 1.11.0).
> >
> > If it's the latter, will that mean the mapr github pcap drill example
> > shld work as an interim substitute until 1.12.0 (NOTE: I haven't tried
> > that yet)?
> >
> > If PCAP support had not previously, actually made the cut for 1.11.0
> > RC can I make a last-minute req to have it be included? :-)
> >
> > thx for the hard work by the dev team. I ended up scanning through all
> > the JIRAs and that's _alot_ of work. it's definitely appreciated.
> >
> > thx,
> >
> > -Bob
> >
>


Re: 1.11.0 RC question

2017-07-26 Thread Jinfeng Ni
Hi Bob,

Is DRILL-5432 the one you are talking about? I saw it's merged and should
have been put in the release candidate.

What type of error did you see when you tried to query a PCAP? Also, it may
help to provide the commit id of your build, by running the following query:

SELECT * from sys.version;


https://issues.apache.org/jira/browse/DRILL-5432


On Wed, Jul 26, 2017 at 1:03 PM, Bob Rudis  wrote:

> I wasn't sure if this belonged on the dev list or not but I was
> peeking around the JIRA for 1.11.0 RC and noticed that it _looked_
> like PCAP support is/was going to be in 1.11.0 but when I did a quick
> d/l and test of the RC (early yesterday) and tried to query a PCAP it
> did not work.
>
> I'm wondering if I just grabbed a too-early RC and shld try again or
> if PCAP support will miss the 1.11.0 release. (I might have misread a
> tweet from Ted that seemed to suggest it might not make it for
> 1.11.0).
>
> If it's the latter, will that mean the mapr github pcap drill example
> shld work as an interim substitute until 1.12.0 (NOTE: I haven't tried
> that yet)?
>
> If PCAP support had not previously, actually made the cut for 1.11.0
> RC can I make a last-minute req to have it be included? :-)
>
> thx for the hard work by the dev team. I ended up scanning through all
> the JIRAs and that's _alot_ of work. it's definitely appreciated.
>
> thx,
>
> -Bob
>


1.11.0 RC question

2017-07-26 Thread Bob Rudis
I wasn't sure if this belonged on the dev list or not but I was
peeking around the JIRA for 1.11.0 RC and noticed that it _looked_
like PCAP support is/was going to be in 1.11.0 but when I did a quick
d/l and test of the RC (early yesterday) and tried to query a PCAP it
did not work.

I'm wondering if I just grabbed a too-early RC and shld try again or
if PCAP support will miss the 1.11.0 release. (I might have misread a
tweet from Ted that seemed to suggest it might not make it for
1.11.0).

If it's the latter, will that mean the mapr github pcap drill example
shld work as an interim substitute until 1.12.0 (NOTE: I haven't tried
that yet)?

If PCAP support had not previously, actually made the cut for 1.11.0
RC can I make a last-minute req to have it be included? :-)

thx for the hard work by the dev team. I ended up scanning through all
the JIRAs and that's _alot_ of work. it's definitely appreciated.

thx,

-Bob


Re: regex replace in string

2017-07-26 Thread Khurram Faraaz
regexp_replace function works on that data on Drill 1.11.0, commit id : 4220fb2

{noformat}

Data used was,

[root@centos-01 community]# cat rgex_replce.csv
"This is the column,one "
"This is column , two"
column3
column4

0: jdbc:drill:schema=dfs.tmp> select * from `rgex_replce.csv`;
+--+
|   columns|
+--+
| ["This is the column,one "]  |
| ["This is column , two"] |
| ["column3"]  |
| ["column4"]  |
+--+
4 rows selected (0.391 seconds)

// regexp_replace works as designed on Drill 1.11.0 commit id : 4220fb2

0: jdbc:drill:schema=dfs.tmp> select regexp_replace(cast(columns[0] as 
varchar(256)),'column','foobar') from `rgex_replce.csv`;
+--+
|  EXPR$0  |
+--+
| This is the foobar,one   |
| This is foobar , two |
| foobar3  |
| foobar4  |
+--+
4 rows selected (0.254 seconds)

0: jdbc:drill:schema=dfs.tmp> select 
regexp_replace(columns[0],'column','foobar') from `rgex_replce.csv`;
+--+
|  EXPR$0  |
+--+
| This is the foobar,one   |
| This is foobar , two |
| foobar3  |
| foobar4  |
+--+
4 rows selected (0.238 seconds)
{noformat}


Thanks,

Khurram


From: Kunal Khatua 
Sent: Wednesday, July 26, 2017 11:47:22 PM
To: user@drill.apache.org
Subject: RE: regex replace in string

Here is the reference mail for the release candidate of Drill 1.11.0

-Original Message-
From: Arina Yelchiyeva [mailto:arina.yelchiy...@gmail.com]
Sent: Tuesday, July 25, 2017 3:36 AM
To: d...@drill.apache.org
Subject: [VOTE] Release Apache Drill 1.11.0 - rc0

Hi all,

I'd like to propose the first release candidate (rc0) of Apache Drill, version 
1.11.0.

The release candidate covers a total of 126 resolved JIRAs [1]. Thanks to 
everyone who contributed to this release.

The tarball artifacts are hosted at [2] and the maven artifacts are hosted at 
[3].

This release candidate is based on commit
4220fb2fffbc81883df3e5fea575fa0a584852b3 located at [4].

The vote ends at 1:00 PM UTC (5:00 AM PT), July 28, 2017.

[ ] +1
[ ] +0
[ ] -1

Here's my vote: +1 (non-binding)


[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820&version=12339943
[2] http://home.apache.org/~arina/drill/releases/1.11.0/rc0/
[3] https://repository.apache.org/content/repositories/orgapachedrill-1042/
[4] https://github.com/arina-ielchiieva/drill/commits/drill-1.11.0

Kind regards
Arina

-Original Message-
From: Paul Rogers [mailto:prog...@mapr.com]
Sent: Wednesday, July 26, 2017 8:32 AM
To: user@drill.apache.org
Subject: Re: regex replace in string

Hi Divya,

We found a couple of issues in CSV files that would lead to the kind of errors 
you encountered. These issues will be fixed in the upcoming Drill 1.11 release.

Sharing a sample CSV file will let us check the issue. Even better, voting is 
open for the 1.11 release. Please go ahead and download it and try your file 
with that release. Let us know if you still have a problem.

Thanks,

- Paul

> On Jul 26, 2017, at 6:14 AM, Khurram Faraaz  wrote:
>
> Can you please share your CSV file, the SQL query, and the version of Drill
> that you are on, so someone can take a look and try to reproduce the error
> that you are seeing?
>
>
> Thanks,
>
> Khurram
>
> 
> From: Divya Gehlot 
> Sent: Wednesday, July 26, 2017 3:18:08 PM
> To: user@drill.apache.org
> Subject: regex replace in string
>
> Hi,
> I have a CSV file where  column values are "This is the column,one "
> "This is column , two"
> column3
> column4
>
> When I try to regex_replace it throws error
>
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>> IllegalArgumentException: reallocation size must be non-negative
>> Fragment
>> 0:0
>
>
> How can I read the above columns as one string, like "This is the
> column one"? Appreciate the help.
>
> Thanks,
> Divyab



RE: drill error connecting to Hbase

2017-07-26 Thread Kunal Khatua
The bundled projects (HBase, Hive) in CDH have their own versions. I'm
wondering if that is the source of the difference.

Drill has been tested with HBase 1.1.1 and Hive 1.2.1. For higher versions, as
long as the APIs have not changed, things should be backward compatible.

Also, the error message you see in the SQLLine session... there is a complete 
stack trace in the Drill logs. Can you share that stack trace as well?


-Original Message-
From: Shai Shapira [mailto:shai.shap...@amdocs.com] 
Sent: Wednesday, July 26, 2017 5:50 AM
To: user@drill.apache.org
Subject: RE: drill error connecting to Hbase

It is CDH 5.8.2

I believe these are reliable versions, aren't they?

Thanks,
Shai

-Original Message-
From: Kunal Khatua [mailto:kkha...@mapr.com] 
Sent: Monday, July 24, 2017 8:50 AM
To: user@drill.apache.org
Subject: RE: drill error connecting to Hbase

This means that the connectivity with ZK appears to be working. 

What are the HBase, ZK and Hadoop versions that you are working with? I presume 
that the student table is otherwise accessible.

-Original Message-
From: Shai Shapira [mailto:shai.shap...@amdocs.com] 
Sent: Sunday, July 23, 2017 2:58 AM
To: user@drill.apache.org
Cc: Shai Shapira 
Subject: RE: drill error connecting to Hbase

Hi,

I installed Drill and started to work with it; my goal is to use it to connect
to HBase.
I checked it a bit locally with CSV and JSON files, and it works great.
When I try to connect to HBase, I get an error.

It seems that it is connecting to HBase/ZK, but fails somewhere there.
The errors when trying to select from a non-existent table (stud) and when
accessing an existing table (students) are different.
For the existing table, the error is in the zookeeper.MetaTableLocator.

Any ideas?

Thanks,
Shai




illin4620 STABDB05 54 > drill
Jul 20, 2017 6:17:02 PM org.glassfish.jersey.server.ApplicationHandler 
initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.10.0
"just drill it"
0: jdbc:drill:zk=local> use hbase;
+---++
|  ok   |  summary   |
+---++
| true  | Default schema changed to [hbase]  |
+---++
1 row selected (0.895 seconds)
0: jdbc:drill:zk=local> select * from students ;
Error: SYSTEM ERROR: IllegalAccessError: tried to access method 
com.google.common.base.Stopwatch.<init>()V from class
org.apache.hadoop.hbase.zookeeper.MetaTableLocator


[Error Id: 71a4a041-4f5d-4e68-9400-78c2faeac6f9 on illin4620:31010] 
(state=,code=0)
0: jdbc:drill:zk=local> select * from stud ;
Error: DATA_READ ERROR: Failure while loading table stud in database hbase.

Message:  stud
SQL Query null

[Error Id: f0a6591d-9068-4490-95c0-b0aea41365b4 on illin4620:31010] 
(state=,code=0)


Thanks,
Shai

From: Shai Shapira
Sent: Sunday, July 23, 2017 12:49 PM
To: Shai Shapira 
Subject: drill error connecting to Hbase



Shai Shapira
*  shai.shap...@amdocs.com
* +972 9 776 4171

This message and the information contained herein is proprietary and 
confidential and subject to the Amdocs policy statement,

you may review at https://www.amdocs.com/about/email-disclaimer 





Hbase tables create are not displayed in apache drill

2017-07-26 Thread hardik lathigara
HI,

I have set up HBase in Apache Drill with reference to
https://drill.apache.org/docs/hbase-storage-plugin/ and was trying to query
the HBase tables (students and clicks) by following
https://drill.apache.org/docs/querying-hbase/, but when I run
commands like SHOW TABLES or SELECT * FROM students, they do not
return any value even though I have created the tables.

Please find attached the logs from the Threads tab (
http://localhost:8047/threads) of the Apache Drill web UI.


Re: regex replace in string

2017-07-26 Thread Paul Rogers
Hi Divya,

We found a couple of issues in CSV files that would lead to the kind of errors 
you encountered. These issues will be fixed in the upcoming Drill 1.11 release.

Sharing a sample CSV file will let us check the issue. Even better, voting is 
open for the 1.11 release. Please go ahead and download it and try your file 
with that release. Let us know if you still have a problem.

Thanks,

- Paul

> On Jul 26, 2017, at 6:14 AM, Khurram Faraaz  wrote:
> 
> Can you please share your CSV file, the SQL query, and the version of Drill
> that you are on, so someone can take a look and try to reproduce the error
> that you are seeing?
> 
> 
> Thanks,
> 
> Khurram
> 
> 
> From: Divya Gehlot 
> Sent: Wednesday, July 26, 2017 3:18:08 PM
> To: user@drill.apache.org
> Subject: regex replace in string
> 
> Hi,
> I have a CSV file where  column values are
> "This is the column,one "
> "This is column , two"
> column3
> column4
> 
> When I try to regex_replace it throws error
> 
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>> IllegalArgumentException: reallocation size must be non-negative Fragment
>> 0:0
> 
> 
> How can I read the above columns as one string, like "This is the column
> one"?
> Appreciate the help.
> 
> Thanks,
> Divyab



Re: regex replace in string

2017-07-26 Thread Khurram Faraaz
Can you please share your CSV file, the SQL query, and the version of Drill that
you are on, so someone can take a look and try to reproduce the error that you
are seeing?


Thanks,

Khurram


From: Divya Gehlot 
Sent: Wednesday, July 26, 2017 3:18:08 PM
To: user@drill.apache.org
Subject: regex replace in string

Hi,
I have a CSV file where  column values are
"This is the column,one "
"This is column , two"
column3
column4

When I try to regex_replace it throws error

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: reallocation size must be non-negative Fragment
> 0:0


How can I read the above columns as one string, like "This is the column
one"?
Appreciate the help.

Thanks,
Divyab


RE: drill error connecting to Hbase

2017-07-26 Thread Shai Shapira
It is CDH 5.8.2

I believe these are reliable versions, aren't they?

Thanks,
Shai

-Original Message-
From: Kunal Khatua [mailto:kkha...@mapr.com] 
Sent: Monday, July 24, 2017 8:50 AM
To: user@drill.apache.org
Subject: RE: drill error connecting to Hbase

This means that the connectivity with ZK appears to be working. 

What are the HBase, ZK and Hadoop versions that you are working with? I presume 
that the student table is otherwise accessible.

-Original Message-
From: Shai Shapira [mailto:shai.shap...@amdocs.com] 
Sent: Sunday, July 23, 2017 2:58 AM
To: user@drill.apache.org
Cc: Shai Shapira 
Subject: RE: drill error connecting to Hbase

Hi,

I installed Drill and started to work with it; my goal is to use it to connect
to HBase.
I checked it a bit locally with CSV and JSON files, and it works great.
When I try to connect to HBase, I get an error.

It seems that it is connecting to HBase/ZK, but fails somewhere there.
The errors when trying to select from a non-existent table (stud) and when
accessing an existing table (students) are different.
For the existing table, the error is in the zookeeper.MetaTableLocator.

Any ideas?

Thanks,
Shai




illin4620 STABDB05 54 > drill
Jul 20, 2017 6:17:02 PM org.glassfish.jersey.server.ApplicationHandler 
initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.10.0
"just drill it"
0: jdbc:drill:zk=local> use hbase;
+---++
|  ok   |  summary   |
+---++
| true  | Default schema changed to [hbase]  |
+---++
1 row selected (0.895 seconds)
0: jdbc:drill:zk=local> select * from students ;
Error: SYSTEM ERROR: IllegalAccessError: tried to access method 
com.google.common.base.Stopwatch.<init>()V from class
org.apache.hadoop.hbase.zookeeper.MetaTableLocator


[Error Id: 71a4a041-4f5d-4e68-9400-78c2faeac6f9 on illin4620:31010] 
(state=,code=0)
0: jdbc:drill:zk=local> select * from stud ;
Error: DATA_READ ERROR: Failure while loading table stud in database hbase.

Message:  stud
SQL Query null

[Error Id: f0a6591d-9068-4490-95c0-b0aea41365b4 on illin4620:31010] 
(state=,code=0)


Thanks,
Shai

From: Shai Shapira
Sent: Sunday, July 23, 2017 12:49 PM
To: Shai Shapira 
Subject: drill error connecting to Hbase



Shai Shapira
*  shai.shap...@amdocs.com
* +972 9 776 4171

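An IllegalAccessError on com.google.common.base.Stopwatch.<init>()V is the 
classic symptom of a Guava version conflict: recent Guava releases made the 
no-argument Stopwatch constructor non-public, while HBase client classes 
compiled against an older Guava still call it. A quick diagnostic sketch, not 
a fix (the jars/3rdparty path matches the standard Apache Drill layout; adjust 
DRILL_HOME for your install):

# List the Guava and HBase client jars Drill actually puts on its classpath.
ls $DRILL_HOME/jars/3rdparty | grep -Ei 'guava|hbase'

If Drill ships a newer Guava than the one the HBase client was compiled 
against, that mismatch is the first thing to reconcile.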




Re: append data to already existing table saved in parquet format

2017-07-26 Thread Divya Gehlot
The data size is not big for any single hour, but it will grow over time:
if I have two years of data arriving hourly, recreating the whole Parquet
table every time is not a feasible solution.
In Hive you would create a partition and insert the data into that
partition accordingly; I was looking for that kind of solution.
Does Drill provide that kind of functionality?

Thanks,
Divya
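For reference, the Hive-side pattern described above would look roughly like 
this (a sketch only; the table, column, and partition names are illustrative 
assumptions, not anything from this thread):

-- Hive, not Drill: append one hour of data into a single partition.
INSERT INTO TABLE events PARTITION (dt='2017-07-26', hr='09')
SELECT col1, col2 FROM staging_events;

Drill has no direct equivalent of this statement; the replies quoted below 
work around that at the directory level.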


On 26 July 2017 at 15:04, Saurabh Mahapatra 
wrote:

> I always recommend against using CTAS as a shortcut for an ETL-type large
> workload. You will need to size your Drill cluster accordingly. Consider
> using Hive or Spark instead.
>
> What are the source file formats? For every hour, what is the size and the
> number of rows for that data? Are you doing any aggregations? And what is
> the lag between the streaming data and data available for analytics that
> you are willing to tolerate?
>
> On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > I am not aware of any clean way to do this. However, if your data is
> > partitioned based on directories, then you can use the below hack, which
> > leverages temporary tables [1]. Essentially, you back up your partition
> > to a temp table, then overwrite it by taking the union of the new
> > partition data and the existing partition data. This way we are not
> > overwriting the entire table.
> >
> > create temporary table mytable_2017 (col1, col2) as select col1, col2,
> > ... from mytable where dir0 = '2017';
> > drop table `mytable/2017`;
> > create table `mytable/2017` as
> > select col1, col2, ... from new_partition_data
> > union
> > select col1, col2, ... from mytable_2017;
> > drop table mytable_2017;
> >
> > Caveat: temporary tables get dropped automatically if the session ends
> > or the drillbit crashes. In the above sequence, if the connection gets
> > dropped (there are known issues causing this) between the client and
> > drillbit after executing the "DROP" statement, then your partition data
> > is lost forever. And since Drill doesn't support transactions, the
> > mentioned approach is dangerous.
> >
> > [1] https://drill.apache.org/docs/create-temporary-table-as-cttas/
> >
> >
> > On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot 
> > wrote:
> >
> > > Hi,
> > > I am new to Apache Drill.
> > > Since I have data coming in every hour, I searched but couldn't find an
> > > insert-into-partition command in Apache Drill.
> > > How can we insert data into a particular partition without rewriting the
> > > whole data set?
> > >
> > >
> > > Appreciate the help.
> > > Thanks,
> > > Divya
> > >
> >
>
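Building on the directory-per-partition idea above, each new hour can also be 
written straight into its own subdirectory with CTAS, so only that slice is 
ever created. A sketch under assumptions (the workspace, table, and column 
names are illustrative; it relies on the same `mytable/2017`-style path trick 
quoted above and on Drill's dir0/dir1 directory columns):

use dfs.tmp;
-- Write one hour's batch into its own partition directory.
create table `events/2017/07/26/09` as
select columns[0] as col1, columns[1] as col2
from dfs.`/staging/2017072609.csv`;
-- Later queries can prune on the directory columns:
select col1, col2 from `events` where dir0 = '2017' and dir1 = '07';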


regex replace in string

2017-07-26 Thread Divya Gehlot
Hi,
I have a CSV file where column values are
"This is the column,one "
"This is column , two"
column3
column4

When I try regex_replace, it throws an error:

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: reallocation size must be non-negative Fragment
> 0:0


How can I read the above columns as one string, like "This is the column
one"?
Appreciate the help.

Thanks,
Divya
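For quoted CSV values like these, one starting point is to read the raw field 
and strip the embedded commas (a sketch only: the file path is an assumption, 
columns[0] is how Drill exposes the fields of a header-less CSV, and the 
function name follows the documented REGEXP_REPLACE rather than regex_replace):

select regexp_replace(trim(columns[0]), ',', ' ') as merged
from dfs.`/path/to/file.csv`;

Whether this avoids the reallocation error above likely depends on the Drill 
version; trying it on 1.11 as suggested earlier in the thread is worthwhile.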


Re: append data to already existing table saved in parquet format

2017-07-26 Thread Saurabh Mahapatra
I always recommend against using CTAS as a shortcut for an ETL-type large
workload. You will need to size your Drill cluster accordingly. Consider
using Hive or Spark instead.

What are the source file formats? For every hour, what is the size and the
number of rows for that data? Are you doing any aggregations? And what is
the lag between the streaming data and data available for analytics that
you are willing to tolerate?

On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> I am not aware of any clean way to do this. However, if your data is
> partitioned based on directories, then you can use the below hack, which
> leverages temporary tables [1]. Essentially, you back up your partition to a
> temp table, then overwrite it by taking the union of the new partition data
> and the existing partition data. This way we are not overwriting the entire
> table.
>
> create temporary table mytable_2017 (col1, col2) as select col1, col2,
> ... from mytable where dir0 = '2017';
> drop table `mytable/2017`;
> create table `mytable/2017` as
> select col1, col2, ... from new_partition_data
> union
> select col1, col2, ... from mytable_2017;
> drop table mytable_2017;
>
> Caveat: temporary tables get dropped automatically if the session ends or
> the drillbit crashes. In the above sequence, if the connection gets dropped
> (there are known issues causing this) between the client and drillbit after
> executing the "DROP" statement, then your partition data is lost forever.
> And since Drill doesn't support transactions, the mentioned approach is
> dangerous.
>
> [1] https://drill.apache.org/docs/create-temporary-table-as-cttas/
>
>
> On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot 
> wrote:
>
> > Hi,
> > I am new to Apache Drill.
> > Since I have data coming in every hour, I searched but couldn't find an
> > insert-into-partition command in Apache Drill.
> > How can we insert data into a particular partition without rewriting the
> > whole data set?
> >
> >
> > Appreciate the help.
> > Thanks,
> > Divya
> >
>


Re: [HANGOUT] Topics for 7/25/17

2017-07-26 Thread yuliya Feldman
Sorry for the late chime in. Just a note regarding S3: even after upgrading to 
Hadoop 2.8.x, you may need to separately update the AWS SDK version, as the one 
provided with the upgrade does not support all the newly added regions.

Thanks,
Yuliya

From: Arina Yelchiyeva 
To: d...@drill.apache.org 
Cc: user 
Sent: Tuesday, July 25, 2017 10:35 AM
Subject: Re: [HANGOUT] Topics for 7/25/17

Meeting minutes 25 July 2017:

Attendees:
Rob, Vova, Sorabh, Pritesh, Paul, Aman, Padma, Jyothsna, Sindhuri.

Two topics were discussed.
1. Release candidate for 1.11.0.
Everybody is encouraged to test the release candidate and vote.
Aman asked about release candidate performance testing.
Kunal was asked via email and confirmed that performance testing is in
progress.

2. Upgrade to Hadoop version 2.8.
Padma was looking into S3 connectivity issues and found that switching
to Hadoop version 2.8.1 would solve these problems.
However, the Hadoop release notes for 2.8.1 (and 2.8.0 as well) say the
following:
"Please note that 2.8.x release line continues to be not yet ready for
production use”.
It was decided to wait until the next stable Hadoop release (hopefully
before the Drill 1.12.0 release)
and, for now, to document that users may switch to 2.8.1 themselves.


Thank you all for attending the hangout today.

Kind regards
Arina

On Tue, Jul 25, 2017 at 8:04 PM, Arina Yelchiyeva <
arina.yelchiy...@gmail.com> wrote:

> Hangouts is starting now...
>
> On Tue, Jul 25, 2017 at 7:41 AM, Padma Penumarthy 
> wrote:
>
>> I have a topic to discuss. A lot of folks on the user mailing list raised
>> the issue of not being able to access all S3 regions using Drill.
>> We need Hadoop version 2.8 or higher to be able to connect to
>> regions which support only the Version 4 signature.
>> I tried with 2.8.1, which just got released, and it works, i.e. I am able to
>> connect to both old and new regions (by specifying the endpoint in the
>> config).
>> There are some failures in unit tests, which can be fixed.
>>
>> Fixing S3 connectivity issues is important.
>> However, the hadoop release notes for 2.8.1 (and 2.8.0 as well) say the
>> following:
>> "Please note that 2.8.x release line continues to be not yet ready for
>> production use”.
>>
>> So, should we or not move to 2.8.1 ?
>>
>> Thanks,
>> Padma
>>
>>
>> On Jul 24, 2017, at 9:46 AM, Arina Yelchiyeva wrote:
>>
>> Hi all,
>>
>> We'll have the hangout tomorrow at the usual time [1]. Any topics to be
>> discussed?
>>
>> [1] https://drill.apache.org/community-resources/
>>
>> Kind regards
>> Arina
>>
>>
>


Re: append data to already existing table saved in parquet format

2017-07-26 Thread rahul challapalli
I am not aware of any clean way to do this. However, if your data is
partitioned based on directories, then you can use the below hack, which
leverages temporary tables [1]. Essentially, you back up your partition to a
temp table, then overwrite it by taking the union of the new partition data
and the existing partition data. This way we are not overwriting the entire
table.

create temporary table mytable_2017 (col1, col2) as select col1, col2,
... from mytable where dir0 = '2017';
drop table `mytable/2017`;
create table `mytable/2017` as
select col1, col2, ... from new_partition_data
union
select col1, col2, ... from mytable_2017;
drop table mytable_2017;

Caveat: temporary tables get dropped automatically if the session ends or
the drillbit crashes. In the above sequence, if the connection gets dropped
(there are known issues causing this) between the client and drillbit after
executing the "DROP" statement, then your partition data is lost forever.
And since Drill doesn't support transactions, the mentioned approach is
dangerous.

[1] https://drill.apache.org/docs/create-temporary-table-as-cttas/


On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot 
wrote:

> Hi,
> I am new to Apache Drill.
> Since I have data coming in every hour, I searched but couldn't find an
> insert-into-partition command in Apache Drill.
> How can we insert data into a particular partition without rewriting the
> whole data set?
>
>
> Appreciate the help.
> Thanks,
> Divya
>