Re: ORC Transaction Table - Spark

2017-08-24 Thread Gopal Vijayaraghavan
> Or, is this an artifact of an incompatibility between ORC files written by 
> the Hive 2.x ORC serde not being readable by the Hive 1.x ORC serde?  
> 3. Is there a difference in the ORC file format spec. at play here?

Nope, we're still defaulting to hive-0.12 format ORC files in Hive-2.x.

We haven't changed the format compatibility in 5 years, so we're due for a 
refresh soon.
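(For reference, the writer-side ORC file version is controlled by a Hive property; as best I recall it is hive.exec.orc.write.format, with the 0.12 layout used when it is unset. Treat the property name as from memory and verify it against the HiveConf of your Hive version before relying on it.)

```
-- Hedged sketch: property name recalled from memory, verify for your Hive version.
SET hive.exec.orc.write.format=0.12;
```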

> 5. What’s the mechanism that affects Spark here?

SparkSQL has never properly supported ACID, because to do this correctly Spark 
has to grab locks on the table and heartbeat the lock, to prevent a compaction 
from removing a currently used ACID snapshot.

AFAIK, there's no code in SparkSQL to handle transactions in Hive - this is not 
related to the format, it is related to the directory structure used to 
maintain ACID snapshots, so that you can delete a row without failing queries 
in progress.

However, that's mostly an operational issue for production. Off the raw 
filesystem (i.e. not through the table), I've used SparkSQL to read the ACID 2.x 
raw data to write an acidfsck, which checks the underlying structures by reading 
them as raw data, so that I could easily run tests like "there's only 1 delete 
for each ROW__ID" while ACID 2.x was in development.

You can think of the ACID data as basically

Struct<ROW__ID>, Struct<row>

when reading it raw.
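That kind of raw-level invariant check can be mimicked in miniature. The field names below follow the general shape of Hive's ACID row schema (an operation code plus the ROW__ID triple of transaction/bucket/rowId), but treat the exact names and the delete operation code as illustrative assumptions, not the precise struct:

```python
from collections import Counter

# Hypothetical operation code for delete, for illustration only.
DELETE = 2

# Toy raw ACID records: operation code plus the ROW__ID triple.
records = [
    {"operation": 0, "originalTransaction": 1, "bucket": 0, "rowId": 0},
    {"operation": DELETE, "originalTransaction": 1, "bucket": 0, "rowId": 0},
    {"operation": DELETE, "originalTransaction": 1, "bucket": 0, "rowId": 1},
]

# acidfsck-style invariant: at most one delete per ROW__ID.
deletes = Counter(
    (r["originalTransaction"], r["bucket"], r["rowId"])
    for r in records
    if r["operation"] == DELETE
)
bad = {row_id: n for row_id, n in deletes.items() if n > 1}
print("duplicate deletes:", bad)  # {} when the invariant holds
```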

> 6. Any similar issues with Parquet format in Hive 1.x and 2.x?

Not similar - but a different set of Parquet incompatibilities are inbound, 
with parquet.writer.version=v2.

Cheers,
Gopal

RE: ORC Transaction Table - Spark

2017-08-24 Thread Larson, Kurt
Just some clarifying points please.

1. Is this the general case for all file formats?

2. Or, is this an artifact of an incompatibility between ORC files written by 
the Hive 2.x ORC serde not being readable by the Hive 1.x ORC serde?

3. Is there a difference in the ORC file format spec. at play here?

4. Or, is any incompatibility limited to the Hive ORC serde implementations in 
Hive 1.x and 2.x?

5. What's the mechanism that affects Spark here?

   a. Same ORC serdes as Hive?

   b. Similar issues in the Spark ORC serde implementation(s) as in the Hive 1.x 
ORC serde?

6. Any similar issues with Parquet format in Hive 1.x and 2.x?


From: Aviral Agarwal [mailto:aviral12...@gmail.com]
Sent: Wednesday, August 23, 2017 10:34 PM
To: user@hive.apache.org
Subject: Re: ORC Transaction Table - Spark

So, there is no way possible right now for Spark to read Hive 2.x data ?

On Thu, Aug 24, 2017 at 12:17 AM, Eugene Koifman 
<ekoif...@hortonworks.com> wrote:
This looks like you have some data written by Hive 2.x and Hive 1.x code trying 
to read it.
That is not supported.

From: Aviral Agarwal <aviral12...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, August 23, 2017 at 12:24 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: ORC Transaction Table - Spark

Hi,

Yes, it is caused by the wrong naming convention of the delta directory:

/apps/hive/warehouse/foo.db/bar/year=2017/month=5/delta_0645253_0645253_0001

How do I solve this ?

Thanks !
Aviral Agarwal



Re: ORC Transaction Table - Spark

2017-08-24 Thread Furcy Pin
As far as I know, Spark can't read Hive's transactional tables yet:
https://issues.apache.org/jira/browse/SPARK-16996






Re: ORC Transaction Table - Spark

2017-08-23 Thread Aviral Agarwal
So, there is no way possible right now for Spark to read Hive 2.x data ?



Re: ORC Transaction Table - Spark

2017-08-23 Thread Eugene Koifman
This looks like you have some data written by Hive 2.x and Hive 1.x code trying 
to read it.
That is not supported.



Re: ORC Transaction Table - Spark

2017-08-22 Thread Eugene Koifman
Could you do a recursive “ls” in the table or partition that you are trying to 
read?
Most likely you have files that don't follow the expected naming convention.

Eugene




ORC Transaction Table - Spark

2017-08-22 Thread Aviral Agarwal
Hi,

I am trying to read a Hive ORC transactional table through Spark but I am
getting the following error

Caused by: java.lang.RuntimeException: serious problem
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
.
Caused by: java.util.concurrent.ExecutionException:
java.lang.NumberFormatException: For input string: "0645253_0001"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
... 118 more

Any help would be appreciated.

Thanks and Regards,
Aviral Agarwal