[jira] [Updated] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24, Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-21797:
---
Environment: Amazon EMR

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a Parquet dataset in S3, partitioned by date (dt), with the oldest 
> dates stored in AWS Glacier to save some money. For instance, we have:
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet 
> in Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I get this exception:
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions 
> are in Glacier. I could always read each date separately, add the dt column, 
> and reduce(_ union _) at the end (see the sketch below), but that is not 
> pretty and it should not be necessary.
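> For illustration, a minimal sketch of that workaround (the list of readable 
> dates is assumed to be known up front):
> {code:java}
> // Sketch only: read each non-Glacier date directory on its own, re-add the
> // dt column (it is not discovered when a single leaf directory is read),
> // then union the per-date DataFrames.
> import org.apache.spark.sql.functions.lit
> 
> val dates = Seq("2017-07-15", "2017-07-16") // ...one entry per readable date
> val X = dates
>   .map(d => spark.read.parquet(s"s3://my-bucket/my-dataset/dt=$d")
>     .withColumn("dt", lit(d)))
>   .reduce(_ union _)
> {code}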
> Is there any tip for reading the available data in the datastore even when 
> the old data is in Glacier?
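> One possibility, sketched under the assumption that every dt= directory in 
> the wanted range exists, is to enumerate the non-Glacier partition 
> directories explicitly and set basePath, so that only those prefixes are 
> touched and dt is still derived as a partition column:
> {code:java}
> // Sketch only: pass the wanted partition directories explicitly; the
> // basePath option keeps dt available as a partition column.
> import java.time.LocalDate
> 
> val from = LocalDate.parse("2017-07-15")
> val to = LocalDate.parse("2017-08-24")
> val days = Iterator.iterate(from)(_.plusDays(1)).takeWhile(!_.isAfter(to)).toSeq
> val paths = days.map(d => s"s3://my-bucket/my-dataset/dt=$d")
> 
> val X = spark.read
>   .option("basePath", "s3://my-bucket/my-dataset/")
>   .parquet(paths: _*)
> {code}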



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-21, JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Clémençon  updated SPARK-21797:
-
Priority: Major  (was: Critical)




[jira] [Updated] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-21, JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Clémençon  updated SPARK-21797:
-
Description: 
I have a Parquet dataset in S3, partitioned by date (dt), with the oldest 
dates stored in AWS Glacier to save some money. For instance, we have:



{noformat}
s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
{noformat}

I want to read this dataset, but only the subset of dates that are not yet in 
Glacier, e.g.:


{code:java}
val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))
{code}


Unfortunately, I get this exception:


{noformat}
java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 The operation is not valid for the object's storage class (Service: Amazon S3; 
Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
{noformat}

It seems that Spark does not like partitioned datasets when some partitions 
are in Glacier. I could always read each date separately, add the dt column, 
and reduce(_ union _) at the end, but that is not pretty and it should not be 
necessary.

Is there any tip for reading the available data in the datastore even when 
the old data is in Glacier?
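
A related tip, sketched assuming the AWS SDK for Java v1 is on the classpath 
(as it is on EMR) and ignoring listing pagination for brevity: detect which 
dt= partitions are still readable before deciding which paths to load.

{code:java}
// Sketch only: list the dataset prefix and collect the dt= values whose
// objects are not archived in Glacier. A real version would page through
// the listing with listNextBatchOfObjects.
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

val s3 = AmazonS3ClientBuilder.defaultClient()
val keys = s3.listObjects("my-bucket", "my-dataset/").getObjectSummaries.asScala
val readableDates = keys
  .filterNot(_.getStorageClass == "GLACIER")
  .flatMap(o => "dt=([0-9]{4}-[0-9]{2}-[0-9]{2})".r.findFirstMatchIn(o.getKey))
  .map(_.group(1))
  .toSet
{code}

The resulting readableDates set could then be turned into explicit partition 
paths for the read.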

  was:
I have a Parquet dataset in S3, partitioned by date (dt), with the oldest 
dates stored in AWS Glacier to save some money. For instance, we have:



{noformat}
s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
{noformat}

I want to read this dataset, but only the subset of dates that are not yet in 
Glacier, e.g.:


{code:java}
val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))
{code}


Unfortunately, I get this exception:


{noformat}
java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 The operation is not valid for the object's storage class (Service: Amazon S3; 
Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
{noformat}

It seems that Spark does not like partitioned datasets when some partitions 
are in Glacier. I could always read each date separately, add the dt column, 
and reduce(_ union _) at the end, but it is ugly like hell and it should not 
be necessary.

Is there any tip for reading the available data in the datastore even when 
the old data is in Glacier?



[jira] [Updated] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-21, JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Clémençon  updated SPARK-21797:
-
Description: 
I have a Parquet dataset in S3, partitioned by date (dt), with the oldest 
dates stored in AWS Glacier to save some money. For instance, we have:



{noformat}
s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
{noformat}

I want to read this dataset, but only the subset of dates that are not yet in 
Glacier, e.g.:


{code:java}
val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))
{code}


Unfortunately, I get this exception:


{noformat}
java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 The operation is not valid for the object's storage class (Service: Amazon S3; 
Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
{noformat}

It seems that Spark does not like partitioned datasets when some partitions 
are in Glacier. I could always read each date separately, add the dt column, 
and reduce(_ union _) at the end, but it is ugly like hell and it should not 
be necessary.

Is there any tip for reading the available data in the datastore even when 
the old data is in Glacier?


