Re: [pyspark 2.4] broadcasting DataFrame throws error

2020-09-17 Thread Amit Joshi
Hi,

I think the problem lies with driver memory. A broadcast in Spark works by
collecting all the data to the driver, which then broadcasts it to all the
executors. (A different transfer strategy, such as BitTorrent-style
distribution, could be employed, though.)
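As a rough mental model (plain Python, no Spark involved; the data is made up), the driver collects the small side, ships it to every executor, and each partition then does a local hash lookup against it:

```python
# Toy model of a broadcast left-outer join. The small side (2 rows here,
# standing in for the ~700-row aggregate) is collected to the driver and
# shipped whole to every executor; each partition of the big side joins
# against it locally, with no shuffle of the big side.
small_side = {"a": 2, "b": 1}  # column1 -> count

def join_partition(rows):
    # left_outer semantics: keep every row of the big side, None when no match.
    return [(key, payload, small_side.get(key)) for key, payload in rows]

partition = [("a", "x"), ("c", "y")]  # one partition of the big dataframe
print(join_partition(partition))      # [('a', 'x', 2), ('c', 'y', None)]
```

The cost of the collect step is why the driver's memory, not the executors', is the first suspect when a broadcast fails.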

Please try increasing the driver memory. See if it works.
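Note that driver memory has to be set at launch time, not from inside the application, because the driver JVM is already running by then. A hypothetical spark-submit sketch (the sizes are placeholders to tune for your cluster):

```shell
# Driver memory and the collect-size cap both matter for broadcasts;
# values below are illustrative, not recommendations.
spark-submit \
  --driver-memory 16g \
  --conf spark.driver.maxResultSize=4g \
  your_job.py
```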

Regards,
Amit


On Thursday, September 17, 2020, Rishi Shah 
wrote:

> Hello All,
>
> Hope this email finds you well. I have a dataframe of size 8 TB (Parquet,
> snappy compressed); however, when I group it by a column I get a much
> smaller aggregated dataframe of about 700 rows (just two columns, key and
> count). When I use it as below to broadcast this aggregated result, it
> throws an error saying the dataframe cannot be broadcast.
>
> df_agg = df.groupBy('column1').count().cache()
> # df_agg.count()
> df_join = df.join(broadcast(df_agg), 'column1', 'left_outer')
> df_join.write.parquet('PATH')
>
> The same code works with input df size of 3TB without any modifications.
>
> Any suggestions?
>
> --
> Regards,
>
> Rishi Shah
>


Re: [DISCUSS] Spark cannot identify the problem executor

2020-09-17 Thread roseyrathod456
In Spark 2.3 with blacklisting enabled, this is a common problem when
executor A has some problem, for instance a connection issue. Tasks on
executors B and C will fail, saying they cannot read from executor A. This
makes the job fail because a task on executor B failed 4 times.
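The knobs involved can be set at submit time. A hypothetical sketch (config names are from the Spark 2.3+ docs; verify against your version): `spark.task.maxFailures` defaults to 4, which is where the "failed 4 times" limit comes from, and fetch-failure blacklisting tells Spark to blame the executor that could not serve its shuffle data (A) rather than the readers (B, C):

```shell
spark-submit \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.application.fetchFailure.enabled=true \
  your_job.py
```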



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



unsubscribe

2020-09-17 Thread Kaden Cho
unsubscribe


Re: Structured Streaming Checkpoint Error

2020-09-17 Thread German Schiavon
Hi Gabor,

Makes sense, thanks a lot!

On Thu, 17 Sep 2020 at 11:51, Gabor Somogyi 
wrote:

> Hi,
>
> Structured Streaming simply does not work when the checkpoint location is
> on S3, due to S3's eventual consistency and its lack of an atomic rename,
> which the checkpoint manager relies on.
> Please choose an HDFS-compliant filesystem and it will work like a charm.
>
> BR,
> G
>
>
> On Wed, Sep 16, 2020 at 4:12 PM German Schiavon 
> wrote:
>
>> Hi!
>>
>> I have a Structured Streaming application that reads from Kafka,
>> performs some aggregations, and writes to S3 in Parquet format.
>>
>> Everything seems to work great, except that from time to time I get a
>> checkpoint error. At first I thought it was a random error, but it has
>> happened more than 3 times already in a few days:
>>
>> Caused by: java.io.FileNotFoundException: No such file or directory:
>> s3a://xxx/xxx/validation-checkpoint/offsets/.140.11adef9a-7636-4752-9e6c-48d605a9cca5.tmp
>>
>>
>> Does this happen to anyone else?
>>
>> Thanks in advance.
>>
>> *This is the full error :*
>>
>> ERROR streaming.MicroBatchExecution: Query segmentValidation [id =
>> 14edaddf-25bb-4259-b7a2-6107907f962f, runId =
>> 0a757476-94ec-4a53-960a-91f54ce47110] terminated with error
>>
>> java.io.FileNotFoundException: No such file or directory:
>> s3a://xxx/xxx//validation-checkpoint/offsets/.140.11adef9a-7636-4752-9e6c-48d605a9cca5.tmp
>>
>> at
>> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2310)
>>
>> at
>> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2204)
>>
>> at
>> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2143)
>>
>> at org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2664)
>>
>> at
>> org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
>>
>> at
>> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
>>
>> at
>> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
>>
>> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:1032)
>>
>> at
>> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:329)
>>
>> at
>> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
>>
>> at
>> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.writeBatchToFile(HDFSMetadataLog.scala:134)
>>
>> at
>> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.$anonfun$add$3(HDFSMetadataLog.scala:120)
>>
>> at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
>> at scala.Option.getOrElse(Option.scala:189)
>>
>


Re: Structured Streaming Checkpoint Error

2020-09-17 Thread Gabor Somogyi
Hi,

Structured Streaming simply does not work when the checkpoint location is
on S3, due to S3's eventual consistency and its lack of an atomic rename,
which the checkpoint manager relies on.
Please choose an HDFS-compliant filesystem and it will work like a charm.
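Note that only the checkpoint has to live on an HDFS-compliant store; the query output itself can keep going to S3. A hypothetical spark-defaults.conf fragment (the path is a placeholder):

```
# Default all streaming checkpoints to HDFS; writeStream paths can
# still point at s3a:// for the actual data.
spark.sql.streaming.checkpointLocation  hdfs:///user/streaming/checkpoints
```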

BR,
G


On Wed, Sep 16, 2020 at 4:12 PM German Schiavon 
wrote:

> Hi!
>
> I have a Structured Streaming application that reads from Kafka, performs
> some aggregations, and writes to S3 in Parquet format.
>
> Everything seems to work great, except that from time to time I get a
> checkpoint error. At first I thought it was a random error, but it has
> happened more than 3 times already in a few days:
>
> Caused by: java.io.FileNotFoundException: No such file or directory:
> s3a://xxx/xxx/validation-checkpoint/offsets/.140.11adef9a-7636-4752-9e6c-48d605a9cca5.tmp
>
>
> Does this happen to anyone else?
>
> Thanks in advance.
>
> *This is the full error :*
>
> ERROR streaming.MicroBatchExecution: Query segmentValidation [id =
> 14edaddf-25bb-4259-b7a2-6107907f962f, runId =
> 0a757476-94ec-4a53-960a-91f54ce47110] terminated with error
>
> java.io.FileNotFoundException: No such file or directory:
> s3a://xxx/xxx//validation-checkpoint/offsets/.140.11adef9a-7636-4752-9e6c-48d605a9cca5.tmp
>
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2310)
>
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2204)
>
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2143)
>
> at org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2664)
>
> at
> org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
>
> at
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
>
> at
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
>
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:1032)
>
> at
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:329)
>
> at
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
>
> at
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.writeBatchToFile(HDFSMetadataLog.scala:134)
>
> at
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.$anonfun$add$3(HDFSMetadataLog.scala:120)
>
> at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
> at scala.Option.getOrElse(Option.scala:189)
>


Re: Spark structured streaming: periodically refresh static data frame

2020-09-17 Thread Harsh
As per the solution, if we are closing and restarting the query, then what
happens to the state which is maintained in memory? Will that be
retained?


