Re: [External Sender] Re: Driver pods stuck in running state indefinitely

2020-04-12 Thread Zhang Wei
I would suggest double-checking DNS resolution by logging into the
failed node and trying the ping command:

ping spark-1586333186571-driver-svc.fractal-segmentation.svc

Just my 2 cents.
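Beyond ping, the same failure mode can be confirmed in-process by running the JDK lookup the executor itself performs. A minimal sketch (the `ResolveCheck` class name is made up; the service FQDN is the one from the stack trace):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {
    // Returns true if the host name resolves to at least one address,
    // using the same JDK lookup path that threw UnknownHostException
    // in the executor log.
    static boolean resolves(String host) {
        try {
            return InetAddress.getAllByName(host).length > 0;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String svc = args.length > 0
                ? args[0]
                : "spark-1586333186571-driver-svc.fractal-segmentation.svc";
        System.out.println(svc + " resolves: " + resolves(svc));
    }
}
```

Running this from a pod on an affected node should fail exactly when ping does, which helps separate a DNS problem from a plain connectivity one.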

-- 
Cheers,
-z

On Fri, 10 Apr 2020 13:03:46 -0400
"Prudhvi Chennuru (CONT)"  wrote:

> No, there was no internal domain issue. As I mentioned, I saw this issue
> only on a few nodes in the cluster.
> 
> On Thu, Apr 9, 2020 at 10:49 PM Wei Zhang  wrote:
> 
> > Are there any internal domain name resolution issues?
> >
> > > Caused by:  java.net.UnknownHostException:
> > spark-1586333186571-driver-svc.fractal-segmentation.svc
> >
> > -z
> > 
> > From: Prudhvi Chennuru (CONT) 
> > Sent: Friday, April 10, 2020 2:44
> > To: user
> > Subject: Driver pods stuck in running state indefinitely
> >
> >
> > Hi,
> >
> >We are running Spark batch jobs on K8s.
> >Kubernetes version: 1.11.5,
> >Spark version: 2.3.2,
> >Docker version: 19.3.8
> >
> >Issue: A few driver pods are stuck in the running state indefinitely with
> > the error
> >
> >```
> >The Initial job has not accepted any resources; check your cluster UI
> > to ensure that workers are registered and have sufficient resources.
> >```
> >
> > Below is the log from the failed executor pods:
> >
> >   ```
> >Exception in thread "main"
> > java.lang.reflect.UndeclaredThrowableException
> > at
> > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1858)
> > at
> > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:63)
> > at
> > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> > at
> > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
> > at
> > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> > Caused by: org.apache.spark.SparkException: Exception thrown in
> > awaitResult:
> > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> > at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
> > at
> > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
> > at
> > org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
> > at
> > org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:63)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at
> > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
> > ... 4 more
> > Caused by: java.io.IOException: Failed to connect to
> > spark-1586333186571-driver-svc.fractal-segmentation.svc:7078
> > at
> > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
> > at
> > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
> > at
> > org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
> > at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
> > at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > at java.lang.Thread.run(Thread.java:748)
> > Caused by: java.net.UnknownHostException:
> > spark-1586333186571-driver-svc.fractal-segmentation.svc
> > at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
> > at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> > at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> > at java.net.InetAddress.getByName(InetAddress.java:1076)
> > at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
> > at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
> > at
> > io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
> > at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
> > at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
> > at
> > io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
> > at
> > io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
> > at
> > io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
> > at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
> > at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
> > at io.netty.b

Re: covid 19 Data [DISCUSSION]

2020-04-12 Thread jane thorpe
 
Thank you, sir.
I am currently developing a small OLTP web application using the Spring
Framework. Although the Spring Framework is open source, it is actually a
professional product that comes with a professional code generator at
https://start.spring.io/. The code generator is flawless and professional,
like yourself.

I am using the following Java libraries to ingest (fetch) data across the
wide area network for processing. These libraries only became available
recently (JDK 11).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

// Declare a temporary store so readers only see the list after the
// population process is complete.
List<LocationStats> newStats = new ArrayList<>();

// Create a new HTTP client (java.net.http is standard since JDK 11).
HttpClient client = HttpClient.newHttpClient();

// Build the request for the data URL using the builder pattern.
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(VIRUS_DATA_URL))
        .build();

// Send the request and read the body of the response as a String.
HttpResponse<String> httpResponse =
        client.send(request, HttpResponse.BodyHandlers.ofString());
// System.out.println(httpResponse.body());


I am also using Apache Commons CSV
(http://commons.apache.org/proper/commons-csv/user-guide.html) to process
the raw data, ready for display in the browser.
// Requires java.io.StringReader and org.apache.commons.csv.{CSVFormat, CSVRecord}.

// Read the whole CSV body.
StringReader csvBodyReader = new StringReader(httpResponse.body());

// Parse every row, treating the first row as the table header.
Iterable<CSVRecord> records =
        CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(csvBodyReader);

for (CSVRecord record : records) {
    LocationStats locationStat = new LocationStats();
    locationStat.setState(record.get("Province/State"));
    locationStat.setCountry(record.get("Country/Region"));

    // The last column holds the most recent cumulative case count.
    int latestCases = Integer.parseInt(record.get(record.size() - 1));
    locationStat.setLatestTotalCases(latestCases);

    newStats.add(locationStat);

    System.out.println(locationStat);
}
Thank you once again, sir, for clarifying WEKA and its scope of use cases.
  
jane thorpe
janethor...@aol.com
 
 
-Original Message-
From: Teemu Heikkilä 
To: jane thorpe 
CC: user 
Sent: Sun, 12 Apr 2020 22:33
Subject: Re: covid 19 Data [DISCUSSION]

Hi Jane!
The data you pointed to there is a couple of tens of MBs; I wouldn't exactly
say it's "big data", and you definitely don't need Apache Spark to process
that amount of data. I would suggest using some other tools for your
processing needs.
WEKA is a "full suite" for data analysis and visualisation, and it's probably
a good choice for the task. If you want to go lower level, as with Spark, and
you are familiar with Python, pandas could be a good library to investigate.
br,
Teemu Heikkilä

te...@emblica.com 
+358 40 0963509

Emblica ı The data engineering company
Kaisaniemenkatu 1 B
00100 Helsinki
https://emblica.com

jane thorpe  wrote on 12.4.2020 at 22.30:
 Hi,
Three weeks ago a PhD guy proposed starting a project to use Apache Spark
to help the WHO with predictive analysis using COVID-19 data.

I have located the daily updated data.
It can be found here:
https://github.com/CSSEGISandData/COVID-19.
I was wondering if Apache Spark is up to the job of handling BIG DATA of this
size, or would it be better to use WEKA.
Please discuss which product is more suitable?

 
Jane 
janethor...@aol.com




Spark interrupts S3 request backoff

2020-04-12 Thread Lian Jiang
Hi,

My Spark job failed when reading Parquet files from S3 due to a 503 Slow
Down response. According to
https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html,
I can use backoff to mitigate this issue. However, Spark seems to interrupt
the backoff sleep (see "sleep interrupted"). Is there a way (e.g. some
setting) to make Spark not interrupt the backoff? Appreciate any hints.


20/04/12 20:15:37 WARN TaskSetManager: Lost task 3347.0 in stage 155.0
(TID 128138, ip-100-101-44-35.us-west-2.compute.internal, executor
34): org.apache.spark.sql.execution.datasources.FileDownloadException:
Failed to download file path:
s3://mybucket/myprefix/part-00178-d0a0d51f-f98e-4b9d-8d00-bb3b9acd9a47-c000.snappy.parquet,
range: 0-19231, partition values: [empty row], isDataPresent: false
at 
org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:248)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:172)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow
Down; Request ID: CECE220993AE7F89; S3 Extended Request ID:
UlQe4dEuBR1YWJUthSlrbV9phyqxUNHQEw7tsJ5zu+oNIH+nGlGHfAv7EKkQRUVP8tw8x918A4Y=),
S3 Extended Request ID:
UlQe4dEuBR1YWJUthSlrbV9phyqxUNHQEw7tsJ5zu+oNIH+nGlGHfAv7EKkQRUVP8tw8x918A4Y=
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4926)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4872)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3

Re: covid 19 Data [DISCUSSION]

2020-04-12 Thread Teemu Heikkilä
Hi Jane!

The data you pointed to there is a couple of tens of MBs; I wouldn't exactly
say it's "big data", and you definitely don't need Apache Spark to process
that amount of data. I would suggest using some other tools for your
processing needs.

WEKA is a "full suite" for data analysis and visualisation, and it's probably
a good choice for the task. If you want to go lower level, as with Spark, and
you are familiar with Python, pandas could be a good library to investigate.

br,
Teemu Heikkilä

te...@emblica.com 
+358 40 0963509

Emblica ı The data engineering company
Kaisaniemenkatu 1 B
00100 Helsinki
https://emblica.com

> jane thorpe  wrote on 12.4.2020 at 22.30:
> 
> Hi,
> 
> Three weeks ago a PhD guy proposed starting a project to use Apache Spark
> to help the WHO with predictive analysis using COVID-19 data.
> 
> 
> I have located the daily updated data. 
> It can be found here 
> https://github.com/CSSEGISandData/COVID-19.
> 
> I was wondering if Apache Spark is up to the job of handling BIG DATA of
> this size, or would it be better to use WEKA.
> 
> Please discuss which product is more suitable?
> 
> 
> Jane 
> janethor...@aol.com



Re: covid 19 Data [DISCUSSION]

2020-04-12 Thread Shamshad Ansari
Does anyone know of a source for chest X-rays or CT scans of COVID-19
patients?
Thank you.
--Sam

On Sun, Apr 12, 2020 at 3:30 PM jane thorpe 
wrote:

> Hi,
>
> Three weeks ago a PhD guy proposed starting a project to use Apache Spark
> to help the WHO with predictive analysis using COVID-19 data.
>
>
> I have located the daily updated data.
> It can be found here
> https://github.com/CSSEGISandData/COVID-19.
>
> I was wondering if Apache Spark is up to the job of handling BIG DATA of
> this size, or would it be better to use WEKA.
>
> Please discuss which product is more suitable?
>
>
> Jane
> janethor...@aol.com
>


covid 19 Data [DISCUSSION]

2020-04-12 Thread jane thorpe
 Hi,
Three weeks ago a PhD guy proposed starting a project to use Apache Spark
to help the WHO with predictive analysis using COVID-19 data.

I have located the daily updated data.
It can be found here:
https://github.com/CSSEGISandData/COVID-19.
I was wondering if Apache Spark is up to the job of handling BIG DATA of this
size, or would it be better to use WEKA.
Please discuss which product is more suitable?

 
Jane 
janethor...@aol.com


COVID 19 data

2020-04-12 Thread jane thorpe
 Hi,

A PhD guy proposed to start a project for the WHO
 accumulated 

 
jane thorpe
janethor...@aol.com