Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread kant kodali
Thanks, Ryan! In this case I will have a Dataset<Row>, so is there a way to
convert a Row to a JSON string?

Thanks

On Sat, Sep 9, 2017 at 5:14 PM, Shixiong(Ryan) Zhu 
wrote:

> It's because "toJSON" doesn't support Structured Streaming. The current
> implementation will convert the Dataset to an RDD, which is not supported
> by streaming queries.
>
> On Sat, Sep 9, 2017 at 4:40 PM, kant kodali  wrote:
>
>> Yes, it is a streaming Dataset, so what is the problem with the following code?
>>
>> Dataset<String> ds = dataset.toJSON().map(() -> { /* some function that returns a string */ });
>>  StreamingQuery query = ds.writeStream().start();
>>  query.awaitTermination();
>>
>>
>> On Sat, Sep 9, 2017 at 4:20 PM, Felix Cheung 
>> wrote:
>>
>>> What is newDS?
>>> If it is a streaming Dataset/DataFrame (since you have writeStream
>>> there), then there seems to be an issue preventing toJSON from working.
>>>
>>> --
>>> *From:* kant kodali 
>>> *Sent:* Saturday, September 9, 2017 4:04:33 PM
>>> *To:* user @spark
>>> *Subject:* Queries with streaming sources must be executed with
>>> writeStream.start()
>>>
>>> Hi All,
>>>
>>> I have the following code and I am not sure what's wrong with it. Why can I
>>> not call dataset.toJSON() (which returns a Dataset<String>)? I am using Spark
>>> 2.2.0, so I am wondering if there is any workaround.
>>>
>>>  Dataset<String> ds = newDS.toJSON().map(() -> { /* some function that returns a string */ });
>>>  StreamingQuery query = ds.writeStream().start();
>>>  query.awaitTermination();
>>>
>>>
>>
>


Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread Shixiong(Ryan) Zhu
It's because "toJSON" doesn't support Structured Streaming. The current
implementation will convert the Dataset to an RDD, which is not supported
by streaming queries.
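One way to stay inside the streaming plan is to build the JSON string with the
to_json SQL function (available since Spark 2.1) instead of toJSON. A minimal
Java sketch, where the app name, master, socket source and console sink are
only placeholders for illustration:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.struct;
    import static org.apache.spark.sql.functions.to_json;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    SparkSession spark = SparkSession.builder()
            .appName("to-json-example")   // placeholder app name
            .master("local[*]")           // placeholder master
            .getOrCreate();

    // Placeholder streaming source; substitute your own readStream definition.
    Dataset<Row> newDS = spark.readStream()
            .format("socket")
            .option("host", "localhost")
            .option("port", 9999)
            .load();

    // to_json(struct(*)) serializes each row to a JSON string inside the
    // streaming query, without falling back to RDDs the way toJSON does.
    Dataset<Row> json = newDS.select(to_json(struct(col("*"))).alias("value"));

    StreamingQuery query = json.writeStream()
            .format("console")            // placeholder sink
            .start();
    query.awaitTermination();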

On Sat, Sep 9, 2017 at 4:40 PM, kant kodali  wrote:

> Yes, it is a streaming Dataset, so what is the problem with the following code?
>
> Dataset<String> ds = dataset.toJSON().map(() -> { /* some function that returns a string */ });
>  StreamingQuery query = ds.writeStream().start();
>  query.awaitTermination();
>
>
> On Sat, Sep 9, 2017 at 4:20 PM, Felix Cheung 
> wrote:
>
>> What is newDS?
>> If it is a streaming Dataset/DataFrame (since you have writeStream there),
>> then there seems to be an issue preventing toJSON from working.
>>
>> --
>> *From:* kant kodali 
>> *Sent:* Saturday, September 9, 2017 4:04:33 PM
>> *To:* user @spark
>> *Subject:* Queries with streaming sources must be executed with
>> writeStream.start()
>>
>> Hi All,
>>
>> I have the following code and I am not sure what's wrong with it. Why can I
>> not call dataset.toJSON() (which returns a Dataset<String>)? I am using Spark
>> 2.2.0, so I am wondering if there is any workaround.
>>
>>  Dataset<String> ds = newDS.toJSON().map(() -> { /* some function that returns a string */ });
>>  StreamingQuery query = ds.writeStream().start();
>>  query.awaitTermination();
>>
>>
>


Re: How to convert Row to JSON in Java?

2017-09-09 Thread kant kodali
toJSON on a Row object.

On Sat, Sep 9, 2017 at 4:18 PM, Felix Cheung 
wrote:

> toJSON on Dataset/DataFrame?
>
> --
> *From:* kant kodali 
> *Sent:* Saturday, September 9, 2017 4:15:49 PM
> *To:* user @spark
> *Subject:* How to convert Row to JSON in Java?
>
> Hi All,
>
> How to convert a Row to JSON in Java? It would be nice to have a .toJson()
> method in the Row class.
>
> Thanks,
> kant
>


Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread kant kodali
Yes, it is a streaming Dataset, so what is the problem with the following code?

Dataset<String> ds = dataset.toJSON().map(() -> { /* some function that returns a string */ });
 StreamingQuery query = ds.writeStream().start();
 query.awaitTermination();


On Sat, Sep 9, 2017 at 4:20 PM, Felix Cheung 
wrote:

> What is newDS?
> If it is a streaming Dataset/DataFrame (since you have writeStream there),
> then there seems to be an issue preventing toJSON from working.
>
> --
> *From:* kant kodali 
> *Sent:* Saturday, September 9, 2017 4:04:33 PM
> *To:* user @spark
> *Subject:* Queries with streaming sources must be executed with
> writeStream.start()
>
> Hi All,
>
> I have the following code and I am not sure what's wrong with it. Why can I
> not call dataset.toJSON() (which returns a Dataset<String>)? I am using Spark
> 2.2.0, so I am wondering if there is any workaround.
>
>  Dataset<String> ds = newDS.toJSON().map(() -> { /* some function that returns a string */ });
>  StreamingQuery query = ds.writeStream().start();
>  query.awaitTermination();
>
>


Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread Felix Cheung
What is newDS?
If it is a streaming Dataset/DataFrame (since you have writeStream there), then
there seems to be an issue preventing toJSON from working.


From: kant kodali 
Sent: Saturday, September 9, 2017 4:04:33 PM
To: user @spark
Subject: Queries with streaming sources must be executed with 
writeStream.start()

Hi All,

I have the following code and I am not sure what's wrong with it. Why can I not
call dataset.toJSON() (which returns a Dataset<String>)? I am using Spark 2.2.0,
so I am wondering if there is any workaround.


 Dataset<String> ds = newDS.toJSON().map(() -> { /* some function that returns a string */ });
 StreamingQuery query = ds.writeStream().start();
 query.awaitTermination();


Re: How to convert Row to JSON in Java?

2017-09-09 Thread Felix Cheung
toJSON on Dataset/DataFrame?
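For a plain (non-streaming) Dataset, something along these lines should work; a
rough Java sketch, with the input path as a placeholder:

    // Rough sketch (batch case): Dataset.toJSON() turns each Row into a JSON string.
    Dataset<Row> df = spark.read().json("/path/to/people.json");  // placeholder input
    Dataset<String> jsonStrings = df.toJSON();
    jsonStrings.show(false);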


From: kant kodali 
Sent: Saturday, September 9, 2017 4:15:49 PM
To: user @spark
Subject: How to convert Row to JSON in Java?

Hi All,

How to convert a Row to JSON in Java? It would be nice to have a .toJson()
method in the Row class.

Thanks,
kant


How to convert Row to JSON in Java?

2017-09-09 Thread kant kodali
Hi All,

How to convert a Row to JSON in Java? It would be nice to have a .toJson()
method in the Row class.

Thanks,
kant


Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread kant kodali
Hi All,

I have the following code and I am not sure what's wrong with it. Why can I not
call dataset.toJSON() (which returns a Dataset<String>)? I am using Spark 2.2.0,
so I am wondering if there is any workaround.

 Dataset<String> ds = newDS.toJSON().map(() -> { /* some function that returns a string */ });
 StreamingQuery query = ds.writeStream().start();
 query.awaitTermination();


[Spark Streaming] - Stopped worker throws FileNotFoundException

2017-09-09 Thread Davide.Mandrini
I am running a Spark Streaming application on a cluster composed of three
nodes, each with one worker and three executors (so a total of 9 executors).
I am using Spark standalone mode (version 2.1.1).

The application is run with a spark-submit command with the options "--deploy-mode
client" and "--conf spark.streaming.stopGracefullyOnShutdown=true". The submit
command is run from one of the nodes, let's call it node 1.
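For reference, the same setting can also be applied programmatically when the
streaming context is created; a rough Java sketch (the app name, master URL and
batch interval are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Rough sketch: the same option as on the spark-submit line, set on the
    // SparkConf. App name, master URL and batch interval are placeholders.
    SparkConf conf = new SparkConf()
            .setAppName("fault-tolerance-test")
            .setMaster("spark://node1:7077")
            .set("spark.streaming.stopGracefullyOnShutdown", "true");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));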

As a fault tolerance test I am stopping the worker on node 2 by calling the
script "stop-slave.sh".

In executor logs on node 2 I can see several errors related to a
FileNotFoundException during a shuffle operation:



I can see 4 errors of this kind on the same task in each of the 3 executors
on node 2.

In driver logs I can see:



This takes down the application, as expected: the executor reaches
"spark.task.maxFailures" on a single task, and the application is then
stopped.

I ran several tests, and all but one of them ended with the app stopped. My
guess is that the behaviour can vary depending on the precise step of the
stream processing at which I ask the worker to stop. In any case, all the other
tests failed with the same error described above.

Increasing the parameter "spark.task.maxFailures" to 8 did not help either; the
TaskSetManager simply signalled the task as failed 8 times instead of 4.

What if the worker is killed?


I also ran a different test: I killed the worker and the 3 executor processes
on node 2 with kill -9. In this case, the streaming app adapted to the
remaining resources and kept working.

In driver log we can see the driver noticing the missing executors:



Then we notice a long series of the following errors:



These errors appear in the log until the killed worker is started again.

Conclusion


Stopping a worker with the dedicated command has an unexpected behaviour: the
app should be able to cope with the missing worker, adapting to the remaining
resources and continuing to work (as it does in the kill case).

What are your observations on this issue?

Thank you, Davide



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark standalone API...

2017-09-09 Thread Davide.Mandrini
Hello, 

you might get the information you are looking for from this hidden API: 

http://<master-host>:<web-ui-port>/json/
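A quick Java sketch of hitting that endpoint, assuming the standalone master's
web UI is on its default port 8080 (host and port are placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Quick sketch: dump the standalone master's JSON status page.
    // "master-host" and port 8080 (the default web UI port) are assumptions.
    public class MasterStatus {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://master-host:8080/json/");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }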

Hope it helps, 
Davide



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPARK CSV ISSUE

2017-09-09 Thread Gourav Sengupta
Hi,

Naga has kindly suggested here that I should load the file into an RDD and get
rid of the header. But my partitions have hundreds of files in them, and
opening and processing the files with RDDs feels like a rather old way of
working. I think the Spark community has moved on from RDDs to DataFrames and
now to Datasets.

I know we still need RDDs for special cases, but being asked to use an RDD for
a CSV file just to avoid the header does not sound quite right to me.
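For what it's worth, when reading the files directly the DataFrame reader can
already drop the header; a small Java sketch, with the path and options as
placeholders:

    // Small sketch: read the partitioned CSV files with the DataFrame reader and
    // let Spark treat the first line of each file as a header to be skipped.
    // The path is a placeholder.
    Dataset<Row> df = spark.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/path/to/partitioned/csv");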



Regards,
Gourav Sengupta

On Fri, Sep 8, 2017 at 7:25 PM, Gourav Sengupta 
wrote:

> Hi,
>
> According to this thread, https://issues.apache.org/jira/browse/SPARK-11374,
> Spark will not resolve the issue of the skip-header option when the table is
> defined in Hive.
>
> But I am unable to see a Spark SQL option for setting up an external
> partitioned table.
>
> Does that mean that if I have to create an external partitioned table I must
> use Hive, and that when I use Hive, Spark does not allow me to ignore the
> headers?
>
>
> Regards,
> Gourav Sengupta
>