unsubscribe

2017-07-24 Thread Jnana Sagar
Please unsubscribe me.

-- 
regards
Jnana Sagar


Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread 周康
* If the file exists but is a directory rather than a regular file, does
* not exist but cannot be created, or cannot be opened for any other
* reason then a FileNotFoundException is thrown.

After looking into FileOutputStream, I saw this note in its Javadoc. So you
can check the executor node first (maybe no permission, no space left, or not
enough file descriptors).
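
A minimal sketch of the Javadoc behaviour quoted above, using an illustrative
path: FileOutputStream throws FileNotFoundException when the parent directory
of the target file is missing or not writable, which matches the "(No such
file or directory)" message in the stack traces below.

import java.io.{File, FileNotFoundException, FileOutputStream}

object FileOutputStreamDemo {
  def main(args: Array[String]): Unit = {
    // Illustrative path: the parent directory does not exist, so the open fails.
    val target = new File("/tmp/definitely-missing-dir/shuffle_0_0_0.index")
    try {
      val out = new FileOutputStream(target)
      out.close()
    } catch {
      case e: FileNotFoundException =>
        // Prints something like ".../shuffle_0_0_0.index (No such file or directory)"
        println(s"open failed: ${e.getMessage}")
    }
  }
}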


2017-07-25 13:05 GMT+08:00 周康 :

> You can also check whether the executor node has enough space left to store
> the shuffle files.
>
> 2017-07-25 13:01 GMT+08:00 周康 :
>
>> First, Spark handles task failures, so if the job ended normally this error
>> can be ignored.
>> Second, when using BypassMergeSortShuffleWriter, it first writes the data
>> file and then writes an index file.
>> You can check for "Failed to delete temporary index file at" or "fail to
>> rename file" in the related executor node's log file.
>>
>> 2017-07-25 0:33 GMT+08:00 Martin Peng :
>>
>>> Can anyone share some light on this issue?
>>>
>>> Thanks
>>> Martin
>>>
>>> 2017-07-21 18:58 GMT-07:00 Martin Peng :
>>>
 Hi,

 I have several Spark jobs, both batch and streaming, that process and
 analyze the system logs. We are using Kafka as the pipeline connecting the
 jobs.

 After upgrading to Spark 2.1.0 + Spark Kafka Streaming 010, I found that some
 of the jobs (both batch and streaming) throw the exceptions below randomly
 (either after several hours of running or after just 20 minutes). Can anyone
 give me some suggestions about how to figure out the real root cause?
 (The Google results do not look very useful...)

 Thanks,
 Martin

 00:30:04,510 WARN  - 17/07/22 00:30:04 WARN TaskSetManager: Lost task
 60.0 in stage 1518490.0 (TID 338070, 10.133.96.21, executor 0):
 java.io.FileNotFoundException: /mnt/mesos/work_dir/slaves/201
 60924-021501-274760970-5050-7646-S2/frameworks/40aeb8e5-e82a
 -4df9-b034-8815a7a7564b-2543/executors/0/runs/fd15c15d-2511-
 4f37-a106-27431f583153/blockmgr-a0e0e673-f88b-4d12-a802-
 c35643e6c6b2/33/shuffle_2090_60_0.index.b66235be-79be-4455-9759-1c7ba70f91f6
 (No such file or directory)
 00:30:04,510 WARN  - at java.io.FileOutputStream.open0(Native
 Method)
 00:30:04,510 WARN  - at java.io.FileOutputStream.open(
 FileOutputStream.java:270)
 00:30:04,510 WARN  - at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
 00:30:04,510 WARN  - at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
 00:30:04,510 WARN  - at org.apache.spark.shuffle.Index
 ShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlo
 ckResolver.scala:144)
 00:30:04,510 WARN  - at org.apache.spark.shuffle.sort.
 BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWri
 ter.java:128)
 00:30:04,510 WARN  - at org.apache.spark.scheduler.Shu
 ffleMapTask.runTask(ShuffleMapTask.scala:96)
 00:30:04,510 WARN  - at org.apache.spark.scheduler.Shu
 ffleMapTask.runTask(ShuffleMapTask.scala:53)
 00:30:04,510 WARN  - at org.apache.spark.scheduler.Tas
 k.run(Task.scala:99)
 00:30:04,510 WARN  - at org.apache.spark.executor.Exec
 utor$TaskRunner.run(Executor.scala:282)
 00:30:04,510 WARN  - at java.util.concurrent.ThreadPoo
 lExecutor.runWorker(ThreadPoolExecutor.java:1142)
 00:30:04,510 WARN  - at java.util.concurrent.ThreadPoo
 lExecutor$Worker.run(ThreadPoolExecutor.java:617)
 00:30:04,510 WARN  - at java.lang.Thread.run(Thread.java:748)

 00:30:04,580 INFO  - Driver stacktrace:
 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAn
 dIndependentStages(DAGScheduler.scala:1435)
 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
 Scheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
 Scheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
 00:30:04,580 INFO  - scala.collection.mutable.Resiz
 ableArray$class.foreach(ResizableArray.scala:59)
 00:30:04,580 INFO  - scala.collection.mutable.Array
 Buffer.foreach(ArrayBuffer.scala:48)
 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
 Scheduler.abortStage(DAGScheduler.scala:1422)
 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
 Scheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
 Scheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
 00:30:04,580 INFO  - scala.Option.foreach(Option.scala:257)
 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
 Scheduler.handleTaskSetFailed(DAGScheduler.scala:802)
 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
 

Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread 周康
You can also check whether the executor node has enough space left to store
the shuffle files.

2017-07-25 13:01 GMT+08:00 周康 :

> First, Spark handles task failures, so if the job ended normally this error
> can be ignored.
> Second, when using BypassMergeSortShuffleWriter, it first writes the data
> file and then writes an index file.
> You can check for "Failed to delete temporary index file at" or "fail to
> rename file" in the related executor node's log file.
>
> 2017-07-25 0:33 GMT+08:00 Martin Peng :
>
>> Can anyone share some light on this issue?
>>
>> Thanks
>> Martin
>>
>> 2017-07-21 18:58 GMT-07:00 Martin Peng :
>>
>>> Hi,
>>>
>>> I have several Spark jobs, both batch and streaming, that process and
>>> analyze the system logs. We are using Kafka as the pipeline connecting the
>>> jobs.
>>>
>>> After upgrading to Spark 2.1.0 + Spark Kafka Streaming 010, I found that
>>> some of the jobs (both batch and streaming) throw the exceptions below
>>> randomly (either after several hours of running or after just 20 minutes).
>>> Can anyone give me some suggestions about how to figure out the real root
>>> cause? (The Google results do not look very useful...)
>>>
>>> Thanks,
>>> Martin
>>>
>>> 00:30:04,510 WARN  - 17/07/22 00:30:04 WARN TaskSetManager: Lost task
>>> 60.0 in stage 1518490.0 (TID 338070, 10.133.96.21, executor 0):
>>> java.io.FileNotFoundException: /mnt/mesos/work_dir/slaves/201
>>> 60924-021501-274760970-5050-7646-S2/frameworks/40aeb8e5-e82a
>>> -4df9-b034-8815a7a7564b-2543/executors/0/runs/fd15c15d-
>>> 2511-4f37-a106-27431f583153/blockmgr-a0e0e673-f88b-4d12-
>>> a802-c35643e6c6b2/33/shuffle_2090_60_0.index.b66235be-79be-4455-9759-1c7ba70f91f6
>>> (No such file or directory)
>>> 00:30:04,510 WARN  - at java.io.FileOutputStream.open0(Native
>>> Method)
>>> 00:30:04,510 WARN  - at java.io.FileOutputStream.open(
>>> FileOutputStream.java:270)
>>> 00:30:04,510 WARN  - at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>>> 00:30:04,510 WARN  - at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>>> 00:30:04,510 WARN  - at org.apache.spark.shuffle.Index
>>> ShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlo
>>> ckResolver.scala:144)
>>> 00:30:04,510 WARN  - at org.apache.spark.shuffle.sort.
>>> BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWri
>>> ter.java:128)
>>> 00:30:04,510 WARN  - at org.apache.spark.scheduler.Shu
>>> ffleMapTask.runTask(ShuffleMapTask.scala:96)
>>> 00:30:04,510 WARN  - at org.apache.spark.scheduler.Shu
>>> ffleMapTask.runTask(ShuffleMapTask.scala:53)
>>> 00:30:04,510 WARN  - at org.apache.spark.scheduler.Tas
>>> k.run(Task.scala:99)
>>> 00:30:04,510 WARN  - at org.apache.spark.executor.Exec
>>> utor$TaskRunner.run(Executor.scala:282)
>>> 00:30:04,510 WARN  - at java.util.concurrent.ThreadPoo
>>> lExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> 00:30:04,510 WARN  - at java.util.concurrent.ThreadPoo
>>> lExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> 00:30:04,510 WARN  - at java.lang.Thread.run(Thread.java:748)
>>>
>>> 00:30:04,580 INFO  - Driver stacktrace:
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.org
>>> $apache$spark$scheduler$DAGScheduler$$failJobAn
>>> dIndependentStages(DAGScheduler.scala:1435)
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>>> Scheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>>> Scheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>> 00:30:04,580 INFO  - scala.collection.mutable.Resiz
>>> ableArray$class.foreach(ResizableArray.scala:59)
>>> 00:30:04,580 INFO  - scala.collection.mutable.Array
>>> Buffer.foreach(ArrayBuffer.scala:48)
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>>> Scheduler.abortStage(DAGScheduler.scala:1422)
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>>> Scheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>>> Scheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>> 00:30:04,580 INFO  - scala.Option.foreach(Option.scala:257)
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>>> Scheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>>> SchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>>> SchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>>> SchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>> 00:30:04,580 INFO  - org.apache.spark.util.EventLoo
>>> p$$anon$1.run(EventLoop.scala:48)
>>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>>> Scheduler.runJob(DAGScheduler.scala:628)
>>> 00:30:04,580 INFO  - org.apache.spark.SparkContext.
>>> 

Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread 周康
First, Spark handles task failures, so if the job ended normally this error
can be ignored.
Second, when using BypassMergeSortShuffleWriter, it first writes the data file
and then writes an index file.
You can check for "Failed to delete temporary index file at" or "fail to
rename file" in the related executor node's log file.

2017-07-25 0:33 GMT+08:00 Martin Peng :

> Can anyone share some light on this issue?
>
> Thanks
> Martin
>
> 2017-07-21 18:58 GMT-07:00 Martin Peng :
>
>> Hi,
>>
>> I have several Spark jobs, both batch and streaming, that process and
>> analyze the system logs. We are using Kafka as the pipeline connecting the
>> jobs.
>>
>> After upgrading to Spark 2.1.0 + Spark Kafka Streaming 010, I found that
>> some of the jobs (both batch and streaming) throw the exceptions below
>> randomly (either after several hours of running or after just 20 minutes).
>> Can anyone give me some suggestions about how to figure out the real root
>> cause? (The Google results do not look very useful...)
>>
>> Thanks,
>> Martin
>>
>> 00:30:04,510 WARN  - 17/07/22 00:30:04 WARN TaskSetManager: Lost task
>> 60.0 in stage 1518490.0 (TID 338070, 10.133.96.21, executor 0):
>> java.io.FileNotFoundException: /mnt/mesos/work_dir/slaves/201
>> 60924-021501-274760970-5050-7646-S2/frameworks/40aeb8e5-
>> e82a-4df9-b034-8815a7a7564b-2543/executors/0/runs/
>> fd15c15d-2511-4f37-a106-27431f583153/blockmgr-a0e0e673-f88b-
>> 4d12-a802-c35643e6c6b2/33/shuffle_2090_60_0.index.
>> b66235be-79be-4455-9759-1c7ba70f91f6 (No such file or directory)
>> 00:30:04,510 WARN  - at java.io.FileOutputStream.open0(Native Method)
>> 00:30:04,510 WARN  - at java.io.FileOutputStream.open(
>> FileOutputStream.java:270)
>> 00:30:04,510 WARN  - at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>> 00:30:04,510 WARN  - at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>> 00:30:04,510 WARN  - at org.apache.spark.shuffle.Index
>> ShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlo
>> ckResolver.scala:144)
>> 00:30:04,510 WARN  - at org.apache.spark.shuffle.sort.
>> BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:128)
>> 00:30:04,510 WARN  - at org.apache.spark.scheduler.Shu
>> ffleMapTask.runTask(ShuffleMapTask.scala:96)
>> 00:30:04,510 WARN  - at org.apache.spark.scheduler.Shu
>> ffleMapTask.runTask(ShuffleMapTask.scala:53)
>> 00:30:04,510 WARN  - at org.apache.spark.scheduler.Tas
>> k.run(Task.scala:99)
>> 00:30:04,510 WARN  - at org.apache.spark.executor.Exec
>> utor$TaskRunner.run(Executor.scala:282)
>> 00:30:04,510 WARN  - at java.util.concurrent.ThreadPoo
>> lExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> 00:30:04,510 WARN  - at java.util.concurrent.ThreadPoo
>> lExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> 00:30:04,510 WARN  - at java.lang.Thread.run(Thread.java:748)
>>
>> 00:30:04,580 INFO  - Driver stacktrace:
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAn
>> dIndependentStages(DAGScheduler.scala:1435)
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>> Scheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>> Scheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>> 00:30:04,580 INFO  - scala.collection.mutable.Resiz
>> ableArray$class.foreach(ResizableArray.scala:59)
>> 00:30:04,580 INFO  - scala.collection.mutable.Array
>> Buffer.foreach(ArrayBuffer.scala:48)
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>> Scheduler.abortStage(DAGScheduler.scala:1422)
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>> Scheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>> Scheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>> 00:30:04,580 INFO  - scala.Option.foreach(Option.scala:257)
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>> Scheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>> SchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>> SchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>> SchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>> 00:30:04,580 INFO  - org.apache.spark.util.EventLoo
>> p$$anon$1.run(EventLoop.scala:48)
>> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
>> Scheduler.runJob(DAGScheduler.scala:628)
>> 00:30:04,580 INFO  - org.apache.spark.SparkContext.
>> runJob(SparkContext.scala:1918)
>> 00:30:04,580 INFO  - org.apache.spark.SparkContext.
>> runJob(SparkContext.scala:1931)
>> 00:30:04,580 INFO  - org.apache.spark.SparkContext.
>> runJob(SparkContext.scala:1944)
>> 00:30:04,580 INFO  - org.apache.spark.rdd.RDD$$anon
>> 

Re: How to list only errors for a stage

2017-07-24 Thread 周康
Maybe you can click the header of the Status column in the Tasks section;
the failed tasks will then appear first.

2017-07-25 10:02 GMT+08:00 jeff saremi :

> On the Spark status UI you can click Stages in the menu and see the active
> (and completed) stages. For the active stage, you can see Succeeded/Total
> and a count of failed tasks in parentheses.
>
> I'm looking for a way to go straight to the failed tasks and list the
> errors. Currently I must go into the details of that stage, scroll down to
> the Tasks section, change the number of records per page so I can see
> everything, and click Go. There is no way to just filter the ones with
> errors.
>
> thanks
>
> jeff
>
>
>


How to list only errors for a stage

2017-07-24 Thread jeff saremi
On the Spark status UI you can click Stages in the menu and see the active
(and completed) stages. For the active stage, you can see Succeeded/Total and
a count of failed tasks in parentheses.

I'm looking for a way to go straight to the failed tasks and list the errors.
Currently I must go into the details of that stage, scroll down to the Tasks
section, change the number of records per page so I can see everything, and
click Go. There is no way to just filter the ones with errors.

thanks

jeff



Re: how to convert the binary from kafka to string please

2017-07-24 Thread 萝卜丝炒饭
Hi Armbrust,


It works well.


Thanks.


 
---Original---
From: "Michael Armbrust"
Date: 2017/7/25 04:58:44
To: "??"<1427357...@qq.com>;
Cc: "user";
Subject: Re: how to convert the binary from kafka to string please


There are end-to-end examples of using Kafka in this blog:
https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html



On Sun, Jul 23, 2017 at 7:44 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
Hi all 


I want to change the binary from Kafka to a string. Could you help me, please?


val df = ss.readStream.format("kafka").option("kafka.bootstrap.server","")
.option("subscribe","")
.load


val value = df.select("value")


value.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()




Above code outputs result like:


+-------+
|  value|
+-------+
|[61,61]|
+-------+




61 is the character 'a' received from Kafka.
I want to print [a,a] or aa.
How should I do that, please?

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
I think that's described in the link I used -
http://spark.apache.org/contributing.html.

On 25 Jul 2017 10:22 am, "萝卜丝炒饭" <1427357...@qq.com> wrote:

Another question about contributing.

After a pull request is created, what should the creator do next?
Who will close it?

---Original---
*From:* "Hyukjin Kwon"
*Date:* 2017/7/25 09:15:49
*To:* "Marcelo Vanzin";
*Cc:* "user";"萝卜丝炒饭"<1427357...@qq.com>;
*Subject:* Re: how to set the assignee in JIRA please?

I see. In any event, it sounds not required to work on an issue -
http://spark.apache.org/contributing.html .

"... There is no need to be the Assignee of the JIRA to work on it, though
you are welcome to comment that you have begun work.."

and I was just wondering out of my curiosity. It should be not a big deal
anyway.


Thanks for the details.



2017-07-25 10:09 GMT+09:00 Marcelo Vanzin :

> On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon  wrote:
> > However, I see some JIRAs are assigned to someone time to time. Were
> those
> > mistakes or would you mind if I ask when someone is assigned?
>
> I'm not sure if there are any guidelines of when to assign; since
> there has been an agreement that bugs should remain unassigned I don't
> think I've personally done it, although I have seen others do it. In
> general I'd say it's ok if there's a good justification for it (e.g.
> "this is a large change and this person who is an active contributor
> will work on it"), but in the general case should be avoided.
>
> I agree it's a little confusing, especially comparing to other
> projects, but it's how it's been done for a couple of years at least
> (or at least what I have understood).
>
>
> --
> Marcelo
>


Re: how to set the assignee in JIRA please?

2017-07-24 Thread 萝卜丝炒饭
Another question about contributing.


After a pull request is created, what should the creator do next?
Who will close it?


 
---Original---
From: "Hyukjin Kwon"
Date: 2017/7/25 09:15:49
To: "Marcelo Vanzin";
Cc: "user";"??"<1427357...@qq.com>;
Subject: Re: how to set the assignee in JIRA please?


I see. In any event, it sounds not required to work on an issue - 
http://spark.apache.org/contributing.html .
"... There is no need to be the Assignee of the JIRA to work on it, though you 
are welcome to comment that you have begun work.."

and I was just wondering out of my curiosity. It should be not a big deal 
anyway.



Thanks for the details.








2017-07-25 10:09 GMT+09:00 Marcelo Vanzin :
On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon  wrote:
 > However, I see some JIRAs are assigned to someone time to time. Were those
 > mistakes or would you mind if I ask when someone is assigned?
 
 I'm not sure if there are any guidelines of when to assign; since
 there has been an agreement that bugs should remain unassigned I don't
 think I've personally done it, although I have seen others do it. In
 general I'd say it's ok if there's a good justification for it (e.g.
 "this is a large change and this person who is an active contributor
 will work on it"), but in the general case should be avoided.
 
 I agree it's a little confusing, especially comparing to other
 projects, but it's how it's been done for a couple of years at least
 (or at least what I have understood).
 
 
 --
 Marcelo

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
I see. In any event, it sounds like it is not required in order to work on an
issue - http://spark.apache.org/contributing.html.

"... There is no need to be the Assignee of the JIRA to work on it, though
you are welcome to comment that you have begun work.."

I was just wondering out of curiosity. It should not be a big deal anyway.


Thanks for the details.



2017-07-25 10:09 GMT+09:00 Marcelo Vanzin :

> On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon  wrote:
> > However, I see some JIRAs are assigned to someone time to time. Were
> those
> > mistakes or would you mind if I ask when someone is assigned?
>
> I'm not sure if there are any guidelines of when to assign; since
> there has been an agreement that bugs should remain unassigned I don't
> think I've personally done it, although I have seen others do it. In
> general I'd say it's ok if there's a good justification for it (e.g.
> "this is a large change and this person who is an active contributor
> will work on it"), but in the general case should be avoided.
>
> I agree it's a little confusing, especially comparing to other
> projects, but it's how it's been done for a couple of years at least
> (or at least what I have understood).
>
>
> --
> Marcelo
>


Re: how to set the assignee in JIRA please?

2017-07-24 Thread 萝卜丝炒饭
Hi Vanzin and Kwon,


thanks for your help.




 
---Original---
From: "Hyukjin Kwon"
Date: 2017/7/25 09:04:44
To: "Marcelo Vanzin";
Cc: "user";"??"<1427357...@qq.com>;
Subject: Re: how to set the assignee in JIRA please?


However, I see some JIRAs are assigned to someone time to time. Were those 
mistakes or would you mind if I ask when someone is assigned?

When I started to contribute to Spark few years ago, I was confused by this and 
I am pretty sure some guys are still confused.

I do usually say something like "it is generally not in that way" too when I am 
asked but I find myself unable to explain further.





2017-07-25 9:59 GMT+09:00 Marcelo Vanzin :
We don't generally set assignees. Submit a PR on github and the PR
 will be linked on JIRA; if your PR is submitted, then the bug is
 assigned to you.
 
 On Mon, Jul 24, 2017 at 5:57 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
 > Hi all,
 > If I want to do some work about an issue registed in JIRA, how to set the
 > assignee to me please?
 >
 > thanks
 >
 >
 
 
 
 

--
 Marcelo
 
 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Marcelo Vanzin
On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon  wrote:
> However, I see some JIRAs are assigned to someone time to time. Were those
> mistakes or would you mind if I ask when someone is assigned?

I'm not sure if there are any guidelines of when to assign; since
there has been an agreement that bugs should remain unassigned I don't
think I've personally done it, although I have seen others do it. In
general I'd say it's ok if there's a good justification for it (e.g.
"this is a large change and this person who is an active contributor
will work on it"), but in the general case should be avoided.

I agree it's a little confusing, especially comparing to other
projects, but it's how it's been done for a couple of years at least
(or at least what I have understood).


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
However, I see some JIRAs assigned to someone from time to time. Were those
mistakes, or may I ask when someone does get assigned?

When I started contributing to Spark a few years ago, I was confused by this,
and I am pretty sure some people are still confused.

I usually say something like "it is generally not done that way" when I am
asked, but I find myself unable to explain it further.



2017-07-25 9:59 GMT+09:00 Marcelo Vanzin :

> We don't generally set assignees. Submit a PR on github and the PR
> will be linked on JIRA; if your PR is submitted, then the bug is
> assigned to you.
>
> On Mon, Jul 24, 2017 at 5:57 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
> > Hi all,
> > If I want to do some work about an issue registed in JIRA, how to set the
> > assignee to me please?
> >
> > thanks
> >
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: how to set the assignee in JIRA please?

2017-07-24 Thread Marcelo Vanzin
We don't generally set assignees. Submit a PR on github and the PR
will be linked on JIRA; if your PR is submitted, then the bug is
assigned to you.

On Mon, Jul 24, 2017 at 5:57 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
> Hi all,
> If I want to do some work about an issue registed in JIRA, how to set the
> assignee to me please?
>
> thanks
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



how to set the assignee in JIRA please?

2017-07-24 Thread 萝卜丝炒饭
Hi all,
If I want to do some work on an issue registered in JIRA, how do I set the
assignee to myself, please?


thanks

real world spark code

2017-07-24 Thread Adaryl Wakefield
Anybody know of publicly available GitHub repos of real world Spark 
applications written in scala?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData




Re: how to convert the binary from kafka to string please

2017-07-24 Thread Michael Armbrust
There are end-to-end examples of using Kafka in this blog:
https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html

On Sun, Jul 23, 2017 at 7:44 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all
>
> I want to change the binary from Kafka to a string. Could you help me,
> please?
>
> val df = ss.readStream.format("kafka").option("kafka.bootstrap.
> server","")
> .option("subscribe","")
> .load
>
> val value = df.select("value")
>
> value.writeStream
> .outputMode("append")
> .format("console")
> .start()
> .awaitTermination()
>
>
> Above code outputs result like:
>
> +-------+
> |  value|
> +-------+
> |[61,61]|
> +-------+
>
>
> 61 is the character 'a' received from Kafka.
> I want to print [a,a] or aa.
> How should I do that, please?
>


Re: Spark 2.0 and Oracle 12.1 error

2017-07-24 Thread Cassa L
Hi, another related question: has anyone tried transactions using Oracle JDBC
and Spark? How do you do it, given that the code will be distributed across
workers? Do I combine certain queries to make sure they don't get distributed?

Regards,
Leena

On Fri, Jul 21, 2017 at 1:50 PM, Cassa L  wrote:

> Hi Xiao,
> I am trying JSON sample table provided by Oracle 12C. It is on the website
> -https://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6371
>
> CREATE TABLE j_purchaseorder
>(id  RAW (16) NOT NULL,
> date_loaded TIMESTAMP WITH TIME ZONE,
> po_document CLOB
> CONSTRAINT ensure_json CHECK (po_document IS JSON));
>
> Data that I inserted was -
>
> { "PONumber" : 1600,
>   "Reference": "ABULL-20140421",
>   "Requestor": "Alexis Bull",
>   "User" : "ABULL",
>   "CostCenter"   : "A50",
>   "ShippingInstructions" : { "name"   : "Alexis Bull",
>  "Address": { "street"  : "200 Sporting Green",
>   "city": "South San Francisco",
>   "state"   : "CA",
>   "zipCode" : 99236,
>   "country" : "United States of 
> America" },
>  "Phone" : [ { "type" : "Office", "number" : "909-555-7307" },
>  { "type" : "Mobile", "number" : "415-555-1234" } ] },
>   "Special Instructions" : null,
>   "AllowPartialShipment" : false,
>   "LineItems": [ { "ItemNumber" : 1,
>"Part"   : { "Description" : "One Magic 
> Christmas",
> "UnitPrice"   : 19.95,
> "UPCCode" : 13131092899 },
>"Quantity"   : 9.0 },
>  { "ItemNumber" : 2,
>"Part"   : { "Description" : "Lethal 
> Weapon",
> "UnitPrice"   : 19.95,
> "UPCCode" : 85391628927 },
>"Quantity"   : 5.0 } ] }
>
>
> On Fri, Jul 21, 2017 at 10:12 AM, Xiao Li  wrote:
>
>> Could you share the schema of your Oracle table and open a JIRA?
>>
>> Thanks!
>>
>> Xiao
>>
>>
>> 2017-07-21 9:40 GMT-07:00 Cassa L :
>>
>>> I am using 2.2.0. I resolved the problem by removing SELECT * and adding
>>> column names to the SELECT statement. That works. I'm wondering why SELECT
>>> * will not work.
>>>
>>> Regards,
>>> Leena
>>>
>>> On Fri, Jul 21, 2017 at 8:21 AM, Xiao Li  wrote:
>>>
 Could you try 2.2? We fixed multiple Oracle related issues in the
 latest release.

 Thanks

 Xiao


 On Wed, 19 Jul 2017 at 11:10 PM Cassa L  wrote:

> Hi,
> I am trying to use Spark to read from Oracle (12.1) table using Spark
> 2.0. My table has JSON data.  I am getting below exception in my code. Any
> clue?
>
> >
> java.sql.SQLException: Unsupported type -101
>
> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.o
> rg$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$ge
> tCatalystType(JdbcUtils.scala:233)
> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$a
> nonfun$8.apply(JdbcUtils.scala:290)
> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$a
> nonfun$8.apply(JdbcUtils.scala:290)
> at scala.Option.getOrElse(Option.scala:121)
> at
>
> ==
> My code is very simple.
>
> SparkSession spark = SparkSession
> .builder()
> .appName("Oracle Example")
> .master("local[4]")
> .getOrCreate();
>
> final Properties connectionProperties = new Properties();
> connectionProperties.put("user", "some_user");
> connectionProperties.put("password", "some_pwd");
>
> final String dbTable = "(select * from MySampleTable)";
>
> Dataset<Row> jdbcDF = spark.read().jdbc(URL, dbTable, connectionProperties);
>
>
>>>
>>
>
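
A sketch of the workaround described above (naming columns instead of SELECT *),
assuming the j_purchaseorder table from this thread; the JDBC URL is a
placeholder and the column list is illustrative. Leaving out date_loaded keeps
the TIMESTAMP WITH TIME ZONE column, which corresponds to Oracle's type code
-101 from the "Unsupported type -101" error, out of the result set.

import java.util.Properties

import org.apache.spark.sql.SparkSession

object OracleJsonRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Oracle Example")
      .master("local[4]")
      .getOrCreate()

    val connectionProperties = new Properties()
    connectionProperties.put("user", "some_user")     // placeholder credentials
    connectionProperties.put("password", "some_pwd")

    // Listing columns explicitly (instead of SELECT *) keeps types Spark's JDBC
    // code cannot map, such as TIMESTAMP WITH TIME ZONE, out of the query.
    val dbTable = "(select id, po_document from j_purchaseorder)"

    val jdbcDF = spark.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/service",      // placeholder URL
      dbTable,
      connectionProperties)

    jdbcDF.printSchema()
  }
}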


Re: using Kudu with Spark

2017-07-24 Thread Mich Talebzadeh
Thanks, Pierce. That compilation looks very cool.

Now as always the question is what is the best fit for the job at hand and
I don't think there is a single answer.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 July 2017 at 18:51, Pierce Lamb  wrote:

> Hi Mich,
>
> I tried to compile a list of datastores that connect to Spark and provide
> a bit of context. The list may help you in your research:
>
> https://stackoverflow.com/a/39753976/3723346
>
> I'm going to add Kudu, Druid and Ampool from this thread.
>
> I'd like to point out SnappyData
>  as an option you should
> try. SnappyData provides many of the features you've discussed (columnar
> storage, replication, in-place updates etc) while also integrating the
> datastore with Spark directly. That is, there is no "connector" to go over
> for database operations; Spark and the datastore share the same JVM and
> block manager. Thus, if performance is one of your concerns, this should
> give you some of the best performance
>  in this area.
>
> Hope this helps,
>
> Pierce
>
> On Mon, Jul 24, 2017 at 10:02 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> now they are bringing up Ampool with spark for real time analytics
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 July 2017 at 11:15, Mich Talebzadeh 
>> wrote:
>>
>>> sounds like Druid can do the same?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 July 2017 at 08:38, Mich Talebzadeh 
>>> wrote:
>>>
 Yes this storage layer is something I have been investigating in my own
 lab for mixed load such as Lambda Architecture.



 It offers the convenience of columnar RDBMS (much like Sybase IQ). Kudu
 tables look like those in SQL relational databases, each with a primary key
 made up of one or more columns that enforce uniqueness and acts as an index
 for efficient updates and deletes. Data is partitioned using what is known
 as tablets that make up tables. Kudu replicates these tablets to other
 nodes for redundancy.


 As you said there are a number of options. Kudu also claims in-place
 updates that needs to be tried for its consistency.

 Cheers

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 24 July 2017 at 08:30, Jörn Franke  wrote:

> I guess you have to find out yourself with experiments. Cloudera has
> some benchmarks, but it always depends what you test, your data volume and
> what is 

Parquet error while saving in HDFS

2017-07-24 Thread unk1102
Hi, I am getting the following error and I am not sure why. It looks like a
race condition, but I don't use any threads; only the single thread that owns
the Spark context is writing to HDFS, with one parquet partition. I am using
Scala 2.10 and Spark 1.5.1. Please guide me. Thanks in advance.


java.io.IOException: The file being written is in an invalid state. Probably
caused by an error thrown previously. Current state: COLUMN
at parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:137)
at
parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:129)
at parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:173)
at
parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
at
parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
at
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:635)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-error-while-saving-in-HDFS-tp28998.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: using Kudu with Spark

2017-07-24 Thread Pierce Lamb
Hi Mich,

I tried to compile a list of datastores that connect to Spark and provide a
bit of context. The list may help you in your research:

https://stackoverflow.com/a/39753976/3723346

I'm going to add Kudu, Druid and Ampool from this thread.

I'd like to point out SnappyData
 as an option you should try.
SnappyData provides many of the features you've discussed (columnar
storage, replication, in-place updates etc) while also integrating the
datastore with Spark directly. That is, there is no "connector" to go over
for database operations; Spark and the datastore share the same JVM and
block manager. Thus, if performance is one of your concerns, this should
give you some of the best performance
 in this area.

Hope this helps,

Pierce

On Mon, Jul 24, 2017 at 10:02 AM, Mich Talebzadeh  wrote:

> now they are bringing up Ampool with spark for real time analytics
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 July 2017 at 11:15, Mich Talebzadeh 
> wrote:
>
>> sounds like Druid can do the same?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 July 2017 at 08:38, Mich Talebzadeh 
>> wrote:
>>
>>> Yes this storage layer is something I have been investigating in my own
>>> lab for mixed load such as Lambda Architecture.
>>>
>>>
>>>
>>> It offers the convenience of columnar RDBMS (much like Sybase IQ). Kudu
>>> tables look like those in SQL relational databases, each with a primary key
>>> made up of one or more columns that enforce uniqueness and acts as an index
>>> for efficient updates and deletes. Data is partitioned using what is known
>>> as tablets that make up tables. Kudu replicates these tablets to other
>>> nodes for redundancy.
>>>
>>>
>>> As you said there are a number of options. Kudu also claims in-place
>>> updates that needs to be tried for its consistency.
>>>
>>> Cheers
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 July 2017 at 08:30, Jörn Franke  wrote:
>>>
 I guess you have to find out yourself with experiments. Cloudera has
 some benchmarks, but it always depends what you test, your data volume and
 what is meant by "fast". It is also more than a file format with servers
 that communicate with each other etc.  - more complexity.
 Of course there are alternatives that you could benchmark again, such
 as Apache HAWQ (which is basically postgres on Hadoop), Apache ignite or
 depending on your analysis even Flink or Spark Streaming.

 On 24. Jul 2017, at 09:25, Mich Talebzadeh 
 wrote:

 hi,

 Has anyone had experience of using Kudu for faster analytics with Spark?

>>> How efficient is it compared to using HBase and other traditional
 storage for fast changing data please?

 Any insight will be appreciated.

 Thanks

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


Re: using Kudu with Spark

2017-07-24 Thread Mich Talebzadeh
now they are bringing up Ampool with spark for real time analytics

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 July 2017 at 11:15, Mich Talebzadeh  wrote:

> sounds like Druid can do the same?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 July 2017 at 08:38, Mich Talebzadeh 
> wrote:
>
>> Yes this storage layer is something I have been investigating in my own
>> lab for mixed load such as Lambda Architecture.
>>
>>
>>
>> It offers the convenience of columnar RDBMS (much like Sybase IQ). Kudu
>> tables look like those in SQL relational databases, each with a primary key
>> made up of one or more columns that enforce uniqueness and acts as an index
>> for efficient updates and deletes. Data is partitioned using what is known
>> as tablets that make up tables. Kudu replicates these tablets to other
>> nodes for redundancy.
>>
>>
>> As you said there are a number of options. Kudu also claims in-place
>> updates that needs to be tried for its consistency.
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 July 2017 at 08:30, Jörn Franke  wrote:
>>
>>> I guess you have to find out yourself with experiments. Cloudera has
>>> some benchmarks, but it always depends what you test, your data volume and
>>> what is meant by "fast". It is also more than a file format with servers
>>> that communicate with each other etc.  - more complexity.
>>> Of course there are alternatives that you could benchmark again, such as
>>> Apache HAWQ (which is basically postgres on Hadoop), Apache ignite or
>>> depending on your analysis even Flink or Spark Streaming.
>>>
>>> On 24. Jul 2017, at 09:25, Mich Talebzadeh 
>>> wrote:
>>>
>>> hi,
>>>
>>> Has anyone had experience of using Kudu for faster analytics with Spark?
>>>
>>> How efficient is it compared to using HBase and other traditional
>>> storage for fast changing data please?
>>>
>>> Any insight will be appreciated.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>
>


Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread Martin Peng
Can anyone share some light on this issue?

Thanks
Martin

2017-07-21 18:58 GMT-07:00 Martin Peng :

> Hi,
>
> I have several Spark jobs, both batch and streaming, that process and
> analyze the system logs. We are using Kafka as the pipeline connecting the
> jobs.
>
> After upgrading to Spark 2.1.0 + Spark Kafka Streaming 010, I found that
> some of the jobs (both batch and streaming) throw the exceptions below
> randomly (either after several hours of running or after just 20 minutes).
> Can anyone give me some suggestions about how to figure out the real root
> cause? (The Google results do not look very useful...)
>
> Thanks,
> Martin
>
> 00:30:04,510 WARN  - 17/07/22 00:30:04 WARN TaskSetManager: Lost task 60.0
> in stage 1518490.0 (TID 338070, 10.133.96.21, executor 0):
> java.io.FileNotFoundException: /mnt/mesos/work_dir/slaves/
> 20160924-021501-274760970-5050-7646-S2/frameworks/40aeb8e5-e82a-4df9-b034-
> 8815a7a7564b-2543/executors/0/runs/fd15c15d-2511-4f37-a106-
> 27431f583153/blockmgr-a0e0e673-f88b-4d12-a802-
> c35643e6c6b2/33/shuffle_2090_60_0.index.b66235be-79be-4455-9759-1c7ba70f91f6
> (No such file or directory)
> 00:30:04,510 WARN  - at java.io.FileOutputStream.open0(Native Method)
> 00:30:04,510 WARN  - at java.io.FileOutputStream.open(
> FileOutputStream.java:270)
> 00:30:04,510 WARN  - at java.io.FileOutputStream.<
> init>(FileOutputStream.java:213)
> 00:30:04,510 WARN  - at java.io.FileOutputStream.<
> init>(FileOutputStream.java:162)
> 00:30:04,510 WARN  - at org.apache.spark.shuffle.
> IndexShuffleBlockResolver.writeIndexFileAndCommit(
> IndexShuffleBlockResolver.scala:144)
> 00:30:04,510 WARN  - at org.apache.spark.shuffle.sort.
> BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:128)
> 00:30:04,510 WARN  - at org.apache.spark.scheduler.
> ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> 00:30:04,510 WARN  - at org.apache.spark.scheduler.
> ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> 00:30:04,510 WARN  - at org.apache.spark.scheduler.
> Task.run(Task.scala:99)
> 00:30:04,510 WARN  - at org.apache.spark.executor.
> Executor$TaskRunner.run(Executor.scala:282)
> 00:30:04,510 WARN  - at java.util.concurrent.
> ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 00:30:04,510 WARN  - at java.util.concurrent.
> ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 00:30:04,510 WARN  - at java.lang.Thread.run(Thread.java:748)
>
> 00:30:04,580 INFO  - Driver stacktrace:
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(
> DAGScheduler.scala:1435)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler$$anonfun$
> abortStage$1.apply(DAGScheduler.scala:1423)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler$$anonfun$
> abortStage$1.apply(DAGScheduler.scala:1422)
> 00:30:04,580 INFO  - scala.collection.mutable.
> ResizableArray$class.foreach(ResizableArray.scala:59)
> 00:30:04,580 INFO  - scala.collection.mutable.ArrayBuffer.foreach(
> ArrayBuffer.scala:48)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.abortStage(
> DAGScheduler.scala:1422)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler$$anonfun$
> handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler$$anonfun$
> handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> 00:30:04,580 INFO  - scala.Option.foreach(Option.scala:257)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.
> handleTaskSetFailed(DAGScheduler.scala:802)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.
> DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.
> DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.
> DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> 00:30:04,580 INFO  - org.apache.spark.util.EventLoop$$anon$1.run(
> EventLoop.scala:48)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.runJob(
> DAGScheduler.scala:628)
> 00:30:04,580 INFO  - org.apache.spark.SparkContext.
> runJob(SparkContext.scala:1918)
> 00:30:04,580 INFO  - org.apache.spark.SparkContext.
> runJob(SparkContext.scala:1931)
> 00:30:04,580 INFO  - org.apache.spark.SparkContext.
> runJob(SparkContext.scala:1944)
> 00:30:04,580 INFO  - org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.
> scala:1353)
> 00:30:04,580 INFO  - org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:151)
> 00:30:04,580 INFO  - org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:112)
> 00:30:04,580 INFO  - org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> 00:30:04,580 INFO  - org.apache.spark.rdd.RDD.take(RDD.scala:1326)
> 00:30:04,580 INFO  - org.apache.spark.rdd.RDD$$
> 

Re: how to convert the binary from kafka to string please

2017-07-24 Thread 萝卜丝炒饭
Hi Tamás,


Would you please write some sample code?
I checked the select method, but I do not know how to cast it or how to set
value.deserializer.


Thanks


 
---Original---
From: "Szuromi Tam??s"
Date: 2017/7/24 16:32:52
To: "??"<1427357...@qq.com>;
Cc: "user";
Subject: Re: how to convert the binary from kafka to string please


Hi, 

You can cast it to string in a select or you can set the value.deserializer 
parameter for kafka.


cheers,


2017-07-24 4:44 GMT+02:00 萝卜丝炒饭 <1427357...@qq.com>:
Hi all 


I want to change the binary from Kafka to a string. Could you help me, please?


val df = ss.readStream.format("kafka").option("kafka.bootstrap.server","")
.option("subscribe","")
.load


val value = df.select("value")


value.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()




Above code outputs result like:


+-------+
|  value|
+-------+
|[61,61]|
+-------+




61 is the character 'a' received from Kafka.
I want to print [a,a] or aa.
How should I do that, please?
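
A minimal sketch of the "cast it to a string in a select" suggestion above,
assuming the Spark 2.1 Structured Streaming Kafka source; the broker address
and topic name are placeholders.

import org.apache.spark.sql.SparkSession

object KafkaValueAsString {
  def main(args: Array[String]): Unit = {
    val ss = SparkSession.builder().appName("kafka-to-string").getOrCreate()

    val df = ss.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // note the plural "servers"
      .option("subscribe", "mytopic")
      .load()

    // key and value arrive as binary columns; cast them before printing
    val strings = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    strings.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}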

Conflict resolution for data in spark streaming

2017-07-24 Thread Biplob Biswas
Hi,

I have a situation where updates are coming from 2 different data sources,
and at times this data arrives in the same batch, as defined by the streaming
context duration parameter of 500 ms (recommended for Spark according to the
documentation).

That by itself is not the problem. The problem is that when the data is
partitioned across different executors, it is not processed in the order in
which it originally arrived. I know this because the event that arrives last
should be the one used for the updated state. This race condition exists and
is not consistent.

Does anyone have an idea how to fix this? I am not sure whether anyone has
faced this kind of issue before and fixed something like it.

Thanks & Regards
Biplob Biswas
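
A minimal sketch of one way to make the result independent of processing order
within a batch, assuming each event carries an event timestamp and that "last
write wins" is decided by that timestamp rather than by arrival order; the
Event type is hypothetical.

import org.apache.spark.rdd.RDD

object LatestEventPerKey {
  // Hypothetical event shape: key, payload, and the event time set at the source.
  case class Event(key: String, payload: String, eventTimeMillis: Long)

  // Reduce each micro-batch to the newest event per key before applying the
  // state update, so the order in which partitions are processed no longer matters.
  def latestPerKey(batch: RDD[Event]): RDD[(String, Event)] =
    batch
      .map(e => (e.key, e))
      .reduceByKey((a, b) => if (a.eventTimeMillis >= b.eventTimeMillis) a else b)
}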


NullPointer when collecting a dataset grouped a column

2017-07-24 Thread Aseem Bansal
I was doing this

dataset.groupBy("column").collectAsList()


It worked for a small dataset, but for a bigger dataset I got a
NullPointerException that goes down into Spark's code. Is this known behaviour?

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1505)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1493)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1492)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1492)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
at scala.Option.foreach(Option.scala:257)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1720)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:629)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
at
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
at
org.apache.spark.sql.Dataset$$anonfun$collectAsList$1$$anonfun$apply$11.apply(Dataset.scala:2364)
at
org.apache.spark.sql.Dataset$$anonfun$collectAsList$1$$anonfun$apply$11.apply(Dataset.scala:2363)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at
org.apache.spark.sql.Dataset$$anonfun$collectAsList$1.apply(Dataset.scala:2363)
at
org.apache.spark.sql.Dataset$$anonfun$collectAsList$1.apply(Dataset.scala:2362)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2778)
at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2362)
at com.companyname.Main(Main.java:151)
... 7 more
Caused by: java.lang.NullPointerException
at sun.reflect.GeneratedMethodAccessor58.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1113)
at
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1113)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1113)
at
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:107)
at
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
at
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)

Re: Is there a difference between these aggregations

2017-07-24 Thread Aseem Bansal
Any difference between using agg or select to do the aggregations?
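
A small check sketch: for a global aggregate with no groupBy, both forms return
a one-row result, and their physical plans can be compared with explain(). Here
ds stands for an existing DataFrame with a numeric column "mycol"; Dataset.agg
is documented as shorthand for groupBy().agg(...).

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg

object AggVsSelect {
  // ds is assumed to be an existing DataFrame with a numeric column "mycol".
  def compareForms(ds: DataFrame): Unit = {
    val bySelect = ds.select(avg("mycol"))
    val byAgg    = ds.agg(avg("mycol")) // shorthand for ds.groupBy().agg(avg("mycol"))

    bySelect.explain() // compare the physical plans
    byAgg.explain()
  }
}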

On Mon, Jul 24, 2017 at 5:08 PM, yohann jardin 
wrote:

> Seen directly in the code:
>
>
>   /**
>* Aggregate function: returns the average of the values in a group.
>* Alias for avg.
>*
>* @group agg_funcs
>* @since 1.4.0
>*/
>   def mean(e: Column): Column = avg(e)
>
>
> That's the same when the argument is the column name.
>
> So no difference between mean and avg functions.
>
>
> --
> *From:* Aseem Bansal 
> *Sent:* Monday, 24 July 2017 13:34
> *To:* user
> *Subject:* Is there a difference between these aggregations
>
> If I want to aggregate mean and subtract from my column I can do either of
> the following in Spark 2.1.0 Java API. Is there any difference between
> these? Couldn't find anything from reading the docs.
>
> dataset.select(mean("mycol"))
> dataset.agg(mean("mycol"))
>
> dataset.select(avg("mycol"))
> dataset.agg(avg("mycol"))
>
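
A minimal sketch for comparison (not from the original thread; a local SparkSession is assumed): for a global aggregate, select and agg with the same aggregate expression return the same single-row result, and mean is simply an alias for avg.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, mean}

val spark = SparkSession.builder().appName("agg-vs-select-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val dataset = Seq(1.0, 2.0, 3.0).toDF("mycol")

// agg is shorthand for groupBy().agg(...); with no grouping columns, every
// form below computes one global average (2.0) over the whole Dataset.
dataset.select(mean("mycol")).show()
dataset.agg(mean("mycol")).show()
dataset.select(avg("mycol")).show()
dataset.agg(avg("mycol")).show()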


RE: Is there a difference between these aggregations

2017-07-24 Thread yohann jardin
Seen directly in the code:


  /**
   * Aggregate function: returns the average of the values in a group.
   * Alias for avg.
   *
   * @group agg_funcs
   * @since 1.4.0
   */
  def mean(e: Column): Column = avg(e)



That's the same when the argument is the column name.

So no difference between mean and avg functions.



From: Aseem Bansal 
Sent: Monday, 24 July 2017 13:34
To: user
Subject: Is there a difference between these aggregations

If I want to aggregate mean and subtract from my column I can do either of the 
following in Spark 2.1.0 Java API. Is there any difference between these? 
Couldn't find anything from reading the docs.

dataset.select(mean("mycol"))
dataset.agg(mean("mycol"))

dataset.select(avg("mycol"))
dataset.agg(avg("mycol"))


Is there a difference between these aggregations

2017-07-24 Thread Aseem Bansal
If I want to aggregate mean and subtract from my column I can do either of
the following in Spark 2.1.0 Java API. Is there any difference between
these? Couldn't find anything from reading the docs.

dataset.select(mean("mycol"))
dataset.agg(mean("mycol"))

dataset.select(avg("mycol"))
dataset.agg(avg("mycol"))


Re: Complex JSON Handling in Spark 2.1

2017-07-24 Thread Patrick
To avoid confusion, the query I am referring to above is over some numeric
element inside *a: struct (nullable = true)*.

On Mon, Jul 24, 2017 at 4:04 PM, Patrick  wrote:

> Hi,
>
> On reading a complex JSON, Spark infers schema as following:
>
> root
>  |-- header: struct (nullable = true)
>  ||-- deviceId: string (nullable = true)
>  ||-- sessionId: string (nullable = true)
>  |-- payload: struct (nullable = true)
>  ||-- deviceObjects: array (nullable = true)
>  |||-- element: struct (containsNull = true)
>  ||||-- additionalPayload: array (nullable = true)
>  |||||-- element: struct (containsNull = true)
>  ||||||-- data: struct (nullable = true)
>  |||||||-- *a: struct (nullable = true)*
>  ||||||||-- address: string (nullable = true)
>
> When we save the above Json in parquet using Spark sql we get only two top
> level columns "header" and "payload" in parquet.
>
> So now we want to do a mean calculation on element  *a: struct (nullable
> = true)*
>
> With reference to the Databricks blog for handling complex JSON
> https://databricks.com/blog/2017/02/23/working-complex-
> data-formats-structured-streaming-apache-spark-2-1.html
>
> *"when using Parquet, all struct columns will receive the same treatment
> as top-level columns. Therefore, if you have filters on a nested field, you
> will get the same benefits as a top-level column."*
>
> Referring to the above statement, will parquet treat *a: struct (nullable
> = true)* as top-level column struct and SQL query on the Dataset will be
> optimized?
>
> If not, do we need to externally impose the schema by exploding the
> complex type before writing to Parquet in order to get the top-level
> column benefit? What can we do with Spark 2.1 to extract the best
> performance over such a nested structure like *a: struct (nullable = true)*?
>
> Thanks
>
>
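
A minimal sketch of one way to run such a query (not from the original thread; the input path and the numeric field under a are hypothetical placeholders, and a local SparkSession is assumed): dot paths reach through structs, while the array levels need explode before aggregating.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, mean}

val spark = SparkSession.builder().appName("nested-json-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.read.json("/path/to/input.json")   // hypothetical input path

// Explode the two array levels, then reach into the nested structs with dot paths.
val flattened = df
  .select(explode($"payload.deviceObjects").as("obj"))
  .select(explode($"obj.additionalPayload").as("ap"))
  .select($"ap.data.a.someNumericField".as("value"))   // hypothetical numeric field under a

flattened.agg(mean($"value")).show()

// Writing the flattened projection to Parquet makes "value" a plain top-level
// column, so later filters and aggregates on it only read that column.
flattened.write.parquet("/path/to/flattened.parquet")  // hypothetical output path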


Complex JSON Handling in Spark 2.1

2017-07-24 Thread Patrick
Hi,

On reading a complex JSON, Spark infers schema as following:

root
 |-- header: struct (nullable = true)
 ||-- deviceId: string (nullable = true)
 ||-- sessionId: string (nullable = true)
 |-- payload: struct (nullable = true)
 ||-- deviceObjects: array (nullable = true)
 |||-- element: struct (containsNull = true)
 ||||-- additionalPayload: array (nullable = true)
 |||||-- element: struct (containsNull = true)
 ||||||-- data: struct (nullable = true)
 |||||||-- *a: struct (nullable = true)*
 ||||||||-- address: string (nullable = true)

When we save the above Json in parquet using Spark sql we get only two top
level columns "header" and "payload" in parquet.

So now we want to do a mean calculation on element  *a: struct (nullable =
true)*

With reference to the Databricks blog for handling complex JSON
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html

*"when using Parquet, all struct columns will receive the same treatment as
top-level columns. Therefore, if you have filters on a nested field, you
will get the same benefits as a top-level column."*

Referring to the above statement, will parquet treat *a: struct (nullable =
true)* as top-level column struct and SQL query on the Dataset will be
optimized?

If not, do we need to externally impose the schema by exploding the complex
type before writing to Parquet in order to get the top-level column benefit?
What can we do with Spark 2.1 to extract the best performance over such a
nested structure like *a: struct (nullable = true)*?

Thanks


Re: using Kudu with Spark

2017-07-24 Thread Mich Talebzadeh
sounds like Druid can do the same?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 July 2017 at 08:38, Mich Talebzadeh  wrote:

> Yes this storage layer is something I have been investigating in my own
> lab for mixed load such as Lambda Architecture.
>
>
>
> It offers the convenience of a columnar RDBMS (much like Sybase IQ). Kudu
> tables look like those in SQL relational databases, each with a primary key
> made up of one or more columns that enforce uniqueness and act as an index
> for efficient updates and deletes. Data is partitioned using what is known
> as tablets that make up tables. Kudu replicates these tablets to other
> nodes for redundancy.
>
>
> As you said, there are a number of options. Kudu also claims in-place
> updates, which need to be tried to verify their consistency.
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 July 2017 at 08:30, Jörn Franke  wrote:
>
>> I guess you have to find out yourself with experiments. Cloudera has some
>> benchmarks, but it always depends on what you test, your data volume and what
>> is meant by "fast". It is also more than a file format, with servers that
>> communicate with each other etc. - more complexity.
>> Of course there are alternatives that you could benchmark against, such as
>> Apache HAWQ (which is basically Postgres on Hadoop), Apache Ignite or,
>> depending on your analysis, even Flink or Spark Streaming.
>>
>> On 24. Jul 2017, at 09:25, Mich Talebzadeh 
>> wrote:
>>
>> hi,
>>
>> Has anyone had experience of using Kudu for faster analytics with Spark?
>>
>> How efficient is it compared to using HBase and other traditional storage
>> for fast-changing data, please?
>>
>> Any insight will be appreciated.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>
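
For anyone wanting to try it, a minimal read-side sketch with the kudu-spark connector (not from the original thread; the master address, table and column names are placeholders, and the data source string and options should be checked against the Kudu release in use):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("kudu-read-sketch").master("local[*]").getOrCreate()

// Load a Kudu table as a DataFrame via the kudu-spark data source.
val df = spark.read
  .options(Map(
    "kudu.master" -> "kudu-master:7051",   // placeholder master address
    "kudu.table"  -> "my_table"))          // placeholder table name
  .format("org.apache.kudu.spark.kudu")
  .load()

// Predicates on key columns can be pushed down to Kudu rather than filtered in Spark.
df.filter(col("id") > 100).show()          // placeholder column name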


Union large number of DataFrames

2017-07-24 Thread julio . cesare

Hi there !

Let's imagine I have a large number of very small dataframe with the 
same schema ( a list of DataFrames : allDFs)

and I want to create one large dataset with this.

I have been trying this :
-> allDFs.reduce ( (a,b) => a.union(b) )

And after this one :
-> allDFs.reduce ( (a,b) => a.union(b).repartition(200) )
to prevent df with large number of partitions


Two questions :
1) Will the reduce operation be done in parallel in the previous code ? 
or may be should I replace my reduce by allDFs.par.reduce ?

2) Is there a better way to concatenate them ?


Thanks !
Julio
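
Not a definitive answer, but a sketch of the two approaches for comparison (a local SparkSession is assumed and the tiny DataFrames stand in for allDFs). reduce only builds a lazy logical plan on the driver, so a .par reduce mostly parallelizes plan construction rather than the union itself; a very deep union tree can be avoided by unioning once at the RDD level and reapplying the shared schema.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("union-many-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for a large list of small DataFrames sharing one schema.
val allDFs: Seq[DataFrame] = (1 to 100).map(i => Seq(i).toDF("id"))

// 1) Pairwise union via reduce: lazy, but produces a deeply nested plan.
val combined = allDFs.reduce(_ union _)

// 2) Union once at the RDD level, keeping the plan shallow.
val viaRdds: DataFrame = spark.createDataFrame(
  spark.sparkContext.union(allDFs.map(_.rdd)),
  allDFs.head.schema)

println(combined.count() + " vs " + viaRdds.count())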




Re: how to convert the binary from Kafka to string please

2017-07-24 Thread Szuromi Tamás
Hi,

You can cast it to a string in a select, or you can set the value.deserializer
parameter for Kafka.

cheers,

2017-07-24 4:44 GMT+02:00 萝卜丝炒饭 <1427357...@qq.com>:

> Hi all
>
> I want to change the binary from Kafka to a string. Would you help me,
> please?
>
> val df = ss.readStream.format("kafka")
> .option("kafka.bootstrap.servers","")
> .option("subscribe","")
> .load
>
> val value = df.select("value")
>
> value.writeStream
> .outputMode("append")
> .format("console")
> .start()
> .awaitTermination()
>
>
> The above code outputs a result like:
>
> +-------+
> |  value|
> +-------+
> |[61,61]|
> +-------+
>
>
> 61 is the character 'a' received from Kafka.
> I want to print [a,a] or aa.
> How should I do this, please?
>
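
A minimal sketch of the cast approach mentioned above (bootstrap servers and topic are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-cast-sketch").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")   // placeholder
  .option("subscribe", "mytopic")                    // placeholder
  .load()

// Kafka's key/value columns arrive as binary; cast value to STRING in a select.
val value = df.selectExpr("CAST(value AS STRING) AS value")

value.writeStream
  .outputMode("append")
  .format("console")
  .start()
  .awaitTermination()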


Re: using Kudu with Spark

2017-07-24 Thread Mich Talebzadeh
Yes this storage layer is something I have been investigating in my own lab
for mixed load such as Lambda Architecture.



It offers the convenience of a columnar RDBMS (much like Sybase IQ). Kudu
tables look like those in SQL relational databases, each with a primary key
made up of one or more columns that enforce uniqueness and act as an index
for efficient updates and deletes. Data is partitioned using what is known
as tablets that make up tables. Kudu replicates these tablets to other
nodes for redundancy.


As you said, there are a number of options. Kudu also claims in-place
updates, which need to be tried to verify their consistency.

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 July 2017 at 08:30, Jörn Franke  wrote:

> I guess you have to find out yourself with experiments. Cloudera has some
> benchmarks, but it always depends on what you test, your data volume and what
> is meant by "fast". It is also more than a file format, with servers that
> communicate with each other etc. - more complexity.
> Of course there are alternatives that you could benchmark against, such as
> Apache HAWQ (which is basically Postgres on Hadoop), Apache Ignite or,
> depending on your analysis, even Flink or Spark Streaming.
>
> On 24. Jul 2017, at 09:25, Mich Talebzadeh 
> wrote:
>
> hi,
>
> Has anyone had experience of using Kudu for faster analytics with Spark?
>
> How efficient is it compared to using HBase and other traditional storage
> for fast-changing data, please?
>
> Any insight will be appreciated.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>


Re: using Kudu with Spark

2017-07-24 Thread Jörn Franke
I guess you have to find out yourself with experiments. Cloudera has some
benchmarks, but it always depends on what you test, your data volume and what is
meant by "fast". It is also more than a file format, with servers that
communicate with each other etc. - more complexity.
Of course there are alternatives that you could benchmark against, such as Apache
HAWQ (which is basically Postgres on Hadoop), Apache Ignite or, depending on
your analysis, even Flink or Spark Streaming.

> On 24. Jul 2017, at 09:25, Mich Talebzadeh  wrote:
> 
> hi,
> 
> Has anyone had experience of using Kudu for faster analytics with Spark?
> 
> How efficient is it compared to using HBase and other traditional storage for
> fast-changing data, please?
> 
> Any insight will be appreciated.
> 
> Thanks
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


using Kudu with Spark

2017-07-24 Thread Mich Talebzadeh
hi,

Has anyone had experience of using Kudu for faster analytics with Spark?

How efficient is it compared to using HBase and other traditional storage
for fast-changing data, please?

Any insight will be appreciated.

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: custom joins on dataframe

2017-07-24 Thread Jörn Franke
It might be faster if you add a column with the hash result to the dataframe
before the join and then simply do a normal join on that column.

> On 22. Jul 2017, at 17:39, Stephen Fletcher  
> wrote:
> 
> Normally a family of joins (left, right outer, inner) is performed on two
> dataframes using columns for the comparison, i.e. left("acol") === right("acol").
> The comparison operator of the "left" dataframe does something internally
> and produces a column that I assume is used by the join.
> 
> What I want is to create my own comparison operation (I have a case where I
> want to use some fuzzy matching between rows, and if they fall within some
> threshold we allow the join to happen).
> 
> so it would look something like
> 
> left.join(right, my_fuzzy_udf (left("cola"),right("cola")))
> 
> where my_fuzzy_udf is my defined UDF. My main concern is the column that
> would have to be output: what would its value be, i.e. what would the function
> need to return that the UDF subsystem would then turn into a column to be
> evaluated by the join?
> 
> 
> Thanks in advance for any advice




Re: How to configure spark with java

2017-07-24 Thread Patrik Medvedev
What exactly do you need?
Basically, you need to add the Spark libraries to your pom.

Mon, 24 Jul 2017 at 6:22, amit kumar singh :

> Hello everyone
>
> I want to use spark with java API
>
> Please let me know how can I configure it
>
>
> Thanks
> A
>
>
>


Re: Is there a way to run Spark SQL through REST?

2017-07-24 Thread Sumedh Wale

  
  
Yes, using the new Spark structured streaming you can keep submitting
streaming jobs against the same SparkContext in different requests (or you
can create a new SparkContext if required in a request). The SparkJob
implementation will get a handle of the SparkContext, which will be the
existing one or a new one depending on the REST API calls -- see its GitHub
page for details on transient vs persistent SparkContexts.

With the old Spark streaming model, you cannot add new DStreams once the
StreamingContext has started (which has been a limitation of the old
streaming model), so you can submit against the same context but only until
one last job starts the StreamingContext.

regards
sumedh

On Monday 24 July 2017 06:09 AM, kant kodali wrote:

> @Sumedh Can I run streaming jobs on the same context with spark-jobserver?
> So there is no waiting for results, since the Spark SQL job is expected to
> stream forever and the results of each streaming job are captured through
> a message queue.
>
> In my case each Spark SQL query will be a streaming job.
>
> On Sat, Jul 22, 2017 at 6:19 AM, Sumedh Wale  wrote:
>
>> On Saturday 22 July 2017 01:31 PM, kant kodali wrote:
>>
>>> Is there a way to run Spark SQL through REST?
>>
>> There is spark-jobserver (https://github.com/spark-jobserver/spark-jobserver).
>> It does more than just a REST API (like long running SparkContexts).
>>
>> regards
>>
>> --
>> Sumedh Wale
>> SnappyData (http://www.snappydata.io)

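
For concreteness, a rough sketch of a job written against the classic spark-jobserver SparkJob API (the word-count body and the "input" config key are placeholders, and the exact trait and package should be verified against the jobserver version in use):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

object WordCountJob extends SparkJob {
  // Called by the jobserver before runJob; reject the request here if the config is bad.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  // The SparkContext is handed in by the jobserver -- either a per-job
  // (transient) context or a long-running (persistent) one created via REST.
  override def runJob(sc: SparkContext, config: Config): Any = {
    val input = config.getString("input")             // placeholder config key
    sc.parallelize(input.split("\\s+").toSeq).countByValue()
  }
}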