Re: spark streaming exception

2019-11-10 Thread Akshay Bhardwaj
Hi,

Could you provide the code snippet showing how you are connecting to and
reading data from Kafka?
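
For reference, a minimal sketch of the kind of connection/read snippet being
asked for, using the Structured Streaming Kafka source (broker and topic names
are placeholders; the job in the trace below appears to use the DStream API, so
your actual code will look different):

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
      .option("subscribe", "requests")                     # placeholder topic
      .load())

query = (df.selectExpr("CAST(value AS STRING) AS request")
         .writeStream
         .format("console")
         .start())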

Akshay Bhardwaj
+91-97111-33849


On Thu, Oct 17, 2019 at 8:39 PM Amit Sharma  wrote:

> Please update me if anyone knows about it.
>
>
> Thanks
> Amit
>
> On Thu, Oct 10, 2019 at 3:49 PM Amit Sharma  wrote:
>
>> Hi, we have a Spark streaming job to which we send a request through our
>> UI using Kafka. It processes the request and returns the response. We are
>> getting the below error, and the streaming job is not processing any requests.
>>
>> Listener StreamingJobProgressListener threw an exception
>> java.util.NoSuchElementException: key not found: 1570689515000 ms
>> at scala.collection.MapLike$class.default(MapLike.scala:228)
>> at scala.collection.AbstractMap.default(Map.scala:59)
>> at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
>> at
>> org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:134)
>> at
>> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
>> at
>> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29).
>>
>> Please help me find the root cause of this issue.
>>
>


Re: Build customized resource manager

2019-11-10 Thread Klaus Ma
Hmm, it would be better for me if we could build a customized resource manager
out of core; otherwise, we have to go through a long discussion in the
community :)
But if we support that, why are the Mesos/YARN/K8s resource managers still in
the tree?

On Fri, Nov 8, 2019 at 10:18 PM Tom Graves  wrote:

> I don't know if it all works, but some work was done to make the cluster
> manager pluggable; see SPARK-13904.
>
> Tom
>
> On Wednesday, November 6, 2019, 07:22:59 PM CST, Klaus Ma <
> klaus1982...@gmail.com> wrote:
>
>
> Any suggestions?
>
> - Klaus
>
> On Mon, Nov 4, 2019 at 5:04 PM Klaus Ma  wrote:
>
> Hi team,
>
> AFAIK, K8s/YARN/Mesos are built in as resource managers, but I'd like to make
> some enhancements to them, e.g. integrate with Volcano in K8s. Is it possible
> to do that without forking the whole Spark project? For example, enable a
> customized resource manager via configuration, e.g. replace
> `org.apache.spark.deploy.k8s.submit.KubernetesClientApplication` with
> `MyK8SClient`, so I only have to maintain the resource manager instead of the
> whole project.
>
> -- Klaus
>
>


announce: spark-postgres 3 released

2019-11-10 Thread Nicolas Paris
Hello spark users,

Spark-postgres is designed for reliable and performant ETL in big-data
workloads and offers read/write/SCD capabilities. Version 3 introduces
a datasource API and simplifies usage. It outperforms Sqoop by a factor
of 8 and far outperforms the Apache Spark core JDBC source.

Features:
- use of PostgreSQL COPY statements
- parallel reads/writes
- use of HDFS to store intermediate CSV files
- reindex after bulk loading
- SCD1 computations done on the Spark side
- use of unlogged tables when needed
- handles arrays and multiline string columns
- useful JDBC functions (DDL, updates, ...)
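
For illustration only, a rough sketch of what reads and writes through a Spark
datasource look like in general; the format name and option keys below are
assumptions for the sketch, not the actual spark-postgres API - see the
repository below for the real usage:

df = (spark.read
      .format("postgres")                                  # assumed datasource short name
      .option("url", "jdbc:postgresql://host:5432/mydb")   # placeholder connection string
      .option("query", "select * from my_table")           # assumed option key
      .load())

(df.write
   .format("postgres")                                     # assumed datasource short name
   .option("url", "jdbc:postgresql://host:5432/mydb")
   .option("table", "my_table_copy")                       # assumed option key
   .save())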

The official repository:
https://framagit.org/parisni/spark-etl/tree/master/spark-postgres

And its mirror on Microsoft GitHub:
https://github.com/EDS-APHP/spark-etl/tree/master/spark-postgres

-- 
nicolas




Re: [pyspark 2.3.0] Task was denied committing errors

2019-11-10 Thread Rishi Shah
Hi Team,

I could really use your insight here, any help is appreciated!

Thanks,
Rishi


On Wed, Nov 6, 2019 at 8:27 PM Rishi Shah  wrote:

> Any suggestions?
>
> On Wed, Nov 6, 2019 at 7:30 AM Rishi Shah 
> wrote:
>
>> Hi All,
>>
>> I have two relatively big tables, and a join on them keeps throwing
>> TaskCommitErrors. The job eventually succeeds, but I was wondering what these
>> errors are and whether there's any solution.
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
>
> --
> Regards,
>
> Rishi Shah
>


-- 
Regards,

Rishi Shah


Re: PySpark Pandas UDF

2019-11-10 Thread Holden Karau
Can you switch the write for a count, just so we can isolate whether the
failure is in the write itself?
Also, what output path are you using?
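
Concretely, a minimal sketch of that isolation step, reusing the code quoted
below:

res = users_data.groupby('id').apply(test_foo)
res.count()  # if this also fails, the problem is in the pandas UDF / Arrow path, not the parquet write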

On Sun, Nov 10, 2019 at 7:31 AM Gal Benshlomo 
wrote:

>
>
> Hi,
>
>
>
> I’m using pandas_udf and am not able to run it in cluster mode, even though
> the same code works in standalone mode.
>
>
>
> The code is as follows:
>
>
>
>
>
>
>
> schema_test = StructType([
>     StructField("cluster", LongType()),
>     StructField("name", StringType())
> ])
>
>
> @pandas_udf(schema_test, PandasUDFType.GROUPED_MAP)
> def test_foo(pd_df):
>     print('\n\nSid is problematic\n\n')
>     pd_df['cluster'] = 1
>     return pd_df[['name', 'cluster']]
>
>
>
>
>
> department1 = Row(id='123456', name='Computer Science')
> department2 = Row(id='789012', name='Mechanical Engineering')
> users_data = spark.createDataFrame([department1, department2])
> res = users_data.groupby('id').apply(test_foo)
>
> res.write.parquet(RES_OUTPUT_PATH, mode='overwrite')
>
>
>
>
>
> the errors I’m getting are:
>
> ERROR FileFormatWriter: Aborting job c6eefb8c-c8d5-4236-82d7-298924b03b25.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 81
> in stage 1.0 failed 4 times, most recent failure: Lost task 81.3 in stage
> 1.0 (TID 192, 10.10.1.17, executor 1): java.lang.IllegalArgumentException
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
> at
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
> at
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
> at
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
> at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
> at
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
> at
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
> at
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
> at
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
> at
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:232)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
> at scala.Option.foreach(Option.scala:257)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
> at
> 

RE: PySpark Pandas UDF

2019-11-10 Thread Gal Benshlomo

Hi,

I'm using pandas_udf and am not able to run it in cluster mode, even though the
same code works in standalone mode.

The code is as follows:




# imports required to make the snippet self-contained
from pyspark.sql import Row
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, LongType, StringType

schema_test = StructType([
    StructField("cluster", LongType()),
    StructField("name", StringType())
])


@pandas_udf(schema_test, PandasUDFType.GROUPED_MAP)
def test_foo(pd_df):
    print('\n\nSid is problematic\n\n')
    pd_df['cluster'] = 1
    return pd_df[['name', 'cluster']]


department1 = Row(id='123456', name='Computer Science')
department2 = Row(id='789012', name='Mechanical Engineering')
users_data = spark.createDataFrame([department1, department2])
res = users_data.groupby('id').apply(test_foo)

res.write.parquet(RES_OUTPUT_PATH, mode='overwrite')


the errors I'm getting are:
ERROR FileFormatWriter: Aborting job c6eefb8c-c8d5-4236-82d7-298924b03b25.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 81 in 
stage 1.0 failed 4 times, most recent failure: Lost task 81.3 in stage 1.0 (TID 
192, 10.10.1.17, executor 1): java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at 
org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
at 
org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
at 
org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
at 
org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
at 
org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:232)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at 

Re: Why Spark generates Java code and not Scala?

2019-11-10 Thread Holden Karau
If you look inside the code generation, you'll see that we generate Java code
and compile it with Janino. For interested folks, the conversation has moved
over to the dev@ list.
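
For anyone who wants to see the generated Java for themselves, here is a
minimal sketch (the "codegen" explain mode shown is available from Spark 3.0
on):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).selectExpr("id * 2 AS doubled")

# Prints the whole-stage-codegen Java source that Janino compiles at runtime.
df.explain(mode="codegen")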

On Sat, Nov 9, 2019 at 10:37 AM Marcin Tustin
 wrote:

> What do you mean by this? Spark is written in a combination of Scala and
> Java, and then compiled to Java bytecode, as is typical for both Scala and
> Java. If there's additional bytecode generation happening, it's Java bytecode,
> because the platform runs on the JVM.
>
> On Sat, Nov 9, 2019 at 12:47 PM Bartosz Konieczny 
> wrote:
>
>> Hi there,
>>
>
>> A few days ago I got an intriguing but hard-to-answer question:
>> "Why Spark generates Java code and not Scala code?"
>> (https://github.com/bartosz25/spark-scala-playground/issues/18)
>>
>> Since I'm not sure about the exact answer, I'd like to ask you to confirm
>> or refute my thinking. I looked for the reasons in JIRA and in the
>> research paper "Spark SQL: Relational Data Processing in Spark" (
>> http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf) but
>> found nothing explaining why Java over Scala. The only ticket I found was
>> about why Scala and not Java, but it concerns data types (
>> https://issues.apache.org/jira/browse/SPARK-5193). That's why I'm writing
>> here.
>>
>> My guesses about why Java code was chosen are:
>> - Java runtime compiler libraries are more mature and production-ready than
>> Scala's - or at least they were at implementation time
>> - the Scala compiler tends to be slower than the Java compiler:
>> https://stackoverflow.com/questions/3490383/java-compile-speed-vs-scala-compile-speed
>> - the Scala compiler seems to be more complex, so debugging & maintaining it
>> would be harder
>> - it was easier to represent a pure Java OO design than a mixed FP/OO design
>> in Scala
>> ?
>>
>> Thank you for your help.
>>
>> --
>> Bartosz Konieczny
>> data engineer
>> https://www.waitingforcode.com
>> https://github.com/bartosz25/
>> https://twitter.com/waitingforcode
>>
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau