Re: Problem in Spark-Kafka Connector

2017-12-27 Thread Sitakant Mishra
Hi,

Kindly help me with this problem, for which I will be grateful.

Thanks and Regards,
Sitakanta Mishra

On Tue, Dec 26, 2017 at 12:34 PM, Sitakant Mishra <
sitakanta.mis...@gmail.com> wrote:

> Hi,
>
> I am trying to connect my Spark cluster to a single Kafka topic, which is
> running as a separate process on a machine. While submitting the Spark
> application, I am getting the following error.
>
>
>
> *17/12/25 16:56:57 ERROR TransportRequestHandler: Error sending result
> StreamResponse{streamId=/jars/learning-spark-examples-assembly-0.0.1.jar,
> byteCount=186935315,
> body=FileSegmentManagedBuffer{file=/s/chopin/a/grad/skmishra/Thesis/spark-pipeline/./target/scala-2.10/learning-spark-examples-assembly-0.0.1.jar,
> offset=0, length=186935315}} to /129.82.44.156:55168
> ; closing connection*
> *java.nio.channels.ClosedChannelException*
> * at io.netty.channel.AbstractChannel$AbstractUnsafe.close(...)(Unknown
> Source)*
> *17/12/25 16:56:57 INFO TaskSetManager: Starting task 21.0 in stage 0.0
> (TID 21, 129.82.44.156, executor 9, partition 21, PROCESS_LOCAL, 4706
> bytes)*
> *17/12/25 16:56:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 129.82.44.156, executor 9): java.nio.channels.ClosedChannelException*
> * at
> org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:60)*
> * at
> org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:179)*
> * at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)*
> * at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)*
> * at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)*
> * at
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)*
> * at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)*
> * at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)*
> * at
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)*
> * at
> io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)*
> * at
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)*
> * at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)*
> * at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)*
> * at
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)*
> * at java.lang.Thread.run(Thread.java:745)*
>
> *17/12/25 16:56:57 ERROR TransportRequestHandler: Error sending result
> StreamResponse{streamId=/jars/learning-spark-examples-assembly-0.0.1.jar,
> byteCount=186935315,
> body=FileSegmentManagedBuffer{file=/s/chopin/a/grad/skmishra/Thesis/spark-pipeline/./target/scala-2.10/learning-spark-examples-assembly-0.0.1.jar,
> offset=0, length=186935315}} to /129.82.44.164:45988
> ; closing connection*
> *java.nio.channels.ClosedChannelException*
> * at io.netty.channel.AbstractChannel$AbstractUnsafe.close(...)(Unknown
> Source)*
> *17/12/25 16:56:57 ERROR TransportRequestHandler: Error sending result
> StreamResponse{streamId=/jars/learning-spark-examples-assembly-0.0.1.jar,
> byteCount=186935315,
> body=FileSegmentManagedBuffer{file=/s/chopin/a/grad/skmishra/Thesis/spark-pipeline/./target/scala-2.10/learning-spark-examples-assembly-0.0.1.jar,
> offset=0, length=186935315}} to /129.82.44.142:56136
> ; closing connection*
>
>
>
> I looked over the web and found only the following relevant link:
> https://stackoverflow.com/questions/29781489/apache-spark-network-errors-between-executors?noredirect=1=1
> I tried the suggestion given in that discussion, as shown below.
>
>
> val conf = new SparkConf()
>   .setAppName("KafkaInput")
>   .set("spark.shuffle.blockTransferService", "nio")
>
>
> But it still does not work. I am using the "spark-2.2.0-bin-hadoop2.7" build
> of Spark. Please help me with this issue, and let me know if you need any
> other information from my side.
>
>
>
> Thanks and Regards,
> Sitakanta Mishra
>


Partition Dataframe Using UDF On Partition Column

2017-12-27 Thread Richard Primera
Greetings,


In version 1.6.0, is it possible to write a partitioned dataframe into
parquet format using a UDF on the partition column? I'm using pyspark.

Let's say I have a dataframe with a column `date`, of type string or int, which
contains values such as `20170825`. Is it possible to define a UDF called
`by_month` or `by_year`, which could then be used to write the table as
parquet, ideally in this way:

*dataframe.write.format("parquet").partitionBy(by_month(dataframe["date"])).save("/some/parquet")*

I haven't even tried this, so I don't know if it's possible. If so, what are
the ways in which this can be done? Ideally, without having to resort to adding
an additional column like `part_id` to the dataframe with the result of
`by_month(date)` and partitioning by that column instead.
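
For concreteness, this is the kind of workaround I would rather avoid: a minimal
pyspark sketch where the partition value is first derived as an extra column and
then partitioned on by name (the `month` column and the substring-based
derivation are just for illustration, assuming `date` holds yyyyMMdd strings):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import substring

sc = SparkContext(appName="partition-by-month-sketch")
sqlContext = SQLContext(sc)

# Toy dataframe standing in for the real one; `date` holds yyyyMMdd strings.
dataframe = sqlContext.createDataFrame(
    [("20170825", 1.0), ("20170901", 2.0)], ["date", "value"])

# Derive the partition value as an extra column, then partition by its name.
with_month = dataframe.withColumn("month", substring(dataframe["date"], 1, 6))
with_month.write.format("parquet").partitionBy("month").save("/some/parquet")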


Thanks in advance.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark and neural networks

2017-12-27 Thread Esa Heikkinen
Hi


What would be the best way to use Spark and neural networks (especially RNN
LSTM)?


I think it would be possible with the following combination of "tools":

Pyspark + anaconda + pandas + numpy + keras + tensorflow + scikit


But what about scalability and usability with Spark (pyspark)?


How compatible are the data structures (for example dataframes) between Spark
and the other "tools"?

And is it possible to convert them between the different "tools"?
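
Here is a minimal pyspark sketch of what I mean by converting (toPandas()
collects everything to the driver, so I assume it only suits data that fits in
driver memory; the column names are purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-interop-sketch").getOrCreate()

# Spark DataFrame -> pandas DataFrame (collects all rows to the driver)
spark_df = spark.createDataFrame([(1, 0.5), (2, 1.5)], ["id", "feature"])
pandas_df = spark_df.toPandas()

# The pandas/numpy side can then be handed to keras, scikit-learn, etc.
features = pandas_df["feature"].values  # a numpy array

# pandas DataFrame -> Spark DataFrame
back_to_spark = spark.createDataFrame(pandas_df)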


---

Eras


Re: Standalone Cluster: ClassNotFound org.apache.kafka.common.serialization.ByteArrayDeserializer

2017-12-27 Thread Geoff Von Allmen
I’ve tried it both ways.

The uber jar gives me the following:

   - Caused by: java.lang.ClassNotFoundException: Failed to find data
   source: kafka. Please find packages at
   http://spark.apache.org/third-party-projects.html

If I only do minimal packaging, add
org.apache.spark_spark-sql-kafka-0-10_2.11-2.2.0.jar as a --package, and then
add it to the --driver-class-path, I get past that error, but then I get the
error I showed in the original post.

I agree that it seems to be missing the kafka-clients jar, since that is where
ByteArrayDeserializer lives, though as far as I can tell it is present.
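
One way I know of to double-check whether the class is actually inside a given
jar (the path here is just a placeholder for wherever the jar really lives):

jar tf /path/to/org.apache.kafka_kafka-clients-0.10.0.1.jar | grep ByteArrayDeserializer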

I can see the following two packages in the ClassPath entries on the
history server (Though the source shows: **(redacted) — not sure
why?)

   - spark://:/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar
   -
   spark://:/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.2.0.jar

As a side note, I'm running both a master and a worker on the same system just
to test out running in cluster mode. Not sure if that has anything to do with
it; I would think it would make things easier, since everything has access to
the same file system... but I'm pretty new to Spark.

I have also read through and followed those instructions as well as many
others at this point.

Thanks!
​

On Wed, Dec 27, 2017 at 12:56 AM, Eyal Zituny 
wrote:

> Hi,
> it seems that you're missing the kafka-clients jar (and probably some
> other dependencies as well).
> How did you package your application jar? Does it include all the
> required dependencies (as an uber jar)?
> If it's not an uber jar, you need to pass, via the driver-class-path and the
> executor-class-path, all the files/dirs where your dependencies can be found
> (note that these must be accessible from every node in the cluster).
> I suggest going over the manual
> 
>
> Eyal
>
>
> On Wed, Dec 27, 2017 at 1:08 AM, Geoff Von Allmen 
> wrote:
>
>> I am trying to deploy a standalone cluster but running into ClassNotFound
>> errors.
>>
>> I have tried many different approaches, ranging from packaging all
>> dependencies into a single JAR to using the --packages
>> and --driver-class-path options.
>>
>> I’ve got a master node started, a slave node running on the same system,
>> and am using spark submit to get the streaming job kicked off.
>>
>> Here is the error I’m getting:
>>
>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at 
>> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>> at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>> Caused by: java.lang.NoClassDefFoundError: 
>> org/apache/kafka/common/serialization/ByteArrayDeserializer
>> at 
>> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:376)
>> at 
>> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
>> at 
>> org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:323)
>> at 
>> org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:60)
>> at 
>> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:198)
>> at 
>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:88)
>> at 
>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:88)
>> at 
>> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>> at 
>> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
>> at com.Customer.start(Customer.scala:47)
>> at com.Main$.main(Main.scala:23)
>> at com.Main.main(Main.scala)
>> ... 6 more
>> Caused by: java.lang.ClassNotFoundException: 
>> org.apache.kafka.common.serialization.ByteArrayDeserializer
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 18 more
>>
>> Here is the spark submit command I’m using:
>>
>> ./spark-submit \
>> --master spark://: \
>> --files jaas.conf \
>> --deploy-mode cluster \
>> --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" \
>> --conf 
>> "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"
>>  \
>> --packages 

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-27 Thread Eyal Zituny
Hi
If you're interested in stopping your Spark application externally, you will
probably need a way to communicate with the Spark driver (which starts and
holds a reference to the Spark context).
This can be done by adding some code to the driver app, for example (see the
sketch after the list):

   - you can expose a REST API that stops the query and the Spark context
   - if running in client mode, you can listen to stdin
   - you can also listen to an external system (like Kafka)
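
A minimal pyspark sketch of the second option: a background thread on the
driver blocks on stdin and stops the query when a line is typed (the rate
source, console sink and all names are just placeholders for the real
streaming query):

import sys
import threading

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graceful-stop-sketch").getOrCreate()

# Placeholder streaming query; swap in the real source and sink.
query = (spark.readStream.format("rate").load()
         .writeStream.format("console").start())

def wait_for_stop():
    # Any line typed at the driver console triggers a graceful stop.
    sys.stdin.readline()
    query.stop()   # stop the streaming query
    spark.stop()   # then shut down the Spark context

stopper = threading.Thread(target=wait_for_stop)
stopper.daemon = True
stopper.start()

query.awaitTermination()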

Eyal

On Tue, Dec 26, 2017 at 10:37 PM, M Singh 
wrote:

> Thanks Diogo.  My question is how to gracefully call the stop method while
> the streaming application is running in a cluster.
>
>
>
>
> On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira <
> diogo.mun...@corp.globo.com> wrote:
>
>
> Hi M Singh! Here I'm using query.stop()
>
> On Dec 25, 2017 at 19:19, "M Singh"  wrote:
>
> Hi:
> Are there any patterns/recommendations for gracefully stopping a
> structured streaming application ?
> Thanks
>
>
>
>
>


Re: Standalone Cluster: ClassNotFound org.apache.kafka.common.serialization.ByteArrayDeserializer

2017-12-27 Thread Eyal Zituny
Hi,
it seems that you're missing the kafka-clients jar (and probably some other
dependencies as well).
How did you package your application jar? Does it include all the required
dependencies (as an uber jar)?
If it's not an uber jar, you need to pass, via the driver-class-path and the
executor-class-path, all the files/dirs where your dependencies can be found
(note that these must be accessible from every node in the cluster); see the
sketch below for an example.
I suggest going over the manual
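
A minimal sketch of what passing those classpaths could look like (the
kafka-clients path and the master URL are placeholders; the class name is
taken from the stack trace and the jar name from the original command):

./spark-submit \
  --master spark://<master-host>:7077 \
  --driver-class-path /path/to/kafka-clients-0.10.0.1.jar \
  --conf spark.executor.extraClassPath=/path/to/kafka-clients-0.10.0.1.jar \
  --class com.Main \
  my_jar.jar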


Eyal


On Wed, Dec 27, 2017 at 1:08 AM, Geoff Von Allmen 
wrote:

> I am trying to deploy a standalone cluster but running into ClassNotFound
> errors.
>
I have tried many different approaches, ranging from packaging all
dependencies into a single JAR to using the --packages and
--driver-class-path options.
>
> I’ve got a master node started, a slave node running on the same system,
> and am using spark submit to get the streaming job kicked off.
>
> Here is the error I’m getting:
>
> Exception in thread "main" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
> at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/kafka/common/serialization/ByteArrayDeserializer
> at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:376)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:323)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:60)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:198)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:88)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:88)
> at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
> at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
> at com.Customer.start(Customer.scala:47)
> at com.Main$.main(Main.scala:23)
> at com.Main.main(Main.scala)
> ... 6 more
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.kafka.common.serialization.ByteArrayDeserializer
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 18 more
>
> Here is the spark submit command I’m using:
>
> ./spark-submit \
> --master spark://: \
> --files jaas.conf \
> --deploy-mode cluster \
> --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" \
> --conf 
> "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"
>  \
> --packages org.apache.spark:spark-sql-kafka-0-10_2.11 \
> --driver-class-path 
> ~/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.2.1.jar \
> --class  \
> --verbose \
> my_jar.jar
>
> I’ve tried all sorts of combinations of including different packages and
> driver-class-path jar files. As far as I can tell, the serializer should be
> in the kafka-clients jar file, which I’ve tried including, with no success.
>
> Pom Dependencies are as follows:
>
> <dependencies>
>   <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>2.11.12</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>     <version>2.2.1</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.11</artifactId>
>     <version>2.2.1</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.11</artifactId>
>     <version>2.2.1</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
>     <version>2.2.1</version>
>   </dependency>
>   <dependency>
>     <groupId>mysql</groupId>
>     <artifactId>mysql-connector-java</artifactId>
>     <version>8.0.8-dmr</version>
>   </dependency>
>   <dependency>
>     <groupId>joda-time</groupId>
>     <artifactId>joda-time</artifactId>
>     <version>2.9.9</version>
>   </dependency>
> </dependencies>
>
> If I remove --deploy-mode and run it as client … it works just fine.
>
> Thanks Everyone -
>
> Geoff V.
> ​
>