Re: [Structured Streaming] NullPointerException in long running query

2020-04-29 Thread Shixiong(Ryan) Zhu
The stack trace is omitted by the JVM when an exception is thrown too many times. This usually happens when multiple Spark tasks on the same executor JVM throw the same exception. See https://stackoverflow.com/a/3010106 Best Regards, Ryan On Tue, Apr 28, 2020 at 10:45 PM lec ssmi
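
For reference, a minimal sketch of keeping those stack traces, assuming the Stack Overflow answer's advice of disabling the JVM's OmitStackTraceInFastThrow optimization (the conf key is standard Spark; the executor side is the relevant one here):

    // Ask the JVM not to drop stack traces for frequently thrown exceptions
    // (a sketch; set before the SparkContext is created)
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:-OmitStackTraceInFastThrow")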

Re: Spark 2.3 and Kafka client library version

2020-04-28 Thread Shixiong(Ryan) Zhu
You should be able to override the Kafka client version. The Kafka APIs used by Structured Streaming exist in newer Kafka versions. There is a known correctness issue in Kafka 0.10.1.*; other versions should be fine. Best Regards, Ryan On Tue,

Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-31 Thread Shixiong(Ryan) Zhu
The reason for this is that Spark RPC and the persisted state of HA mode both use Java serialization for internal classes, which have no compatibility guarantee. Best Regards, Ryan On Fri, Jan 31, 2020 at 9:08 AM Shixiong(Ryan) Zhu wrote: > Unfortunately, Spark standalone m

Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-31 Thread Shixiong(Ryan) Zhu
Unfortunately, Spark standalone mode doesn't support rolling update. All Spark components (master, worker, driver) must be updated to the same version. When using HA mode, the states persisted in zookeeper (or files if not using zookeeper) need to be cleaned because they are not compatible between

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Shixiong(Ryan) Zhu
I recommend using Structured Streaming, as it has a patch that can work around this issue: https://issues.apache.org/jira/browse/SPARK-26267 Best Regards, Ryan On Tue, Apr 30, 2019 at 3:34 PM Shixiong(Ryan) Zhu wrote: > There is a known issue that Kafka may return a wrong offset e

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Shixiong(Ryan) Zhu
There is a known issue that Kafka may return a wrong offset even if there is no reset happening: https://issues.apache.org/jira/browse/KAFKA-7703 Best Regards, Ryan On Tue, Apr 30, 2019 at 10:41 AM Austin Weaver wrote: > @deng - There was a short erroneous period where 2 streams were reading

Re: Spark streaming error - Query terminated with exception: assertion failed: Invalid batch: a#660,b#661L,c#662,d#663,,… 26 more fields != b#1291L

2019-04-01 Thread Shixiong(Ryan) Zhu
Could you try to use $”a” rather than df(“a”)? The latter one sometimes doesn’t work. On Thu, Mar 21, 2019 at 10:41 AM kineret M wrote: > I try to read a stream using my custom data source (v2, using spark 2.3), > and it fails *in the second iteration* with the following exception while >
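
A minimal sketch of the suggested syntax (assumes the usual implicits import is in scope):

    import spark.implicits._ // enables the $"col" interpolator

    // instead of df("a"):
    val selected = df.select($"a", $"b")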

Re: Spark Kafka Batch Write guarantees

2019-04-01 Thread Shixiong(Ryan) Zhu
The Kafka sink doesn’t support transactions. You may see partial data or duplicated data if a Spark task fails. On Wed, Mar 27, 2019 at 1:15 AM hemant singh wrote: > We are using spark batch to write Dataframe to Kafka topic. The spark > write function with write.format(source = Kafka). > Does

Re: Structured streaming from Kafka by timestamp

2019-01-24 Thread Shixiong(Ryan) Zhu
Hey Tomas, From your description, you just ran a batch query rather than a Structured Streaming query. The Kafka data source doesn't support filter pushdown right now, but that's definitely doable. One workaround here is setting proper "startingOffsets" and "endingOffsets" options when loading
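
A sketch of that workaround as a bounded batch read (topic, servers, and offsets are illustrative; you would resolve offsets from timestamps yourself, e.g. with the Kafka consumer's offsetsForTimes):

    // Batch read of an explicit offset range per partition
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "events")
      .option("startingOffsets", """{"events":{"0":100,"1":100}}""")
      .option("endingOffsets", """{"events":{"0":500,"1":500}}""")
      .load()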

Re: Spark Streaming join taking long to process

2018-11-27 Thread Shixiong(Ryan) Zhu
If you are using the same code to run on Yarn, I believe it’s still using local mode, as it overwrites the master URL set by the CLI. You can check the “executors” tab in the Spark UI to see how many executors are running, and verify it matches your config. On Tue, Nov 27, 2018 at 6:17 AM

Re: Recreate Dataset from list of Row in spark streaming application.

2018-10-05 Thread Shixiong(Ryan) Zhu
oracle.insight.spark.identifier_processor.ForeachFederatedEventProcessor is a ForeachWriter. Right? You cannot use SparkSession in its process method, as it runs on executors. Best Regards, Ryan On Fri, Oct 5, 2018 at 6:54 AM Kuttaiah Robin wrote: > Hello, > > I have a spark streaming
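
For context, a skeletal ForeachWriter; all three callbacks run on executors, so none of them may touch the SparkSession (a sketch, not the poster's class):

    import org.apache.spark.sql.{ForeachWriter, Row}

    val writer = new ForeachWriter[Row] {
      def open(partitionId: Long, version: Long): Boolean = true // open connections here
      def process(row: Row): Unit = {
        // executor-side code only: plain clients, no SparkSession
      }
      def close(errorOrNull: Throwable): Unit = () // release resources here
    }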

Re: PySpark structured streaming job throws socket exception

2018-10-04 Thread Shixiong(Ryan) Zhu
As far as I know, the error log in updateAccumulators will not fail a Spark task. Did you see other error messages? Best Regards, Ryan On Thu, Oct 4, 2018 at 2:14 PM mmuru wrote: > Hi, > > Running Pyspark structured streaming job on K8S with 2 executor pods. The > driver pod failed with the

Re: Kafka Connector version support

2018-09-21 Thread Shixiong(Ryan) Zhu
-dev +user We don't backport new features to a maintenance branch. All new updates will be just in 2.4. Best Regards, Ryan On Fri, Sep 21, 2018 at 2:44 PM, Basil Hariri < basil.har...@microsoft.com.invalid> wrote: > Hi all, > > > > Are there any plans to backport the recent (2.4) updates to

Re: Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-03 Thread Shixiong(Ryan) Zhu
Which version are you using? There is a known issue regarding this and should be fixed in 2.3.1. See https://issues.apache.org/jira/browse/SPARK-23623 for details. Best Regards, Ryan On Mon, Jul 2, 2018 at 3:56 AM, kant kodali wrote: > Hi All, > > I get the below error quite often when I do an

Re: [structured-streaming][kafka] Will the Kafka readstream timeout after connections.max.idle.ms 540000 ms ?

2018-05-16 Thread Shixiong(Ryan) Zhu
The streaming query should keep polling data from Kafka. When the query was stopped, did you see any exception? Best Regards, Shixiong Zhu Databricks Inc. shixi...@databricks.com databricks.com [image: http://databricks.com]

Re: Continuous Processing mode behaves differently from Batch mode

2018-05-16 Thread Shixiong(Ryan) Zhu
One possible case is you don't have enough resources to launch all tasks for your continuous processing query. Could you check the Spark UI and see if all tasks are running rather than waiting for resources? Best Regards, Shixiong Zhu Databricks Inc. shixi...@databricks.com databricks.com

Re: I cannot use spark 2.3.0 and kafka 0.9?

2018-05-08 Thread Shixiong(Ryan) Zhu
"note that the 0.8 integration is compatible with later 0.9 and 0.10 brokers, but the 0.10 integration is not compatible with earlier brokers." This is pretty clear. You can use 0.8 integration to talk to 0.9 broker. Best Regards, Shixiong Zhu Databricks Inc. shixi...@databricks.com

Re: Structured Streaming + Kafka - Corrupted Checkpoint Offsets / Commits

2018-01-04 Thread Shixiong(Ryan) Zhu
The root cause is probably that HDFSMetadataLog ignores exceptions thrown by "output.close". I think this should be fixed by this line in Spark 2.2.1 and 2.3.0: https://github.com/apache/spark/commit/6edfff055caea81dc3a98a6b4081313a0c0b0729#diff-aaeb546880508bb771df502318c40a99L126 Could you try

Re: Standalone Cluster: ClassNotFound org.apache.kafka.common.serialization.ByteArrayDeserializer

2017-12-28 Thread Shixiong(Ryan) Zhu
The cluster mode doesn't upload jars to the driver node. This is a known issue: https://issues.apache.org/jira/browse/SPARK-4160 On Wed, Dec 27, 2017 at 1:27 AM, Geoff Von Allmen wrote: > I’ve tried it both ways. > > Uber jar gives me gives me the following: > >-

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread Shixiong(Ryan) Zhu
You are using the Spark Streaming Kafka package. The correct package name is "org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0" On Mon, Nov 20, 2017 at 4:15 PM, salemi wrote: > Yes, we are using --packages > > $SPARK_HOME/bin/spark-submit --packages >

Re: How to print plan of Structured Streaming DataFrame

2017-11-20 Thread Shixiong(Ryan) Zhu
-dev +user Which Spark version are you using? There is a bug in older Spark versions; try the latest version. In addition, you can call `query.explain()` as well. On Mon, Nov 20, 2017 at 4:00 AM, Chang Chen wrote: > Hi Guys > > I modified StructuredNetworkWordCount to

Re: spark-stream memory table global?

2017-11-10 Thread Shixiong(Ryan) Zhu
It must be accessed under the same SparkSession. We could also add an option to make it a global temp view. Feel free to open a PR to improve it. On Fri, Nov 10, 2017 at 4:56 AM, Imran Rajjad wrote: > Hi, > > Does the memory table in which spark-structured streaming results

Re: Structured Stream in Spark

2017-10-27 Thread Shixiong(Ryan) Zhu
The code in the link writes the data into files. Did you check the output location? By the way, if you want to see the data on the console, you can use the console sink by changing this line *format("parquet").option("path", outputPath + "/ETL").partitionBy("creationTime").start()* to
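
A sketch of that console-sink change (the parquet line is from the thread; the console options are illustrative):

    // Print each micro-batch to stdout instead of writing parquet files
    val query = df.writeStream
      .format("console")
      .option("truncate", "false")
      .start()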

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread Shixiong(Ryan) Zhu
It's because "toJSON" doesn't support Structured Streaming. The current implementation will convert the Dataset to an RDD, which is not supported by streaming queries. On Sat, Sep 9, 2017 at 4:40 PM, kant kodali wrote: > yes it is a streaming dataset. so what is the problem
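
For what it's worth, a streaming-friendly alternative not mentioned in the thread is the to_json expression added in Spark 2.1, which stays in the Dataset world; a sketch:

    import org.apache.spark.sql.functions.{col, struct, to_json}

    // to_json is a column expression, so it works on streaming Datasets (2.1+)
    val jsonDf = df.select(to_json(struct(df.columns.map(col): _*)).as("value"))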

Re: [Structured Streaming]Data processing and output trigger should be decoupled

2017-08-30 Thread Shixiong(Ryan) Zhu
I don't think that's a good idea. If the engine keeps on processing data but doesn't output anything, where would it keep the intermediate data? On Wed, Aug 30, 2017 at 9:26 AM, KevinZwx wrote: > Hi, > > I'm working with structured streaming, and I'm wondering whether there >

Re: PySpark, Structured Streaming and Kafka

2017-08-23 Thread Shixiong(Ryan) Zhu
You can use `bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0` to start "pyspark". If you want to use "spark-submit", you also need to provide your Python file. On Wed, Aug 23, 2017 at 1:41 PM, Brian Wylie wrote: > Hi All, > > I'm trying the new

Re: [StructuredStreaming] multiple queries of the socket source: only one query works.

2017-08-12 Thread Shixiong(Ryan) Zhu
Spark creates one connection for each query. The behavior you observed is because of how "nc -lk" works. If you use `netstat` to check the TCP connections, you will see there are two connections when starting two queries. However, "nc" forwards the input to only one connection. On Fri, Aug 11, 2017

Re: spark streaming socket read issue

2017-06-30 Thread Shixiong(Ryan) Zhu
Could you show the code that starts the StreamingQuery from the Dataset? If you don't call `writeStream.start(...)`, it won't run anything. On Fri, Jun 30, 2017 at 6:47 AM, pradeepbill wrote: > hi there, I have a spark streaming issue that i am not able to figure out ,

Re: Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread Shixiong(Ryan) Zhu
"--package" will add transitive dependencies that are not "$SPARK_HOME/external/kafka-0-10-sql/target/*.jar". > i have tried building the jar with dependencies, but still face the same error. What's the command you used? On Wed, Jun 28, 2017 at 12:00 PM, satyajit vegesna <

Re: ZeroMQ Streaming in Spark2.x

2017-06-26 Thread Shixiong(Ryan) Zhu
It's moved to http://bahir.apache.org/ You can find document there. On Mon, Jun 26, 2017 at 11:58 AM, Aashish Chaudhary < aashish.chaudh...@kitware.com> wrote: > Hi there, > > I am a beginner when it comes to Spark streaming. I was looking for some > examples related to ZeroMQ and Spark and

Re: Error while doing mvn release for spark 2.0.2 using scala 2.10

2017-06-19 Thread Shixiong(Ryan) Zhu
Some of the projects (such as spark-tags) are Java projects. Spark doesn't fix the artifact name and just hard-codes 2.11. For your issue, try using `install` rather than `package`. On Sat, Jun 17, 2017 at 7:20 PM, Kanagha Kumar wrote: > Hi, > > Bumping up again! Why does

Re: Is Structured streaming ready for production usage

2017-06-08 Thread Shixiong(Ryan) Zhu
Please take a look at http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html On Thu, Jun 8, 2017 at 4:46 PM, swetha kasireddy wrote: > OK. Can we use Spark Kafka Direct with Structured Streaming? > > On Thu, Jun 8, 2017 at 4:46 PM, swetha

Re: Running into the same problem as JIRA SPARK-19268

2017-05-25 Thread Shixiong(Ryan) Zhu
I don't know what happened in your case so cannot provide any work around. It would be great if you can provide logs output by HDFSBackedStateStoreProvider. On Thu, May 25, 2017 at 4:05 PM, kant kodali <kanth...@gmail.com> wrote: > > On Thu, May 25, 2017 at 3:41 PM, Shixiong(Ryan) Z

Re: Running into the same problem as JIRA SPARK-19268

2017-05-25 Thread Shixiong(Ryan) Zhu
nt/state >> However it still fails with >> *org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /usr/local/hadoop/checkpoint/state/0/1/1.delta* >>

Re: Running into the same problem as JIRA SPARK-19268

2017-05-25 Thread Shixiong(Ryan) Zhu
ion variable >> like HADOOP_CONF_DIR ? I am currently not setting that in >> conf/spark-env.sh and thats the only hadoop related environment variable I >> see. please let me know >> >> thanks! >> >> >> >> On Thu, May 25, 2017 at 1:19 AM, kant kodali <kanth

Re: Running into the same problem as JIRA SPARK-19268

2017-05-25 Thread Shixiong(Ryan) Zhu
rver.java:2045) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at org.apache.hadoop.ipc.Server$Handler.r

Re: Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread Shixiong(Ryan) Zhu
What's the value of "hdfsCheckPointDir"? Could you list this directory on HDFS and report the files there? On Wed, May 24, 2017 at 3:50 PM, Michael Armbrust wrote: > -dev > > Have you tried clearing out the checkpoint directory? Can you also give > the full stack trace?

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread Shixiong(Ryan) Zhu
The default "startingOffsets" is "latest". If you don't push any data after starting the query, it won't fetch anything. You can set it to "earliest" like ".option("startingOffsets", "earliest")" to start the stream from the beginning. On Tue, May 16, 2017 at 12:36 AM, kant kodali

Re: Application dies, Driver keeps on running

2017-05-15 Thread Shixiong(Ryan) Zhu
So you are using `client` mode. Right? If so, Spark cluster doesn't manage the driver for you. Did you see any error logs in driver? On Mon, May 15, 2017 at 3:01 PM, map reduced wrote: > Hi, > > Setup: Standalone cluster with 32 workers, 1 master > I am running a long

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Shixiong(Ryan) Zhu
This is because RDD.union doesn't check the schema, so you won't see the problem unless you run the RDD and hit the incompatible column problem. For RDDs, you may not see any error if you don't use the incompatible column. Dataset.union requires compatible schemas. You can print ds.schema and

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Shixiong(Ryan) Zhu
mapPartitionsWithSplit was removed in Spark 2.0.0. You can use mapPartitionsWithIndex instead. On Tue, Mar 28, 2017 at 3:52 PM, Anahita Talebi wrote: > Thanks. > I tried this one, as well. Unfortunately I still get the same error. > > > On Wednesday, March 29, 2017,
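
A sketch of the replacement (hypothetical RDD; the index parameter plays the role the old split number did):

    // mapPartitionsWithIndex hands each partition its index plus an iterator
    val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
      iter.map(x => (partitionIndex, x))
    }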

Re: Failed to connect to master ...

2017-03-07 Thread Shixiong(Ryan) Zhu
The Spark master may bind to a different address. Take a look at this page to find the correct URL: http://VM_IPAddress:8080/ On Tue, Mar 7, 2017 at 10:13 PM, Mina Aslani wrote: > Master and worker processes are running! > > On Wed, Mar 8, 2017 at 12:38 AM, ayan guha

Re: Structured Streaming - Kafka

2017-03-07 Thread Shixiong(Ryan) Zhu
Good catch. Could you create a ticket? You can also submit a PR to fix it if you have time :) On Tue, Mar 7, 2017 at 1:52 PM, Bowden, Chris wrote: > Potential bug when using startingOffsets = SpecificOffsets with Kafka > topics containing uppercase characters? > >

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-28 Thread Shixiong(Ryan) Zhu
The REST APIs are not just for Spark history server. When an application is running, you can use the REST APIs to talk to Spark UI HTTP server as well. On Tue, Feb 28, 2017 at 10:46 AM, Parag Chaudhari wrote: > ping... > > > > *Thanks,Parag Chaudhari,**USC Alumnus (Fight

Re: Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Shixiong(Ryan) Zhu
It's documented here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-broadcast-variables-and-checkpoints On Tue, Feb 7, 2017 at 8:12 AM, Amit Sela wrote: > Hi all, > > I was wondering if anyone ever used a broadcast variable within > an

Re: Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread Shixiong(Ryan) Zhu
You can create lazily instantiated singleton instances. See http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-broadcast-variables-and-checkpoints for examples of accumulators and broadcast variables. You can use the same approach to create your cached RDD. On Tue,
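
A sketch of that pattern for a broadcast variable, modeled on the programming-guide example (names and contents are illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object WordBlacklist {
      @volatile private var instance: Broadcast[Seq[String]] = _

      // Lazily created once per JVM, so it survives checkpoint recovery
      def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.broadcast(Seq("a", "b", "c"))
            }
          }
        }
        instance
      }
    }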

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
> } > > private void initialize() { > pool = new ConcurrentLinkedQueue<KafkaProducer<String, String>>(); > > for (int i = 0; i < minIdle; i++) { > pool.add(createProducer()); > } >} > >public void close

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
> String>(topic, message); > > if (async) { > producer.send(record); > } else { > try { > producer.send(record).get(); > } catch (Exception e) { > e.printStackTrace(); > }

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
> On Tue, Jan 31, 2017 at 1:45 PM Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Could you provide your Spark version please? >> >> On Tue, Jan 31, 2017 at 10:37 AM, Nipun Arora <nipunarora2...@gmail.com> >> wrote: >>

Re: Question about "Output Op Duration" in SparkStreaming Batch details UX

2017-01-31 Thread Shixiong(Ryan) Zhu
It means the total time to run a batch, including the Spark job duration + time spent on the driver. E.g., foreachRDD { rdd => rdd.count() // say this takes 1 second. Thread.sleep(10000) // sleep 10 seconds } In the above example, the Spark job duration is 1 second and the output op

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Shixiong(Ryan) Zhu
Could you provide your Spark version please? On Tue, Jan 31, 2017 at 10:37 AM, Nipun Arora wrote: > Hi, > > I get a resource leak, where the number of file descriptors in spark > streaming keeps increasing. We end up with a "too many file open" error > eventually

Re: kafka structured streaming source refuses to read

2017-01-27 Thread Shixiong(Ryan) Zhu
Thanks for reporting this. Which Spark version are you using? Could you provide the full log, please? On Fri, Jan 27, 2017 at 10:24 AM, Koert Kuipers wrote: > i checked my topic. it has 5 partitions but all the data is written to a > single partition: wikipedia-2 > i turned

Re: Using mapWithState without a checkpoint

2017-01-23 Thread Shixiong(Ryan) Zhu
Even if you don't need the checkpointing data for recovery, "mapWithState" still needs to use "checkpoint" to cut off the RDD lineage. On Mon, Jan 23, 2017 at 12:30 AM, shyla deshpande wrote: > Hello spark users, > > I do have the same question as Daniel. > > I would
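
In practice that means the context needs a checkpoint directory before mapWithState runs, even when you never restart from it; a minimal sketch (path and interval are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(conf, Seconds(10))
    // Required by mapWithState: checkpointing cuts the state RDD lineage
    ssc.checkpoint("/tmp/spark-checkpoints")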

Re: Setting startingOffsets to earliest in structured streaming never catches up

2017-01-22 Thread Shixiong(Ryan) Zhu
Which Spark version are you using? If you are using 2.1.0, could you use the monitoring APIs ( http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries) to check the input rate and the processing rate? One possible issue is that the Kafka source

Fwd: Spark streaming app that processes Kafka DStreams produces no output and no error

2017-01-20 Thread Shixiong(Ryan) Zhu
-- Forwarded message -- From: Shixiong(Ryan) Zhu <shixi...@databricks.com> Date: Fri, Jan 20, 2017 at 12:06 PM Subject: Re: Spark streaming app that processes Kafka DStreams produces no output and no error To: shyla deshpande <deshpandesh...@gmail.com> That's how K

Re: How to access metrics for Structured Streaming 2.1

2017-01-17 Thread Shixiong(Ryan) Zhu
You can use the monitoring APIs of Structured Streaming to get metrics. See http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries On Tue, Jan 17, 2017 at 5:01 PM, Heji Kim wrote: > Hello. We are trying to
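
A sketch of reading those metrics from a started query (sink and options are illustrative):

    val query = df.writeStream.format("console").start()

    // Most recent progress: input rate, processing rate, batch durations, etc.
    println(query.lastProgress)
    // Or the last few progress updates retained by the engine
    query.recentProgress.foreach(println)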

Re: [Spark Streaming] NoClassDefFoundError : StateSpec

2017-01-12 Thread Shixiong(Ryan) Zhu
You can find the Spark version of spark-submit in the log. Could you check if it's not consistent? On Thu, Jan 12, 2017 at 7:35 AM Ramkumar Venkataraman < ram.the.m...@gmail.com> wrote: > Spark: 1.6.1 > > I am trying to use the new mapWithState API and I am getting the following > error: > >

Re: structured streaming polling timeouts

2017-01-11 Thread Shixiong(Ryan) Zhu
2. > > Do you have any other suggestions for additional parameters to adjust > besides "kafkaConsumer.pollTimeoutMs"? > > On Wed, Jan 11, 2017 at 11:17 AM Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> You can increase the timeout using the opt

Re: structured streaming polling timeouts

2017-01-11 Thread Shixiong(Ryan) Zhu
You can increase the timeout using the option "kafkaConsumer.pollTimeoutMs". In addition, I would recommend you try Spark 2.1.0 as there are many improvements in Structured Streaming. On Wed, Jan 11, 2017 at 11:05 AM, Timothy Chan wrote: > I'm using Spark 2.0.2 and running
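
For example (the value is illustrative; the option is read by the Kafka source):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "mytopic")
      .option("kafkaConsumer.pollTimeoutMs", "120000") // raise the poll timeout
      .load()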

Re: Does MapWithState follow with a shuffle ?

2016-11-29 Thread Shixiong(Ryan) Zhu
Right. And you can specify the partitioner via "StateSpec.partitioner(partitioner: Partitioner)". On Tue, Nov 29, 2016 at 1:16 PM, Amit Sela wrote: > Hi all, > > I've been digging into MapWithState code (branch 1.6), and I came across > the compute >
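
A sketch of wiring the partitioner in (the update function and partition count are illustrative):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.StateSpec

    // updateFunc: a hypothetical (key, Option[value], State[S]) => mapped-value function
    val spec = StateSpec
      .function(updateFunc _)
      .partitioner(new HashPartitioner(32))
    val stateStream = keyedStream.mapWithState(spec)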

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-23 Thread Shixiong(Ryan) Zhu
> On 22 November 2016 at 22:13, Shixiong(Ryan) Zhu <shixi...@databricks.com> > wrote: > >> The workaround is defining the imports and class together using ":paste". >> >> On Tue, Nov 22, 2016 at 11:12 AM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com>

Re: getting error on spark streaming : java.lang.OutOfMemoryError: unable to create new native thread

2016-11-22 Thread Shixiong(Ryan) Zhu
Possibly https://issues.apache.org/jira/browse/SPARK-17396 On Tue, Nov 22, 2016 at 1:42 PM, Mohit Durgapal wrote: > Hi Everyone, > > > I am getting the following error while running a spark streaming example > on my local machine, the data being ingested is only 506kb. > > >

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-22 Thread Shixiong(Ryan) Zhu
The workaround is defining the imports and class together using ":paste". On Tue, Nov 22, 2016 at 11:12 AM, Shixiong(Ryan) Zhu < shixi...@databricks.com> wrote: > This relates to a known issue: https://issues.apache. > org/jira/browse/SPARK-14146 and https://issues.scala

Re: Pasting into spark-shell doesn't work for Databricks example

2016-11-22 Thread Shixiong(Ryan) Zhu
This relates to a known issue: https://issues.apache.org/jira/browse/SPARK-14146 and https://issues.scala-lang.org/browse/SI-9799 On Tue, Nov 22, 2016 at 6:37 AM, dbolshak wrote: > Hello, > > We have the same issue, > > We use latest release 2.0.2. > > Setup with

Re: HiveContext.getOrCreate not accessible

2016-11-17 Thread Shixiong(Ryan) Zhu
`SQLContext.getOrCreate` will return the HiveContext you created. On Mon, Nov 14, 2016 at 11:17 PM, Praseetha wrote: > > Hi All, > > > I have a streaming app and when i try invoking the > HiveContext.getOrCreate, it errors out with the following stmt. 'object >
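
A sketch of that behavior (Spark 1.x API):

    import org.apache.spark.sql.SQLContext

    // If a HiveContext was created first, getOrCreate returns that same instance
    val ctx = SQLContext.getOrCreate(sc)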

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-17 Thread Shixiong(Ryan) Zhu
aultInstance = com.example.protos.demo.Person( >> ) >> implicit class PersonLens[UpperPB](_l: com.trueaccord.lenses.Lens[UpperPB, >> com.example.protos.demo.Person]) extends >> com.trueaccord.lenses.ObjectLens[UpperPB, >> com.example.protos.demo.Person](_l) { >>

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-16 Thread Shixiong(Ryan) Zhu
Could you provide the Person class? On Wed, Nov 16, 2016 at 1:19 PM, shyla deshpande <deshpandesh...@gmail.com> wrote: > I am using 2.11.8. Thanks > > On Wed, Nov 16, 2016 at 1:15 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Which Scala versio

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-16 Thread Shixiong(Ryan) Zhu
Which Scala version are you using? Is it Scala 2.10? Scala 2.10 has some known race conditions in reflection, and the Scala community has no plans to fix it ( http://docs.scala-lang.org/overviews/reflection/thread-safety.html) AFAIK, the only way to fix it is upgrading to Scala 2.11. On Wed,

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-10 Thread Shixiong(Ryan) Zhu
Yeah, the KafkaRDD cannot be reused. It's better to document it. On Thu, Nov 10, 2016 at 8:26 AM, Ivan von Nagy wrote: > Ok, I have split he KafkaRDD logic to each use their own group and bumped > the poll.ms to 10 seconds. Anything less then 2 seconds on the poll.ms > ends up

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Shixiong(Ryan) Zhu
Yes, try 2.0.1! On Tue, Nov 1, 2016 at 11:25 AM, kant kodali <kanth...@gmail.com> wrote: > AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0 > > On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Dstream "Wi

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread Shixiong(Ryan) Zhu
Dstream "Window" uses "union" to combine multiple RDDs in one window into a single RDD. On Tue, Nov 1, 2016 at 2:59 AM kant kodali wrote: > @Sean It looks like this problem can happen with other RDD's as well. Not > just unionRDD > > On Tue, Nov 1, 2016 at 2:52 AM, kant

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Shixiong(Ryan) Zhu
rs for me to > reproduce the error so I can get back as early as possible. > > Thanks a lot! > > On Mon, Oct 31, 2016 at 12:41 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Then it should not be a Receiver issue. Could you use `jstack` to find >> ou

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Shixiong(Ryan) Zhu
mode). > > Thanks! > > On Mon, Oct 31, 2016 at 12:32 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Sorry, there is a typo in my previous email: this may **not** be the >> root cause if the leak threads are in the driver side. >> >> Doe

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Shixiong(Ryan) Zhu
cs.oracle.com/javase/8/docs/api/java/lang/Thread.html doesn't seem > to have any method where I can clean up the threads created during OnStart. > any ideas? > > Thanks! > > > On Mon, Oct 31, 2016 at 11:58 AM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: >

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Shixiong(Ryan) Zhu
So in your code, each Receiver will start a new thread. Did you stop the receiver properly in `Receiver.onStop`? Otherwise, you may leak threads after a receiver crashes and is restarted by Spark. However, this may not be the root cause if the leaked threads are on the driver side. Could you use
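
A sketch of a receiver that owns its thread and joins it in onStop (one common shape for this; the polling loop is illustrative):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      @volatile private var receiverThread: Thread = _

      def onStart(): Unit = {
        receiverThread = new Thread("my-receiver") {
          override def run(): Unit = {
            // isStopped() flips once stop is requested, letting the loop exit
            while (!isStopped()) { store("data") }
          }
        }
        receiverThread.start()
      }

      def onStop(): Unit = {
        if (receiverThread != null) receiverThread.join() // don't leak the thread
      }
    }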

Re: Map with state keys serialization

2016-10-12 Thread Shixiong(Ryan) Zhu
> wrote: > I tried with 1.6.2 and saw the same behavior. > > -Joey > > On Tue, Oct 11, 2016 at 5:18 PM, Shixiong(Ryan) Zhu > <shixi...@databricks.com> wrote: > > There are some known issues in 1.6.0, e.g., > > https://issues.apache.org/jira/browse/SPARK-12591

Re: Map with state keys serialization

2016-10-11 Thread Shixiong(Ryan) Zhu
> Kryo. Also, this is with 1.6.0 so if this behavior changed/got fixed > > in a later release let me know. > > > > -Joey > > > > On Mon, Oct 10, 2016 at 9:54 AM, Shixiong(Ryan) Zhu > > <shixi...@databricks.com> wrote: > >> That's enough. Did you see a

Re: Map with state keys serialization

2016-10-10 Thread Shixiong(Ryan) Zhu
Conf and I registered the class. Is there a different > configuration setting for the state map keys? > > Thanks! > > -Joey > > On Sun, Oct 9, 2016 at 10:58 PM, Shixiong(Ryan) Zhu > <shixi...@databricks.com> wrote: > > You can use Kryo. It also implements KryoSerializa

Re: Unresponsive Spark Streaming UI in YARN cluster mode - 1.5.2

2016-07-08 Thread Shixiong(Ryan) Zhu
Hey Tom, Could you provide all blocked threads? Perhaps due to some potential deadlock. On Fri, Jul 8, 2016 at 10:30 AM, Ellis, Tom (Financial Markets IT) < tom.el...@lloydsbanking.com.invalid> wrote: > Hi There, > > > > We’re currently using HDP 2.3.4, Spark 1.5.2 with a Spark Streaming job in

Re: Spark and Kafka direct approach problem

2016-05-04 Thread Shixiong(Ryan) Zhu
It's because the Scala version of Spark and the Scala version of Kafka don't match. Please check them. On Wed, May 4, 2016 at 6:17 AM, أنس الليثي wrote: > NoSuchMethodError usually appears because of a difference in the library > versions. > > Check the version of the

Re: how to deploy new code with checkpointing

2016-04-11 Thread Shixiong(Ryan) Zhu
You cannot. Streaming doesn't support it because code changes will break Java serialization. On Mon, Apr 11, 2016 at 4:30 PM, Siva Gudavalli wrote: > hello, > > i am writing a spark streaming application to read data from kafka. I am > using no receiver approach and enabled

Re: Can not kill driver properly

2016-03-21 Thread Shixiong(Ryan) Zhu
Could you post the log of Master? On Mon, Mar 21, 2016 at 9:25 AM, Hao Ren wrote: > Update: > > I am using --supervise flag for fault tolerance. > > > > On Mon, Mar 21, 2016 at 4:16 PM, Hao Ren wrote: > >> Using spark 1.6.1 >> Spark Streaming Jobs are

Re: Docker configuration for akka spark streaming

2016-03-14 Thread Shixiong(Ryan) Zhu
Could you use netstat to show the ports that the driver is listening on? On Mon, Mar 14, 2016 at 1:45 PM, David Gomez Saavedra wrote: > hi everyone, > > I'm trying to set up spark streaming using akka with a similar example of > the word count provided. When using spark master

Re: Spark Streaming 1.6 mapWithState not working well with Kryo Serialization

2016-03-02 Thread Shixiong(Ryan) Zhu
See https://issues.apache.org/jira/browse/SPARK-12591 After applying the patch, it should work. However, if you want to enable "registrationRequired", you still need to register "org.apache.spark.streaming.util.OpenHashMapBasedStateMap", "org.apache.spark.streaming.util.EmptyStateMap" and
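
A sketch of registering those internal classes by name, since they are private to Spark (the list here is only what the preview shows; the original message may name more):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        Class.forName("org.apache.spark.streaming.util.OpenHashMapBasedStateMap"),
        Class.forName("org.apache.spark.streaming.util.EmptyStateMap")
      ))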

Re: spark streaming

2016-03-02 Thread Shixiong(Ryan) Zhu
Hey, KafkaUtils.createDirectStream doesn't need a StorageLevel as it doesn't store blocks to BlockManager. However, the error is not related to StorageLevel. It may be a bug. Could you provide more info about it? E.g., Spark version, your codes, logs. On Wed, Mar 2, 2016 at 3:02 AM, Vinti

Re: Spark executor killed without apparent reason

2016-03-01 Thread Shixiong(Ryan) Zhu
Could you search "OutOfMemoryError" in the executor logs? It could be "OufOfMemoryError: Direct Buffer Memory" or something else. On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel wrote: > Hi, > > We are using spark 1.5.2 or yarn. We have a spark application utilizing > about

Re: Converting array to DF

2016-03-01 Thread Shixiong(Ryan) Zhu
For an Array, you need to call `toSeq` first. Scala can convert Array to ArrayOps automatically; however, that's not a `Seq`, so you need to call `toSeq` explicitly. On Tue, Mar 1, 2016 at 1:02 AM, Ashok Kumar wrote: > Thank you sir > > This works OK > import
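
A minimal sketch of the fix:

    import spark.implicits._

    val arr = Array(("a", 1), ("b", 2))
    // Array is not a Seq, so convert explicitly before toDF
    val df = arr.toSeq.toDF("key", "value")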

Re: Using a non serializable third party JSON serializable on a spark worker node throws NotSerializableException

2016-03-01 Thread Shixiong(Ryan) Zhu
lared inside a companion object of a case class. > > The problem is that Spark will still try to serialize the method, as it > needs to execute on the worker. How will that change the fact that > `EncodeJson[T]` is not serializable? > > > On Tue, Mar 1, 2016, 21:12 Shixiong(Ryan) Zhu

Re: Using a non serializable third party JSON serializable on a spark worker node throws NotSerializableException

2016-03-01 Thread Shixiong(Ryan) Zhu
Don't know where "argonaut.EncodeJson$$anon$2" comes from. However, you can always put your code into a method of an "object", then just call it like a Java static method. On Tue, Mar 1, 2016 at 10:30 AM, Yuval.Itzchakov wrote: > I have a small snippet of code which relays
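
A sketch of that suggestion (names are hypothetical): because the call resolves like a static method, the closure captures no non-serializable encoder instance.

    object JsonEncoders {
      // Lives in each executor JVM; never shipped with the closure
      def encode(value: Map[String, String]): String =
        value.map { case (k, v) => "\"" + k + "\":\"" + v + "\"" }
          .mkString("{", ",", "}")
    }

    // usage inside a transformation:
    // rdd.map(m => JsonEncoders.encode(m))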

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-29 Thread Shixiong(Ryan) Zhu
ype of operation that we can perform > inside mapWithState ? > > Really need to resolve this one as currently if my application is > restarted from checkpoint it has to repartition 120 previous stages which > takes hell lot of time. > > Thanks !! > Abhi > > On Mon, Feb

Re: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Shixiong(Ryan) Zhu
Do you mean you cannot access Master UI after your application completes? Could you check the master log? On Mon, Feb 29, 2016 at 3:48 PM, Sumona Routh wrote: > Hi there, > I've been doing some performance tuning of our Spark application, which is > using Spark 1.2.1

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-28 Thread Shixiong(Ryan) Zhu
This is because the Snappy library cannot load the native library. Did you forget to install the snappy native library in your new machines? On Fri, Feb 26, 2016 at 11:05 PM, Abhishek Anand wrote: > Any insights on this ? > > On Fri, Feb 26, 2016 at 1:21 PM, Abhishek

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-28 Thread Shixiong(Ryan) Zhu
> Thanks > Abhi > > On Tue, Feb 23, 2016 at 2:36 AM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: >> Hey Abhi, >> >> Using reducebykeyandwindow and mapWithState will trigger the bug >> in SPARK-6847. Here is a workaround to trigger c

Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Shixiong(Ryan) Zhu
You can use `DStream.map` to transform objects to anything you want. On Thu, Feb 25, 2016 at 11:06 AM, Mohammad Tariq wrote: > Hi group, > > I have just started working with confluent platform and spark streaming, > and was wondering if it is possible to access individual

Re: PLease help: installation of spark 1.6.0 on ubuntu fails

2016-02-25 Thread Shixiong(Ryan) Zhu
Please use Java 7 instead. On Thu, Feb 25, 2016 at 1:54 PM, Marco Mistroni wrote: > Hello all > could anyone help? > i have tried to install spark 1.6.0 on ubuntu, but the installation failed > Here are my steps > > 1. download spark (successful) > > 31 wget

Re: Spark Streaming not reading input coming from the other ip

2016-02-22 Thread Shixiong(Ryan) Zhu
What's the error info reported by Streaming? And could you use "telnet" to test if the network is normal? On Mon, Feb 22, 2016 at 6:59 AM, Vinti Maheshwari wrote: > For reference, my program: > > def main(args: Array[String]): Unit = { > val conf = new

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Shixiong(Ryan) Zhu
urrently I am using reducebykeyandwindow without the inverse function and > I am able to get the correct data. But, issue the might arise is when I > have to restart my application from checkpoint and it repartitions and > computes the previous 120 partitions, which delays the incoming batch

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Shixiong(Ryan) Zhu
=> rdd.count()) On Mon, Feb 22, 2016 at 12:25 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Fix for SPARK-6847 is not in branch-1.6 > > Should the fix be ported to branch-1.6 ? > > Thanks > > On Feb 22, 2016, at 11:55 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com>

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Shixiong(Ryan) Zhu
Hey Abhi, Could you post how you use mapWithState? By default, it should do checkpointing every 10 batches. However, there is a known issue that prevents mapWithState from checkpointing in some special cases: https://issues.apache.org/jira/browse/SPARK-6847 On Mon, Feb 22, 2016 at 5:47 AM,

Re: Access to broadcasted variable

2016-02-19 Thread Shixiong(Ryan) Zhu
The broadcasted object is serialized in the driver and sent to the executors. In the executor, the bytes are deserialized to get the broadcasted object. On Fri, Feb 19, 2016 at 5:54 AM, jeff saremi wrote: > could someone please comment on this? thanks > >
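
The round trip in miniature (a sketch):

    val lookup = sc.broadcast(Map("a" -> 1)) // serialized once on the driver
    val counts = rdd.map { word =>
      // .value deserializes (and caches) the object on the executor
      lookup.value.getOrElse(word, 0)
    }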
