[Spark2] huge BloomFilters
Hi, I need to build huge BloomFilter with 150 millions or even more insertions import org.apache.spark.util.sketch.BloomFilter val bf = spark.read.avro("/hdfs/path").filter("some == 1").stat.bloomFilter("id", 15000, 0.01) if I use keys for serialization implicit val bfEncoder = org.apache.spark.sql.Encoders.kryo[BloomFilter] And then try to save this filter in hdfs the size of this bloom filter is more than 1G. Is there any way to compress BloomFilter? Do anybody have an experience with such a huge bloom filters? In general I need to check some condition in Spark-streaming application. I was thinking to use BloomFilters for that. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark2-huge-BloomFilters-tp27991.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Run spark-shell inside Docker container against remote YARN cluster
Hi, May be someone already had experience to build docker image for spark? I want to build docker image with spark inside but configured against remote YARN cluster. I have already created image with spark 1.6.2 inside. But when I run spark-shell --master yarn --deploy-mode client --driver-memory 32G --executor-memory 32G --executor-cores 8 inside docker I get the following exception Diagnostics: java.io.FileNotFoundException: File file:/usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.2.0.jar does not exist Any suggestions? Do I need to load spark-assembly i HDFS and set spark.yarn.jar=hdfs://spark-assembly-1.6.2-hadoop2.2.0.jar ? Here is my Dockerfile https://gist.github.com/ponkin/cac0a071e7fe75ca7c390b7388cf4f91 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Run-spark-shell-inside-Docker-container-against-remote-YARN-cluster-tp27967.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Spark 2.0 regression when querying very wide data frames
I generated CSV file with 300 columns, and it seems to work fine with Spark Dataframes(Spark 2.0). I think you need to post your issue in spark-cassandra-connector community (https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user) - if you are using it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-regression-when-querying-very-wide-data-frames-tp27567p27572.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Spark 2.0 regression when querying very wide data frames
Did you try to load wide, for example, CSV file or Parquet? May be the problem is in spark-cassandra-connector not Spark itself? Are you using spark-cassandra-connector(https://github.com/datastax/spark-cassandra-connector)? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-regression-when-querying-very-wide-data-frames-tp27567p27571.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Spark 2.0 regression when querying very wide data frames
Hi, What kind of datasource do you have? CSV, Avro, Parquet? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-regression-when-querying-very-wide-data-frames-tp27567p27569.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
[KafkaRDD]: rdd.cache() does not seem to work
Hi, Here is my use case : I have kafka topic. The job is fairly simple - it reads topic and save data to several hdfs paths. I create rdd with the following code val r = KafkaUtils.createRDD[Array[Byte],Array[Byte],DefaultDecoder,DefaultDecoder](context,kafkaParams,range) Then I am trying to cache that rdd with r.cache() and then save this rdd to several hdfs locations. But it seems that KafkaRDD is fetching data from kafka broker every time I call saveAsNewAPIHadoopFile. How can I cache data from Kafka in memory? P.S. When I do repartition add it seems to work properly( read kafka only once) but spark store shuffled data localy. Is it possible to keep data in memory? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KafkaRDD-rdd-cache-does-not-seem-to-work-tp25936.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
[streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?
HI, When I create stream with KafkaUtils.createDirectStream I can explicitly define the position "largest" or "smallest" - where to read topic from. What if I have previous checkpoints( in HDFS for example) with offsets, and I want to start reading from the last checkpoint? In source code of KafkaUtils.createDirectStream I see the following val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase) (for { topicPartitions <- kc.getPartitions(topics).right leaderOffsets <- (if (reset == Some("smallest")) { kc.getEarliestLeaderOffsets(topicPartitions) } else { kc.getLatestLeaderOffsets(topicPartitions) }).right ... So it turns out that, I have no options to start reading from checkpoints(and offsets)? Am I right? How can I force Spark to start reading from saved offesets(in checkpoints)? Is it possible at all or I need to store offsets in external datastore? Alexey Ponkin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/streaming-KafkaUtils-createDirectStream-how-to-start-streming-from-checkpoints-tp25461.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
[streaming] reading Kafka direct stream throws kafka.common.OffsetOutOfRangeException
Hi I have simple spark-streaming job(8 executors 1 core - on 8 node cluster) - read from Kafka topic( 3 brokers with 8 partitions) and save to Cassandra. The problem is that when I increase number of incoming messages in topic the job is starting to fail with kafka.common.OffsetOutOfRangeException. Job fails starting from 100 events per second. Thanks in advance - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: [streaming] DStream with window performance issue
Ok. Spark 1.4.1 on yarn Here is my application I have 4 different Kafka topics(different object streams) type Edge = (String,String) val a = KafkaUtils.createDirectStream[...](sc,"A",params).filter( nonEmpty ).map( toEdge ) val b = KafkaUtils.createDirectStream[...](sc,"B",params).filter( nonEmpty ).map( toEdge ) val c = KafkaUtils.createDirectStream[...](sc,"C",params).filter( nonEmpty ).map( toEdge ) val u = a union b union c val source = u.window(Seconds(600), Seconds(10)) val z = KafkaUtils.createDirectStream[...](sc,"Z",params).filter( nonEmpty ).map( toEdge ) val joinResult = source.rightOuterJoin( z ) joinResult.foreachRDD { rdd=> rdd.foreachPartition { partition => // save to result topic in kafka } } The 'window' function in the code above is constantly growing, no matter how many events appeared in corresponding kafka topics but if I change one line from val source = u.window(Seconds(600), Seconds(10)) to val partitioner = ssc.sparkContext.broadcast(new HashPartitioner(8)) val source = u.transform(_.partitionBy(partitioner.value) ).window(Seconds(600), Seconds(10)) Everything works perfect. Perhaps the problem was in WindowedDStream I forced to use PartitionerAwareUnionRDD( partitionBy the same partitioner ) instead of UnionRDD. Nonetheless I did not see any hints about such a bahaviour in doc. Is it a bug or absolutely normal behaviour? 08.09.2015, 17:03, "Cody Koeninger" <c...@koeninger.org>: > Can you provide more info (what version of spark, code example)? > > On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin <alexey.pon...@ya.ru> wrote: >> Hi, >> >> I have an application with 2 streams, which are joined together. >> Stream1 - is simple DStream(relativly small size batch chunks) >> Stream2 - is a windowed DStream(with duration for example 60 seconds) >> >> Stream1 and Stream2 are Kafka direct stream. >> The problem is that according to logs window operation is constantly >> increasing(> href="http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php;>screen). >> And also I see gap in pocessing window(> href="http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php;>screen) >> in logs there are no events in that period. >> So what is happen in that gap and why window is constantly insreasing? >> >> Thank you in advance >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[streaming] DStream with window performance issue
Hi, I have an application with 2 streams, which are joined together. Stream1 - is simple DStream(relativly small size batch chunks) Stream2 - is a windowed DStream(with duration for example 60 seconds) Stream1 and Stream2 are Kafka direct stream. The problem is that according to logs window operation is constantly increasing(http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php;>screen). And also I see gap in pocessing window(http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php;>screen) in logs there are no events in that period. So what is happen in that gap and why window is constantly insreasing? Thank you in advance - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[streaming] Using org.apache.spark.Logging will silently break task execution
Hi, I have the following code object MyJob extends org.apache.spark.Logging{ ... val source: DStream[SomeType] ... source.foreachRDD { rdd => logInfo(s"""+++ForEachRDD+++""") rdd.foreachPartition { partitionOfRecords => logInfo(s"""+++ForEachPartition+++""") } } I was expecting to see both log messages in job log. But unfortunately you will never see string '+++ForEachPartition+++' in logs, cause block foreachPartition will never execute. And also there is no error message or something in logs. I wonder is this a bug or known behavior? I know that org.apache.spark.Logging is DeveloperAPI, but why it is silently fails with no messages? What to use instead of org.apache.spark.Logging? in spark-streaming jobs? P.S. running spark 1.4.1 (on yarn) Thanks in advance - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[spark-streaming] New directStream API reads topic's partitions sequentially. Why?
Hi, I am trying to read kafka topic with new directStream method in KafkaUtils. I have Kafka topic with 8 partitions. I am running streaming job on yarn with 8 execuors with 1 core for each one. So noticed that spark reads all topic's partitions in one executor sequentially - this is obviously not what I want. I want spark to read all partitions in parallel. How can I achieve that? Thank you, in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-New-directStream-API-reads-topic-s-partitions-sequentially-Why-tp24577.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark shell and StackOverFlowError
Hi, Can not reproduce your error on Spark 1.2.1 . It is not enough information. What is your command line arguments wцру you starting spark-shell? what data are you reading? etc. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-shell-and-StackOverFlowError-tp24508p24531.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark executor OOM issue on YARN
Hi, Can you please post your stack trace with exceptions? and also command line attributes in spark-submit? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-executor-OOM-issue-on-YARN-tp24522p24530.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: query avro hive table in spark sql
Can you select something from this table using Hive? And also could you post your spark code which leads to this exception. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/query-avro-hive-table-in-spark-sql-tp24462p24468.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: error accessing vertexRDD
Check permission for user which runs spark-shell (Permission denied) - means that you do not have permissions to /tmp -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-accessing-vertexRDD-tp24466p24469.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: use GraphX with Spark Streaming
Hi, Sure you can. StreamingContext has property /def sparkContext: SparkContext/(see docs http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext ). Think about DStream - main abstraction in Spark Streaming, as a sequence of RDD. Each DStream can be transform as RDD with method transform(see docs http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.DStream ) . So you can use whatever you want depends on your problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/use-GraphX-with-Spark-Streaming-tp24418p24451.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Number of Partitions Recommendations
Hi Rahul, Where did you see such a recommendation? I personally define partitions with the following formula partitions = nextPrimeNumberAbove( K*(--num-executors * --executor-cores ) ) where nextPrimeNumberAbove(x) - prime number which is greater than x K - multiplicator to calculate start with 1 and encrease untill join perfomance start to degrade -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Number-of-Partitions-Recommendations-tp24022p24059.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How do we control output part files created by Spark job?
Hi, Did you try to reduce number of executors and cores? usually num-executors * executor-cores = number of parallel tasks, so you can reduce number of parallel tasks in command line like ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn-cluster \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ --executor-cores 1 \ --queue thequeue \ lib/spark-examples*.jar \ 10 for more details see https://spark.apache.org/docs/1.2.0/running-on-yarn.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649p23706.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org