Re: column expression in left outer join for DataFrame
Hi, thanks for your response. I modified my code as per your suggestion, but now I am getting a runtime error. Here's my code:

val df_1 = df.filter(df("event") === 0).select("country", "cnt")
val df_2 = df.filter(df("event") === 3).select("country", "cnt")

df_1.show()
// produces the following output:
// country  cnt
// tw       3000
// uk       2000
// us       1000

df_2.show()
// produces the following output:
// country  cnt
// tw       25
// uk       200
// us       95

val both = df_2.join(df_1, df_2("country") === df_1("country"), "left_outer")

I am getting the following error when executing the join statement: java.util.NoSuchElementException: next on empty iterator. This error seems to originate at DataFrame.join (line 133 in DataFrame.scala). The show() results above confirm that both dataframes have a column named "country" and that they are non-empty. I also tried the simpler join (i.e. df_2.join(df_1)) and got the same error. I would like to know what is wrong with the join statement above. Thanks

On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust mich...@databricks.com wrote:
You need to use `===`, so that you are constructing a column expression instead of evaluating the standard Scala equality method. Calling methods to access columns (i.e. df.country) is only supported in Python.

val join_df = df1.join(df2, df1("country") === df2("country"), "left_outer")

On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote:
Hi, I am trying to port some code that was working in Spark 1.2.0 to the latest version, Spark 1.3.0. This code involves a left outer join between two SchemaRDDs, which I am now trying to change to a left outer join between 2 DataFrames. I followed the example for the left outer join of DataFrames at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Here's my code, where df1 and df2 are the 2 dataframes I am joining on the country field:

val join_df = df1.join(df2, df1.country == df2.country, "left_outer")

But I got a compilation error that value country is not a member of sql.DataFrame. I also tried the following:

val join_df = df1.join(df2, df1("country") == df2("country"), "left_outer")

I got a compilation error that it is a Boolean whereas a Column is required. So what is the correct Column expression I need to provide for joining the 2 dataframes on a specific field? Thanks

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
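For readers following this thread, here is a minimal, self-contained sketch of the column-expression join being discussed, assuming a running SparkContext named sc and the Spark 1.3 DataFrame API (the sample values mirror the output shown above; everything else is illustrative):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// two small DataFrames sharing a "country" column
val df_1 = sc.parallelize(Seq(("tw", 3000), ("uk", 2000), ("us", 1000))).toDF("country", "cnt")
val df_2 = sc.parallelize(Seq(("tw", 25), ("uk", 200), ("us", 95))).toDF("country", "cnt")

// === builds a Column expression; plain == evaluates to a Boolean and does not compile here
val both = df_2.join(df_1, df_2("country") === df_1("country"), "left_outer")
both.show()

Note that the joined result carries both "country" columns; select them through the originating DataFrame, e.g. both.select(df_2("country"), df_2("cnt"), df_1("cnt")), to disambiguate.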
Re: Spark Application Hung
In production, I'd suggest having a high-availability cluster with a minimum of 3 nodes (data nodes in your case). Now let's examine your scenario:
- When you suddenly bring down one of the nodes which has 2 executors running on it, the node (DN2) will be holding your job's shuffle data or computed data needed for the next stages (this has the same effect as deleting Spark's local/work dir from DN1). The absence of this node will lead to the fetch failures you are seeing in the logs. But eventually, after retrying for some time, I believe it will recompute your whole pipeline on DN1.
Thanks Best Regards

On Wed, Mar 25, 2015 at 12:11 AM, Ashish Rawat ashish.ra...@guavus.com wrote:
Hi, We are observing a hung Spark application when one of the yarn datanodes (running multiple Spark executors) goes down.

*Setup details*:
- Spark: 1.2.1
- Hadoop: 2.4.0
- Spark Application Mode: yarn-client
- 2 datanodes (DN1, DN2)
- 6 spark executors (initially 3 executors on both DN1 and DN2; after rebooting DN2, this changes to 4 executors on DN1 and 2 executors on DN2)

*Scenario*: When one of the datanodes (DN2) is brought down, the application gets hung, with the Spark driver continuously showing the following warning:

15/03/24 12:39:26 WARN TaskSetManager: Lost task 5.0 in stage 232.0 (TID 37941, DN1): FetchFailed(null, shuffleId=155, mapId=-1, reduceId=5, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 155 at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384) at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380) at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) When DN2 is brought down, one executor gets launched on DN1. When DN2 is brought back up after 15mins, 2 executors get launched on it. All the executors (including the ones which got launched after DN2 comes back), keep showing the following errors: 15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 155, fetching them
Spark Performance -Hive or Hbase?
Hi, We have started R&D on Apache Spark to use its features such as Spark SQL and Spark Streaming. I have two pain points; can any of you address them?

1. Does Spark allow us to fetch updated items after an RDD has been mapped and a schema has been applied? Or do we have to redo the RDD mapping and apply the schema every time we run a query? In this case I am using HBase tables to map the RDD.
2. Does Spark SQL provide better performance when used with Hive or with HBase?

Thanks, Siddharth Ubale, Synchronized Communications #43, Velankani Tech Park, Block No. II, 3rd Floor, Electronic City Phase I, Bangalore – 560 100 Tel : +91 80 3202 4060 Web: www.syncoms.com London|Bangalore|Orlando we innovate, plan, execute, and transform the business
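For question 1, a hedged sketch of doing the RDD-to-schema mapping once and reusing it across queries, assuming the Spark 1.3 DataFrame API; hbaseRowRdd stands in for whatever RDD you already build from your HBase table, and all names are illustrative:

import org.apache.spark.sql.SQLContext

case class Event(rowKey: String, value: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = hbaseRowRdd.map(r => Event(r._1, r._2)).toDF()
df.registerTempTable("events")    // map the RDD and apply the schema once
sqlContext.cacheTable("events")   // keep it in memory for repeated queries

// later queries reuse the cached table instead of re-reading HBase,
// but they will not see rows written to HBase after the cache was built
sqlContext.sql("select count(*) from events").show()

So repeated queries do not need to redo the mapping, but picking up updated HBase rows does require re-reading (or un-caching and re-caching) the table.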
Re: spark worker on mesos slave | possible networking config issue
It says: ried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: localhost/ 127.0.0.1:51849 I'd suggest you changing this property: export SPARK_LOCAL_IP=127.0.0.1 Point it to your network address like 192.168.1.10 Thanks Best Regards On Tue, Mar 24, 2015 at 11:18 PM, Anirudha Jadhav aniru...@nyu.edu wrote: is there some setting i am missing: this is my spark-env.sh export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz export SPARK_LOCAL_IP=127.0.0.1 here is what i see on the slave node. less 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr WARNING: Logging before InitGoogleLogging() is written to STDERR I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI ' http://100.125.5.93/sparkx.tgz' I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading ' http://100.125.5.93/sparkx.tgz' to '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' into '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56' Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT] I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1 I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave 20150226-160708-78932-5050-8971-S0 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu) 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started 15/03/24 02:30:37 INFO Remoting: Starting remoting 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@mesos-si2:50542] 15/03/24 02:30:38 INFO Utils: Successfully started service 'sparkExecutor' on port 50542. 15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. 
Reason: Connection refused: localhost/127.0.0.1:51849 akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/), Path(/user/MapOutputTracker)] at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at
Re: Weird exception in Spark job
As it says, you have a jar conflict:

java.lang.NoSuchMethodError: org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(Ljava/util/concurrent/Executor;I)V

Verify your classpath and check which Netty versions are on it. Thanks Best Regards

On Tue, Mar 24, 2015 at 11:07 PM, nitinkak001 nitinkak...@gmail.com wrote: Any ideas on this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-exception-in-Spark-job-tp22195p22204.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
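If it helps to confirm where the conflicting class is coming from, here is a small diagnostic sketch that can be pasted into spark-shell on the driver (the class name is taken from the error above):

// print which jar the class was loaded from; getCodeSource can be null
// for classes on the boot classpath
val cls = Class.forName("org.jboss.netty.channel.socket.nio.NioWorkerPool")
val src = Option(cls.getProtectionDomain.getCodeSource).map(_.getLocation)
println(src.getOrElse("loaded from the boot classpath"))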
Server IPC version 9 cannot communicate with client version 4
Hi Sparkers, I am trying to load data in Spark with the following command:

sqlContext.sql("LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt' INTO TABLE src")

I am getting the exception below:

Server IPC version 9 cannot communicate with client version 4

Note: I am using Hadoop 2.2, Spark 1.2 and Hive 0.13
Re: Optimal solution for getting the header from CSV with Spark
The spark-csv package can handle the header row, and the code is at the link below. It can also use the header to infer field names in the schema. https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvRelation.scala

--- Original Message --- From: Dean Wampler deanwamp...@gmail.com Sent: March 24, 2015 9:19 AM To: Sean Owen so...@cloudera.com Cc: Spico Florin spicoflo...@gmail.com, user user@spark.apache.org Subject: Re: Optimal solution for getting the header from CSV with Spark

Good point. There's no guarantee that you'll get the actual first partition. One reason why I wouldn't allow a CSV header line in a real data file, if I could avoid it. Back to Spark, a safer approach is RDD.foreachPartition, which takes a function expecting an iterator. You'll only need to grab the first element (being careful that the partition isn't empty!) and then determine which of those first lines has the header info. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com

On Tue, Mar 24, 2015 at 11:12 AM, Sean Owen so...@cloudera.com wrote: I think this works in practice, but I don't know that the first block of the file is guaranteed to be in the first partition? Certainly later down the pipeline that won't be true, but presumably this is happening right after reading the file. I've always just written some filter that would only match the header, which assumes that this is possible to distinguish, but it usually is.

On Tue, Mar 24, 2015 at 2:41 PM, Dean Wampler deanwamp...@gmail.com wrote: Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to read the whole file, use data.take(1), which is simpler. From the RDD.take documentation, it works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. In this case, it will trivially stop at the first. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://polyglotprogramming.com

On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I would like to know what is the optimal solution for getting the header from a CSV file with Spark? My approach was:

def getHeader(data: RDD[String]): String = {
  data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString()
}

Thanks.
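For those who want to go the spark-csv route mentioned above, a short sketch assuming Spark 1.3's SQLContext.load and the spark-csv package on the classpath (the file path is illustrative):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// with header=true, spark-csv treats the first line as the header and uses it for field names
val df = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "cars.csv", "header" -> "true"))
df.printSchema()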
Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use
It means you already have 4 applications running on ports 4040, 4041, 4042 and 4043, and that's why this one was able to come up on 4044. You can run

netstat -pnat | grep 404

and see which processes are using those ports. Thanks Best Regards

On Wed, Mar 25, 2015 at 1:13 AM, Roy rp...@njit.edu wrote: I get the following message every time I run a Spark job:

15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use

The full trace is here: http://pastebin.com/xSvRN01f How do I fix this? I am on CDH 5.3.1. Thanks, Roy
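If the repeated bind warnings are a nuisance, one option (a sketch, not something suggested in this thread) is to pin each application's UI to a known free port instead of relying on the automatic 4040, 4041, ... fallback:

import org.apache.spark.{SparkConf, SparkContext}

// start this application's web UI on an explicit port; the default is 4040,
// and Spark retries 4041, 4042, ... when the port is already taken
val conf = new SparkConf()
  .setAppName("MyApp")            // illustrative name
  .set("spark.ui.port", "4050")
val sc = new SparkContext(conf)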
Re: Optimal solution for getting the header from CSV with Spark
Hello! Thanks for your responses. I was afraid that due to partitioning I would lose the guarantee that the first element is the header. I see that rdd.first internally calls rdd.take(1). Best regards, Florin

On Tue, Mar 24, 2015 at 4:41 PM, Dean Wampler deanwamp...@gmail.com wrote: Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to read the whole file, use data.take(1), which is simpler. From the RDD.take documentation, it works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. In this case, it will trivially stop at the first. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com

On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I would like to know what is the optimal solution for getting the header from a CSV file with Spark? My approach was:

def getHeader(data: RDD[String]): String = {
  data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString()
}

Thanks.
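A compact version of the approach settled on in this thread, as a sketch (sc is a SparkContext; the path is illustrative):

val data = sc.textFile("hdfs:///path/to/file.csv")
val header = data.first()            // first() is implemented via take(1), so only one partition is scanned
val rows = data.filter(_ != header)  // assumes no data line is byte-for-byte identical to the header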
Explanation streaming-cep-engine with example
Hi, Can someone explain how the Spark streaming CEP engine works, and how to use it, with a sample example? http://spark-packages.org/package/Stratio/streaming-cep-engine -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Explanation-streaming-cep-engine-with-example-tp22218.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Server IPC version 9 cannot communicate with client version 4
It looks like you have to build Spark against the matching Hadoop version, otherwise you will hit the exception you mentioned. You could follow this doc: http://spark.apache.org/docs/latest/building-spark.html

2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com: Hi Sparkers, I am trying to load data in Spark with the following command:

sqlContext.sql("LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt' INTO TABLE src")

I am getting the exception below:

Server IPC version 9 cannot communicate with client version 4

Note: I am using Hadoop 2.2, Spark 1.2 and Hive 0.13
Spark Maven Test error
I use command to run Unit test, as follow: ./make-distribution.sh --tgz --skip-java-test -Pscala-2.10 -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn -Dyarn.version=2.3.0-cdh5.1.2 -Dhadoop.version=2.3.0-cdh5.1.2 mvn -Pscala-2.10 -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn -Dyarn.version=2.3.0-cdh5.1.2 -Dhadoop.version=2.3.0-cdh5.1.2 -pl core,launcher,network/common,network/shuffle,sql/core,sql/catalyst,sql/hive -DwildcardSuites=org.apache.spark.sql.JoinSuite test Error occur: --- T E S T S --- Running org.apache.spark.network.ProtocolSuite log4j:WARN No appenders could be found for logger (io.netty.util.internal.logging.InternalLoggerFactory). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.251 sec - in org.apache.spark.network.ProtocolSuite Running org.apache.spark.network.RpcIntegrationSuite Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.54 sec - in org.apache.spark.network.RpcIntegrationSuite Running org.apache.spark.network.ChunkFetchIntegrationSuite Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.34 sec - in org.apache.spark.network.ChunkFetchIntegrationSuite Running org.apache.spark.network.sasl.SparkSaslSuite Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.103 sec - in org.apache.spark.network.sasl.SparkSaslSuite Running org.apache.spark.network.TransportClientFactorySuite Tests run: 5, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 11.42 sec FAILURE! - in org.apache.spark.network.TransportClientFactorySuite reuseClientsUpToConfigVariable(org.apache.spark.network.TransportClientFactorySuite) Time elapsed: 2.332 sec ERROR! java.lang.IllegalStateException: failed to create a child event loop at sun.nio.ch.IOUtil.makePipe(Native Method) at sun.nio.ch.EPollSelectorImpl.init(EPollSelectorImpl.java:65) at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36) at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:126) at io.netty.channel.nio.NioEventLoop.init(NioEventLoop.java:120) at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87) at io.netty.util.concurrent.MultithreadEventExecutorGroup.init(MultithreadEventExecutorGroup.java:64) at io.netty.channel.MultithreadEventLoopGroup.init(MultithreadEventLoopGroup.java:49) at io.netty.channel.nio.NioEventLoopGroup.init(NioEventLoopGroup.java:61) at io.netty.channel.nio.NioEventLoopGroup.init(NioEventLoopGroup.java:52) at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56) at org.apache.spark.network.client.TransportClientFactory.init(TransportClientFactory.java:104) at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:76) at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:80) at org.apache.spark.network.TransportClientFactorySuite.testClientReuse(TransportClientFactorySuite.java:86) at org.apache.spark.network.TransportClientFactorySuite.reuseClientsUpToConfigVariable(TransportClientFactorySuite.java:131) reuseClientsUpToConfigVariableConcurrent(org.apache.spark.network.TransportClientFactorySuite) Time elapsed: 2.279 sec ERROR! 
java.lang.IllegalStateException: failed to create a child event loop at sun.nio.ch.IOUtil.makePipe(Native Method) at sun.nio.ch.EPollSelectorImpl.init(EPollSelectorImpl.java:65) at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36) at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:126) at io.netty.channel.nio.NioEventLoop.init(NioEventLoop.java:120) at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87) at io.netty.util.concurrent.MultithreadEventExecutorGroup.init(MultithreadEventExecutorGroup.java:64) at io.netty.channel.MultithreadEventLoopGroup.init(MultithreadEventLoopGroup.java:49) at io.netty.channel.nio.NioEventLoopGroup.init(NioEventLoopGroup.java:61) at io.netty.channel.nio.NioEventLoopGroup.init(NioEventLoopGroup.java:52) at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56) at org.apache.spark.network.client.TransportClientFactory.init(TransportClientFactory.java:104) at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:76) at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:80)
Exception in thread main java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$PriorityProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
I compile spark-1.3.0 on Hadoop 2.3.0-cdh5.1.0 with protoc 2.5.0. But when I try to run the examples, it throws: Exception in thread main java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$PriorityProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:800) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2532) at java.lang.Class.getConstructor0(Class.java:2842)at java.lang.Class.getConstructor(Class.java:1718) at org.apache.hadoop.yarn.factories.impl.pb.RecordFactoryPBImpl.newRecordInstance(RecordFactoryPBImpl.java:62) at org.apache.hadoop.yarn.util.Records.newRecord(Records.java:36) at org.apache.hadoop.yarn.api.records.Priority.newInstance(Priority.java:39) at org.apache.hadoop.yarn.api.records.Priority.(Priority.java:34) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.(YarnSparkHadoopUtil.scala:101) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.(YarnSparkHadoopUtil.scala) at org.apache.spark.deploy.yarn.ClientArguments.(ClientArguments.scala:38) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:646) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Hey, guys. It may be caused by the proto version unmatched. Can anyone tell me what should I do to avoid this problem? I did not find any information about this. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Exception-in-thread-main-java-lang-VerifyError-class-org-apache-hadoop-yarn-proto-YarnProtos-Priorit-tp22217.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Server IPC version 9 cannot communicate with client version 4
Is your Spark compiled against Hadoop 2.2? If not, download the Spark 1.2 binary built for Hadoop 2.2 from https://spark.apache.org/downloads.html Thanks Best Regards

On Wed, Mar 25, 2015 at 12:52 PM, sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, I am trying to load data in Spark with the following command:

sqlContext.sql("LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt' INTO TABLE src")

I am getting the exception below:

Server IPC version 9 cannot communicate with client version 4

Note: I am using Hadoop 2.2, Spark 1.2 and Hive 0.13
Serialization Problem in Spark Program
Hi experts, I wrote a very simple Spark program to test the Kryo serialization function. The code is as follows:

object TestKryoSerialization {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryo.registrationRequired", "true") // I use this statement to force checking registration.
    conf.registerKryoClasses(Array(classOf[MyObject]))
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("hdfs://dhao.hrb:8020/user/spark/tstfiles/charset/A_utf8.txt")
    val objs = rdd.map(new MyObject(_, 1)).collect()
    for (x <- objs) {
      x.printMyObject
    }
  }
}

The class MyObject is also a very simple class, which is only used to test the serialization function:

class MyObject {
  var myStr: String = ""
  var myInt: Int = 0
  def this(inStr: String, inInt: Int) {
    this()
    this.myStr = inStr
    this.myInt = inInt
  }
  def printMyObject {
    println("MyString is : " + myStr + "\tMyInt is : " + myInt)
  }
}

But when I ran the application, it reported the following error:

java.lang.IllegalArgumentException: Class is not registered: dhao.test.Serialization.MyObject[] Note: To register this class use: kryo.register(dhao.test.Serialization.MyObject[].class); at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442) at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

I don't understand what causes this problem. I have used conf.registerKryoClasses to register my class. Could anyone help me? Thanks. By the way, the Spark version is 1.3.0.
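One thing worth noting (an educated guess from the error text, not confirmed in this thread): the message complains about the array type dhao.test.Serialization.MyObject[], not about MyObject itself. With spark.kryo.registrationRequired set to true, the array class generally has to be registered too:

// register both the element class and its array class; the error names
// MyObject[] (i.e. Array[MyObject]) as the unregistered type
conf.registerKryoClasses(Array(classOf[MyObject], classOf[Array[MyObject]]))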
OutOfMemoryError when using DataFrame created by Spark SQL
Hi, I am using *Spark SQL* to query my *Hive cluster*, following the Spark SQL and DataFrame Guide https://spark.apache.org/docs/latest/sql-programming-guide.html step by step. However, my HiveQL via sqlContext.sql() fails and a java.lang.OutOfMemoryError is raised. The expected result of the query is small (it has a limit 1000 clause). My code is shown below:

scala> import sqlContext.implicits._
scala> val df = sqlContext.sql("select * from some_table where logdate=2015-03-24 limit 1000")

and the error msg:

[ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27] [ActorSystem(sparkDriver)] Uncaught fatal error from thread [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: GC overhead limit exceeded

The master heap memory is set by -Xms512m -Xmx512m, while the workers are set by -Xms4096M -Xmx4096M, which I presume is sufficient for this trivial query. Additionally, after restarting the spark-shell and re-running the query with limit 5, the df object is returned and can be printed by df.show(), but other APIs fail with OutOfMemoryError, namely df.count(), df.select("some_field").show() and so forth. I understand that the RDD can be collected to the master so that further transformations can be applied. As DataFrames have “richer optimizations under the hood”, and coming from the conventions of an R/Julia user, I really hope this error can be tackled and that DataFrame is robust enough to depend on. Thanks in advance! REGARDS, Todd
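Giving the driver more heap is the usual first step here; as an alternative that may or may not avoid the error, a sketch of keeping the bounded result off the 512m driver entirely (Spark 1.3 API assumed; the output path is illustrative):

val df = sqlContext.sql("select * from some_table where logdate=2015-03-24 limit 1000")
df.show()                                        // show() only brings a handful of rows to the driver
df.saveAsParquetFile("/tmp/some_table_sample")   // persist the result instead of collecting it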
Re: Server IPC version 9 cannot communicate with client version 4
Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh I am running the below command in spark/yarn directory where pom.xml file is available mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Please correct me if i am wrong. On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Looks like you have to build Spark with related Hadoop version, otherwise you will meet exception as mentioned. you could follow this doc: http://spark.apache.org/docs/latest/building-spark.html 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com: Hi Sparkers, I am trying to load data in spark with the following command *sqlContext.sql(LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt ' INTO TABLE src);* *Getting exception below* *Server IPC version 9 cannot communicate with client version 4* NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13
Re: OutOfMemoryError when using DataFrame created by Spark SQL
Can you try giving Spark driver more heap ? Cheers On Mar 25, 2015, at 2:14 AM, Todd Leo sliznmail...@gmail.com wrote: Hi, I am using Spark SQL to query on my Hive cluster, following Spark SQL and DataFrame Guide step by step. However, my HiveQL via sqlContext.sql() fails and java.lang.OutOfMemoryError was raised. The expected result of such query is considered to be small (by adding limit 1000 clause). My code is shown below: scala import sqlContext.implicits._ scala val df = sqlContext.sql(select * from some_table where logdate=2015-03-24 limit 1000) and the error msg: [ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27] [ActorSystem(sparkDriver)] Uncaught fatal error from thread [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: GC overhead limit exceeded the master heap memory is set by -Xms512m -Xmx512m, while workers set by -Xms4096M -Xmx4096M, which I presume sufficient for this trivial query. Additionally, after restarted the spark-shell and re-run the limit 5 query , the df object is returned and can be printed by df.show(), but other APIs fails on OutOfMemoryError, namely, df.count(), df.select(some_field).show() and so forth. I understand that the RDD can be collected to master hence further transmutations can be applied, as DataFrame has “richer optimizations under the hood” and the convention from an R/julia user, I really hope this error is able to be tackled, and DataFrame is robust enough to depend. Thanks in advance! REGARDS, Todd View this message in context: OutOfMemoryError when using DataFrame created by Spark SQL Sent from the Apache Spark User List mailing list archive at Nabble.com.
issue while submitting Spark Job as --master yarn-cluster
Hi, when I submit a Spark job in cluster mode I get the error below in the hadoop-yarn log. Does anyone have any idea? Please suggest.

2015-03-25 23:35:22,467 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1427124496008_0028 State change from FINAL_SAVING to FAILED
2015-03-25 23:35:22,467 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1427124496008_0028 failed 2 times due to AM Container for appattempt_1427124496008_0028_02 exited with exitCode: 13 due to: Exception from container-launch. Container id: container_1427124496008_0028_02_01 Exit code: 13 Stack trace: ExitCodeException exitCode=13: at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Container exited with a non-zero exit code 13 .Failing this attempt.. Failing the application. APPID=application_1427124496008_0028

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issue-while-submitting-Spark-Job-as-master-yarn-cluster-tp0.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
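The log above does not show the underlying cause, but a frequent reason for AM exit code 13 in yarn-cluster mode is a master hard-coded inside the application that conflicts with --master yarn-cluster. A hedged sketch of what to check in the driver code:

import org.apache.spark.{SparkConf, SparkContext}

// if the application does something like
//   new SparkConf().setAppName("MyApp").setMaster("local[*]")
// the ApplicationMaster typically exits with code 13 under yarn-cluster.
// Leave the master to spark-submit instead:
val conf = new SparkConf().setAppName("MyApp")   // no setMaster(...) here
val sc = new SparkContext(conf)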
Re: Server IPC version 9 cannot communicate with client version 4
Just run : mvn -Pyarn -Phadoop-2.4 -D*hadoop.version=2.2* -DskipTests clean package Thanks Best Regards On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com wrote: Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh I am running the below command in spark/yarn directory where pom.xml file is available mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Please correct me if i am wrong. On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Looks like you have to build Spark with related Hadoop version, otherwise you will meet exception as mentioned. you could follow this doc: http://spark.apache.org/docs/latest/building-spark.html 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com: Hi Sparkers, I am trying to load data in spark with the following command *sqlContext.sql(LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt ' INTO TABLE src);* *Getting exception below* *Server IPC version 9 cannot communicate with client version 4* NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13
Spark-sql query got exception.Help
The query is fine when I run it against a small HDFS file, but when the HDFS file is 152 MB I get the exception below. I tried sc.setSystemProperty("spark.kryoserializer.buffer.mb", "256"), but the error persists.
```
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 39135
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at
```
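A sketch of setting the Kryo buffer sizes so that they actually reach the executors; the property names are the Spark 1.x ones and the sizes are illustrative. Setting a system property after the SparkContext already exists generally comes too late:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.kryoserializer.buffer.mb", "256")       // initial serialization buffer
  .set("spark.kryoserializer.buffer.max.mb", "512")   // upper bound the buffer may grow to, if your release supports it
val sc = new SparkContext(conf)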
How do you write Dataframes to elasticsearch
It seems that elasticsearch-spark_2.10 does not currently support Spark 1.3. Could you tell me if there is an alternative way to save DataFrames to Elasticsearch? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-write-Dataframes-to-elasticsearch-tp3.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
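One possible workaround, as a sketch: go through the DataFrame's rows as an RDD and index them with elasticsearch-hadoop's RDD support. This assumes the artifact providing org.elasticsearch.spark is on the classpath and es.nodes is configured; the field names and the index/type target are illustrative:

import org.elasticsearch.spark._

// turn each Row into a Map and index it; saveToEs comes from the
// org.elasticsearch.spark implicits on RDDs
val docs = df.map { row =>
  Map("country" -> row.getString(0), "cnt" -> row.getInt(1))
}
docs.saveToEs("myindex/mytype")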
Re: Server IPC version 9 cannot communicate with client version 4
Oh, in that case you should mention 2.4, If you don't want to compile spark, then you can download the precompiled version from Downloads page https://spark.apache.org/downloads.html. http://d3kbcqa49mib13.cloudfront.net/spark-1.2.0-bin-hadoop2.4.tgz Thanks Best Regards On Wed, Mar 25, 2015 at 5:40 PM, sandeep vura sandeepv...@gmail.com wrote: *I am using hadoop 2.4 should i mention -Dhadoop.version=2.2* *$ hadoop version* *Hadoop 2.4.1* *Subversion http://svn.apache.org/repos/asf/hadoop/common http://svn.apache.org/repos/asf/hadoop/common -r 1604318* *Compiled by jenkins on 2014-06-21T05:43Z* *Compiled with protoc 2.5.0* *From source with checksum bb7ac0a3c73dc131f4844b873c74b630* *This command was run using /home/hadoop24/hadoop-2.4.1/share/hadoop/common/hadoop-common-2.4.1.jar* On Wed, Mar 25, 2015 at 5:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: -D*hadoop.version=2.2* Thanks Best Regards On Wed, Mar 25, 2015 at 5:34 PM, sandeep vura sandeepv...@gmail.com wrote: Build failed with following errors. I have executed the below following command. * mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package* [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 2:11:59.461s [INFO] Finished at: Wed Mar 25 17:22:29 IST 2015 [INFO] Final Memory: 30M/440M [INFO] [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve dep endencies for project org.apache.spark:spark-core_2.10:jar:1.2.1: Could not find artifact org.apache.hadoop:hadoop-client:jar:VERSION in central ( https://repo1. maven.org/maven2) - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e swit ch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please rea d the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyReso lutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :spark-core_2.10 On Wed, Mar 25, 2015 at 3:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Just run : mvn -Pyarn -Phadoop-2.4 -D*hadoop.version=2.2* -DskipTests clean package Thanks Best Regards On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com wrote: Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh I am running the below command in spark/yarn directory where pom.xml file is available mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Please correct me if i am wrong. On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Looks like you have to build Spark with related Hadoop version, otherwise you will meet exception as mentioned. you could follow this doc: http://spark.apache.org/docs/latest/building-spark.html 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com: Hi Sparkers, I am trying to load data in spark with the following command *sqlContext.sql(LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt ' INTO TABLE src);* *Getting exception below* *Server IPC version 9 cannot communicate with client version 4* NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13
Re: Server IPC version 9 cannot communicate with client version 4
Of course, VERSION is supposed to be replaced by a real Hadoop version! On Wed, Mar 25, 2015 at 12:04 PM, sandeep vura sandeepv...@gmail.com wrote: Build failed with following errors. I have executed the below following command. mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 2:11:59.461s [INFO] Finished at: Wed Mar 25 17:22:29 IST 2015 [INFO] Final Memory: 30M/440M [INFO] [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve dep endencies for project org.apache.spark:spark-core_2.10:jar:1.2.1: Could not find artifact org.apache.hadoop:hadoop-client:jar:VERSION in central (https://repo1. maven.org/maven2) - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e swit ch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please rea d the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyReso lutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :spark-core_2.10 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark worker on mesos slave | possible networking config issue
is there a way to have this dynamically pick the local IP. static assignment does not work cos the workers are dynamically allocated on mesos On Wed, Mar 25, 2015 at 3:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It says: ried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: localhost/ 127.0.0.1:51849 I'd suggest you changing this property: export SPARK_LOCAL_IP=127.0.0.1 Point it to your network address like 192.168.1.10 Thanks Best Regards On Tue, Mar 24, 2015 at 11:18 PM, Anirudha Jadhav aniru...@nyu.edu wrote: is there some setting i am missing: this is my spark-env.sh export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz export SPARK_LOCAL_IP=127.0.0.1 here is what i see on the slave node. less 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr WARNING: Logging before InitGoogleLogging() is written to STDERR I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI ' http://100.125.5.93/sparkx.tgz' I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading ' http://100.125.5.93/sparkx.tgz' to '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' into '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56' Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT] I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1 I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave 20150226-160708-78932-5050-8971-S0 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu) 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started 15/03/24 02:30:37 INFO Remoting: Starting remoting 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@mesos-si2:50542] 15/03/24 02:30:38 INFO Utils: Successfully started service 'sparkExecutor' on port 50542. 15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. 
Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: localhost/127.0.0.1:51849 akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/), Path(/user/MapOutputTracker)] at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110) at
Re: spark worker on mesos slave | possible networking config issue
Remove SPARK_LOCAL_IP then? Thanks Best Regards On Wed, Mar 25, 2015 at 6:45 PM, Anirudha Jadhav aniru...@nyu.edu wrote: is there a way to have this dynamically pick the local IP. static assignment does not work cos the workers are dynamically allocated on mesos On Wed, Mar 25, 2015 at 3:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It says: ried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: localhost/ 127.0.0.1:51849 I'd suggest you changing this property: export SPARK_LOCAL_IP=127.0.0.1 Point it to your network address like 192.168.1.10 Thanks Best Regards On Tue, Mar 24, 2015 at 11:18 PM, Anirudha Jadhav aniru...@nyu.edu wrote: is there some setting i am missing: this is my spark-env.sh export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz export SPARK_LOCAL_IP=127.0.0.1 here is what i see on the slave node. less 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr WARNING: Logging before InitGoogleLogging() is written to STDERR I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI ' http://100.125.5.93/sparkx.tgz' I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading ' http://100.125.5.93/sparkx.tgz' to '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' into '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56' Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT] I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1 I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave 20150226-160708-78932-5050-8971-S0 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu) 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started 15/03/24 02:30:37 INFO Remoting: Starting remoting 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@mesos-si2:50542] 15/03/24 02:30:38 INFO Utils: Successfully started service 'sparkExecutor' on port 50542. 
15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: localhost/127.0.0.1:51849 akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/), Path(/user/MapOutputTracker)] at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) at
JavaKinesisWordCountASLYARN Example not working on EMR
Hi, I am trying to run a Spark on YARN program provided by Spark in the examples directory using Amazon Kinesis on EMR cluster : I am using Spark 1.3.0 and EMR AMI: 3.5.0 I've setup the Credentials export AWS_ACCESS_KEY_ID=XX export AWS_SECRET_KEY=XXX *A) This is the Kinesis Word Count Producer which ran Successfully : * run-example org.apache.spark.examples.streaming.KinesisWordCountProducerASL mySparkStream https://kinesis.us-east-1.amazonaws.com 1 5 *B) This one is the Normal Consumer using Spark Streaming which also ran Successfully: * run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASL mySparkStream https://kinesis.us-east-1.amazonaws.com *C) And this is the YARN based program which is NOT WORKING: * run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN mySparkStream https://kinesis.us-east-1.amazonaws.com\ Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/25 11:52:45 INFO spark.SparkContext: Running Spark version 1.3.0 15/03/25 11:52:45 WARN spark.SparkConf: SPARK_CLASSPATH was detected (set to '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'). This is deprecated in Spark 1.0+. Please instead use: • ./spark-submit with --driver-class-path to augment the driver classpath • spark.executor.extraClassPath to augment the executor classpath 15/03/25 11:52:45 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar' as a work-around. 15/03/25 11:52:45 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' to '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar' as a work-around. 15/03/25 11:52:46 INFO spark.SecurityManager: Changing view acls to: hadoop 15/03/25 11:52:46 INFO spark.SecurityManager: Changing modify acls to: hadoop 15/03/25 11:52:46 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 15/03/25 11:52:47 INFO slf4j.Slf4jLogger: Slf4jLogger started 15/03/25 11:52:48 INFO Remoting: Starting remoting 15/03/25 11:52:48 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ip-10-80-175-92.ec2.internal:59504] 15/03/25 11:52:48 INFO util.Utils: Successfully started service 'sparkDriver' on port 59504. 
15/03/25 11:52:48 INFO spark.SparkEnv: Registering MapOutputTracker 15/03/25 11:52:48 INFO spark.SparkEnv: Registering BlockManagerMaster 15/03/25 11:52:48 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/spark-120befbc-6dae-4751-b41f-dbf7b3d97616/blockmgr-d339d180-36f5-465f-bda3-cecccb23b1d3 15/03/25 11:52:48 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB 15/03/25 11:52:48 INFO spark.HttpFileServer: HTTP File server directory is /mnt/spark/spark-85e88478-3dad-4fcf-a43a-efd15166bef3/httpd-6115870a-0d90-44df-aa7c-a6bd1a47e107 15/03/25 11:52:48 INFO spark.HttpServer: Starting HTTP Server 15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT 15/03/25 11:52:49 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:44879 15/03/25 11:52:49 INFO util.Utils: Successfully started service 'HTTP file server' on port 44879. 15/03/25 11:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator 15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT 15/03/25 11:52:49 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 15/03/25 11:52:49 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 15/03/25 11:52:49 INFO ui.SparkUI: Started SparkUI at http://ip-10-80-175-92.ec2.internal:4040 15/03/25 11:52:50 INFO spark.SparkContext: Added JAR file:/home/hadoop/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar at http://10.80.175.92:44879/jars/spark-examples-1.3.0-hadoop2.4.0.jar with timestamp 1427284370358 15/03/25 11:52:50 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler 15/03/25 11:52:51 ERROR cluster.YarnClusterSchedulerBackend: Application ID is not set. 15/03/25 11:52:51 INFO netty.NettyBlockTransferService: Server created on 49982 15/03/25 11:52:51 INFO storage.BlockManagerMaster: Trying to register BlockManager 15/03/25 11:52:51 INFO storage.BlockManagerMasterActor: Registering block manager ip-10-80-175-92.ec2.internal:49982 with 265.4 MB RAM, BlockManagerId(, ip-10-80-175-92.ec2.internal, 49982) 15/03/25 11:52:51 INFO storage.BlockManagerMaster: Registered BlockManager *Exception in thread main java.lang.NullPointerException* *at
Re: NetworkWordCount + Spark standalone
You can open the Master UI running on port 8080 of your Ubuntu machine and after submitting the job, you can see how many cores are being used etc. from the UI. Thanks Best Regards On Wed, Mar 25, 2015 at 6:50 PM, James King jakwebin...@gmail.com wrote: Thanks Akhil, Yes indeed this is why it works when using local[2], but I'm unclear on why it doesn't work when using standalone daemons? Is there a way to check what cores are being seen when running against standalone daemons? I'm running the master and worker on the same Ubuntu host. The Driver program is running from a Windows machine. On the Ubuntu host the command cat /proc/cpuinfo | grep processor | wc -l gives 2. On the Windows machine it is: NumberOfCores=2 NumberOfLogicalProcessors=4 On Wed, Mar 25, 2015 at 2:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Spark Streaming requires you to have a minimum of 2 cores, 1 for receiving your data and the other for processing. So when you say local[2] it basically initializes 2 threads on your local machine, 1 for receiving data from the network and the other for your word count processing. Thanks Best Regards On Wed, Mar 25, 2015 at 6:31 PM, James King jakwebin...@gmail.com wrote: I'm trying to run the Java NetworkWordCount example against a simple Spark standalone runtime of one master and one worker. But it doesn't seem to work: the text entered on the Netcat data server is not being picked up and printed to the Eclipse console output. However, if I use conf.setMaster("local[2]") it works, the correct text gets picked up and printed to the Eclipse console. Any ideas why, any pointers?
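For completeness, a small untested sketch of what the driver side of the word count could look like when pointed at the standalone master instead of local[2]; the master URL and Netcat host are made up, and spark.cores.max simply caps how many cores the application asks the standalone cluster for (it still needs at least 2, one for the receiver and one for processing):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountStandalone {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("NetworkWordCount")
      .setMaster("spark://ubuntu-host:7077") // standalone master URL, as shown on the 8080 UI
      .set("spark.cores.max", "2")           // at least 2: 1 receiver thread + 1 for processing
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("netcat-host", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}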
Re: Server IPC version 9 cannot communicate with client version 4
Build failed with the following errors. I executed the following command. * mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package* [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 2:11:59.461s [INFO] Finished at: Wed Mar 25 17:22:29 IST 2015 [INFO] Final Memory: 30M/440M [INFO] [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve dependencies for project org.apache.spark:spark-core_2.10:jar:1.2.1: Could not find artifact org.apache.hadoop:hadoop-client:jar:VERSION in central (https://repo1.maven.org/maven2) -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn <goals> -rf :spark-core_2.10 On Wed, Mar 25, 2015 at 3:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Just run : mvn -Pyarn -Phadoop-2.4 -D*hadoop.version=2.2* -DskipTests clean package Thanks Best Regards On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com wrote: Where do I export MAVEN_OPTS, in spark-env.sh or hadoop-env.sh? I am running the below command in the spark/yarn directory where the pom.xml file is available: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Please correct me if I am wrong. On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Looks like you have to build Spark with the related Hadoop version, otherwise you will meet the exception as mentioned. You could follow this doc: http://spark.apache.org/docs/latest/building-spark.html 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com: Hi Sparkers, I am trying to load data in spark with the following command *sqlContext.sql(LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt ' INTO TABLE src);* *Getting exception below* *Server IPC version 9 cannot communicate with client version 4* Note: I am using Hadoop 2.2 version and spark 1.2 and hive 0.13
Re: JavaKinesisWordCountASLYARN Example not working on EMR
Did you built for kineses using profile *-Pkinesis-asl* On Wed, Mar 25, 2015 at 7:18 PM, ankur.jain ankur.j...@yash.com wrote: Hi, I am trying to run a Spark on YARN program provided by Spark in the examples directory using Amazon Kinesis on EMR cluster : I am using Spark 1.3.0 and EMR AMI: 3.5.0 I've setup the Credentials export AWS_ACCESS_KEY_ID=XX export AWS_SECRET_KEY=XXX *A) This is the Kinesis Word Count Producer which ran Successfully : * run-example org.apache.spark.examples.streaming.KinesisWordCountProducerASL mySparkStream https://kinesis.us-east-1.amazonaws.com 1 5 *B) This one is the Normal Consumer using Spark Streaming which also ran Successfully: * run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASL mySparkStream https://kinesis.us-east-1.amazonaws.com *C) And this is the YARN based program which is NOT WORKING: * run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN mySparkStream https://kinesis.us-east-1.amazonaws.com\ Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/25 11:52:45 INFO spark.SparkContext: Running Spark version 1.3.0 15/03/25 11:52:45 WARN spark.SparkConf: SPARK_CLASSPATH was detected (set to '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'). This is deprecated in Spark 1.0+. Please instead use: • ./spark-submit with --driver-class-path to augment the driver classpath • spark.executor.extraClassPath to augment the executor classpath 15/03/25 11:52:45 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar' as a work-around. 15/03/25 11:52:45 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' to '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar' as a work-around. 15/03/25 11:52:46 INFO spark.SecurityManager: Changing view acls to: hadoop 15/03/25 11:52:46 INFO spark.SecurityManager: Changing modify acls to: hadoop 15/03/25 11:52:46 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 15/03/25 11:52:47 INFO slf4j.Slf4jLogger: Slf4jLogger started 15/03/25 11:52:48 INFO Remoting: Starting remoting 15/03/25 11:52:48 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ip-10-80-175-92.ec2.internal:59504] 15/03/25 11:52:48 INFO util.Utils: Successfully started service 'sparkDriver' on port 59504. 
15/03/25 11:52:48 INFO spark.SparkEnv: Registering MapOutputTracker 15/03/25 11:52:48 INFO spark.SparkEnv: Registering BlockManagerMaster 15/03/25 11:52:48 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/spark-120befbc-6dae-4751-b41f-dbf7b3d97616/blockmgr-d339d180-36f5-465f-bda3-cecccb23b1d3 15/03/25 11:52:48 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB 15/03/25 11:52:48 INFO spark.HttpFileServer: HTTP File server directory is /mnt/spark/spark-85e88478-3dad-4fcf-a43a-efd15166bef3/httpd-6115870a-0d90-44df-aa7c-a6bd1a47e107 15/03/25 11:52:48 INFO spark.HttpServer: Starting HTTP Server 15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT 15/03/25 11:52:49 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:44879 15/03/25 11:52:49 INFO util.Utils: Successfully started service 'HTTP file server' on port 44879. 15/03/25 11:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator 15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT 15/03/25 11:52:49 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 15/03/25 11:52:49 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 15/03/25 11:52:49 INFO ui.SparkUI: Started SparkUI at http://ip-10-80-175-92.ec2.internal:4040 15/03/25 11:52:50 INFO spark.SparkContext: Added JAR file:/home/hadoop/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar at http://10.80.175.92:44879/jars/spark-examples-1.3.0-hadoop2.4.0.jar with timestamp 1427284370358 15/03/25 11:52:50 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler 15/03/25 11:52:51 ERROR cluster.YarnClusterSchedulerBackend: Application ID is not set. 15/03/25 11:52:51 INFO netty.NettyBlockTransferService: Server created on 49982 15/03/25 11:52:51 INFO storage.BlockManagerMaster: Trying to register BlockManager 15/03/25 11:52:51 INFO storage.BlockManagerMasterActor: Registering block manager ip-10-80-175-92.ec2.internal:49982 with
Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use
Yes, I do have other applications already running. Thanks for your explanation. On Wed, Mar 25, 2015 at 2:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It means you already have 4 applications running on ports 4040, 4041, 4042, 4043, and that's why this one was able to run on 4044. You can do a netstat -pnat | grep 404 and see what all processes are running. Thanks Best Regards On Wed, Mar 25, 2015 at 1:13 AM, Roy rp...@njit.edu wrote: I get the following message each time I run a spark job: 1. 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use The full trace is here http://pastebin.com/xSvRN01f how do I fix this? I am on CDH 5.3.1 thanks roy
Re: Server IPC version 9 cannot communicate with client version 4
-D*hadoop.version=2.2* Thanks Best Regards On Wed, Mar 25, 2015 at 5:34 PM, sandeep vura sandeepv...@gmail.com wrote: Build failed with following errors. I have executed the below following command. * mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package* [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 2:11:59.461s [INFO] Finished at: Wed Mar 25 17:22:29 IST 2015 [INFO] Final Memory: 30M/440M [INFO] [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve dep endencies for project org.apache.spark:spark-core_2.10:jar:1.2.1: Could not find artifact org.apache.hadoop:hadoop-client:jar:VERSION in central ( https://repo1. maven.org/maven2) - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e swit ch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please rea d the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyReso lutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :spark-core_2.10 On Wed, Mar 25, 2015 at 3:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Just run : mvn -Pyarn -Phadoop-2.4 -D*hadoop.version=2.2* -DskipTests clean package Thanks Best Regards On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com wrote: Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh I am running the below command in spark/yarn directory where pom.xml file is available mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Please correct me if i am wrong. On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Looks like you have to build Spark with related Hadoop version, otherwise you will meet exception as mentioned. you could follow this doc: http://spark.apache.org/docs/latest/building-spark.html 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com: Hi Sparkers, I am trying to load data in spark with the following command *sqlContext.sql(LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt ' INTO TABLE src);* *Getting exception below* *Server IPC version 9 cannot communicate with client version 4* NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13
Re: 1.3 Hadoop File System problem
Thanks Patrick and Michael for your responses. For anyone else that runs across this problem prior to 1.3.1 being released, I've been pointed to this Jira ticket that's scheduled for 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 Thanks again. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207p5.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: EC2 Having script run at startup
You can use the AWS user-data feature. Try this, it may help: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html - Software Developer SigmoidAnalytics, Bangalore -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/EC2-Having-script-run-at-startup-tp22197p4.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
NetworkWordCount + Spark standalone
I'm trying to run the Java NetworkWordCount example against a simple Spark standalone runtime of one master and one worker. But it doesn't seem to work: the text entered on the Netcat data server is not being picked up and printed to the Eclipse console output. However, if I use conf.setMaster("local[2]") it works, the correct text gets picked up and printed to the Eclipse console. Any ideas why, any pointers?
Re: Spark ML Pipeline inaccessible types
Hi Martin, here are two possibilities to overcome this: 1) Put your logic into the org.apache.spark package in your project - then everything would be accessible. 2) Dirty trick: object SparkVector extends HashingTF { val VectorUDT: DataType = outputDataType } then you can do it like this: StructType(vectorTypeColumn, SparkVector.VectorUDT, false)) Thanks, Peter Rudenko On 2015-03-25 13:14, zapletal-mar...@email.cz wrote: Sean, thanks for your response. I am familiar with NoSuchMethodException in general, but I think it is not the case this time. The code actually attempts to get a parameter by name using val m = this.getClass.getMethodName(paramName). This may be a bug, but it is only a side effect caused by the real problem I am facing. My issue is that VectorUDT is not accessible by user code and therefore it is not possible to use a custom ML pipeline with the existing Predictors (see the last two paragraphs in my first email). Best Regards, Martin -- Original message -- From: Sean Owen so...@cloudera.com To: zapletal-mar...@email.cz Date: 25. 3. 2015 11:05:54 Subject: Re: Spark ML Pipeline inaccessible types NoSuchMethodError in general means that your runtime and compile-time environments are different. I think you need to first make sure you don't have mismatching versions of Spark. On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote: Hi, I have started implementing a machine learning pipeline using Spark 1.3.0 and the new pipelining API and DataFrames. I got to a point where I have my training data set prepared using a sequence of Transformers, but I am struggling to actually train a model and use it for predictions. I am getting a java.lang.NoSuchMethodException: org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName() exception thrown at the checkInputColumn method in the Params trait when using a Predictor (LinearRegression in my case, but that should not matter). This looks like a bug - the exception is thrown when executing getParam(colName) when the require(actualDataType.equals(datatype), ...) requirement is not met, so the expected requirement failed exception is not thrown and is hidden by the unexpected NoSuchMethodException instead. I can raise a bug if this really is an issue and I am not using something incorrectly. The problem I am facing however is that the Predictor expects features to have the VectorUDT type as defined in the Predictor class (protected def featuresDataType: DataType = new VectorUDT). But since this type is private[spark] my Transformer cannot prepare features with this type, which then correctly results in the exception above when I use a different type. Is there a way to define a custom Pipeline that would be able to use the existing Predictors without having to bypass the access modifiers or reimplement something, or is the pipelining API not yet expected to be used in this way? Thanks, Martin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
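For what it's worth, a slightly expanded and untested version of Peter's dirty trick, just to spell out where the re-exported type would end up being used; the "id"/"features" column names and the LongType field are made up for illustration:

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.types.{DataType, LongType, StructField, StructType}

// Subclassing HashingTF gives access to its protected outputDataType,
// which is the otherwise private[spark] VectorUDT instance.
object SparkVector extends HashingTF {
  val VectorUDT: DataType = outputDataType
}

// The exposed type can then be used in the output schema of a custom Transformer.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("features", SparkVector.VectorUDT, nullable = false)))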
Re: issue while submitting Spark Job as --master yarn-cluster
Hi Sachin, It appears that the application master is failing. To figure out what's wrong you need to get the logs for the application master. -Sandy On Wed, Mar 25, 2015 at 7:05 AM, Sachin Singh sachin.sha...@gmail.com wrote: OS I am using Linux, when I will run simply as master yarn, its running fine, Regards Sachin On Wed, Mar 25, 2015 at 4:25 PM, Xi Shen davidshe...@gmail.com wrote: What is your environment? I remember I had similar error when running spark-shell --master yarn-client in Windows environment. On Wed, Mar 25, 2015 at 9:07 PM sachin Singh sachin.sha...@gmail.com wrote: Hi , when I am submitting spark job in cluster mode getting error as under in hadoop-yarn log, someone has any idea,please suggest, 2015-03-25 23:35:22,467 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1427124496008_0028 State change from FINAL_SAVING to FAILED 2015-03-25 23:35:22,467 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1427124496008_0028 failed 2 times due to AM Container for appattempt_1427124496008_0028_02 exited with exitCode: 13 due to: Exception from container-launch. Container id: container_1427124496008_0028_02_01 Exit code: 13 Stack trace: ExitCodeException exitCode=13: at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute( Shell.java:702) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor. launchContainer(DefaultContainerExecutor.java:197) at org.apache.hadoop.yarn.server.nodemanager.containermanager. launcher.ContainerLaunch.call(ContainerLaunch.java:299) at org.apache.hadoop.yarn.server.nodemanager.containermanager. launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker( ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run( ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Container exited with a non-zero exit code 13 .Failing this attempt.. Failing the application. APPID=application_1427124496008_0028 -- View this message in context: http://apache-spark-user-list. 1001560.n3.nabble.com/issue-while-submitting-Spark-Job-as- master-yarn-cluster-tp0.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark-sql query got exception.Help
Oh, just noticed that you were calling |sc.setSystemProperty|. Actually you need to set this property in SparkConf or in spark-defaults.conf. And there are two configurations related to Kryo buffer size, * spark.kryoserializer.buffer.mb, which is the initial size, and * spark.kryoserializer.buffer.max.mb, which is the max buffer size. Make sure the 2nd one is larger (it seems that Kryo doesn’t check for it). Cheng On 3/25/15 7:31 PM, 李铖 wrote: Here is the full track 15/03/25 17:48:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, cloud1): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 39135 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220) at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-03-25 19:05 GMT+08:00 Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com: Could you please provide the full stack trace? On 3/25/15 6:26 PM, 李铖 wrote: It is ok when I do query data from a small hdfs file. But if the hdfs file is 152m,I got this exception. I try this code .'sc.setSystemProperty(spark.kryoserializer.buffer.mb,'256')'.error still. ``` com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 39135 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220) at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29) at ```
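A minimal sketch (untested) of what Cheng describes, setting both buffer properties on the SparkConf before the SparkContext is created instead of via sc.setSystemProperty; the 256/512 values are only examples:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-sql-kryo-buffer")
  // assuming Kryo is the serializer in use, as in the stack trace above
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // initial Kryo buffer size, in MB
  .set("spark.kryoserializer.buffer.mb", "256")
  // maximum size the buffer may grow to; keep it larger than the initial size
  .set("spark.kryoserializer.buffer.max.mb", "512")
val sc = new SparkContext(conf)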
foreachRDD execution
I have a simple and probably dumb question about foreachRDD. We are using spark streaming + cassandra to compute concurrent users every 5min. Our batch size is 10secs and our block interval is 2.5secs. At the end of the world we are using foreachRDD to join the data in the RDD with existing data in Cassandra, update the counters and then save it back to Cassandra. To the best of my understanding, in this scenario, spark streaming produces one RDD every 10secs and foreachRDD executes them sequentially, that is, foreachRDD would never run in parallel. Am I right? Regards, Luis
Re: NetworkWordCount + Spark standalone
Spark Streaming requires you to have a minimum of 2 cores, 1 for receiving your data and the other for processing. So when you say local[2] it basically initializes 2 threads on your local machine, 1 for receiving data from the network and the other for your word count processing. Thanks Best Regards On Wed, Mar 25, 2015 at 6:31 PM, James King jakwebin...@gmail.com wrote: I'm trying to run the Java NetworkWordCount example against a simple Spark standalone runtime of one master and one worker. But it doesn't seem to work: the text entered on the Netcat data server is not being picked up and printed to the Eclipse console output. However, if I use conf.setMaster("local[2]") it works, the correct text gets picked up and printed to the Eclipse console. Any ideas why, any pointers?
Re: Server IPC version 9 cannot communicate with client version 4
*I am using hadoop 2.4 should i mention -Dhadoop.version=2.2* *$ hadoop version* *Hadoop 2.4.1* *Subversion http://svn.apache.org/repos/asf/hadoop/common http://svn.apache.org/repos/asf/hadoop/common -r 1604318* *Compiled by jenkins on 2014-06-21T05:43Z* *Compiled with protoc 2.5.0* *From source with checksum bb7ac0a3c73dc131f4844b873c74b630* *This command was run using /home/hadoop24/hadoop-2.4.1/share/hadoop/common/hadoop-common-2.4.1.jar* On Wed, Mar 25, 2015 at 5:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: -D*hadoop.version=2.2* Thanks Best Regards On Wed, Mar 25, 2015 at 5:34 PM, sandeep vura sandeepv...@gmail.com wrote: Build failed with following errors. I have executed the below following command. * mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package* [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 2:11:59.461s [INFO] Finished at: Wed Mar 25 17:22:29 IST 2015 [INFO] Final Memory: 30M/440M [INFO] [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve dep endencies for project org.apache.spark:spark-core_2.10:jar:1.2.1: Could not find artifact org.apache.hadoop:hadoop-client:jar:VERSION in central ( https://repo1. maven.org/maven2) - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e swit ch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please rea d the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyReso lutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :spark-core_2.10 On Wed, Mar 25, 2015 at 3:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Just run : mvn -Pyarn -Phadoop-2.4 -D*hadoop.version=2.2* -DskipTests clean package Thanks Best Regards On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com wrote: Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh I am running the below command in spark/yarn directory where pom.xml file is available mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Please correct me if i am wrong. On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Looks like you have to build Spark with related Hadoop version, otherwise you will meet exception as mentioned. you could follow this doc: http://spark.apache.org/docs/latest/building-spark.html 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com: Hi Sparkers, I am trying to load data in spark with the following command *sqlContext.sql(LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt ' INTO TABLE src);* *Getting exception below* *Server IPC version 9 cannot communicate with client version 4* NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13
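Going by the hadoop version output above, the build flag should presumably match the installed 2.4.1 rather than 2.2, e.g. (untested; per the building-spark doc linked earlier, the hadoop-2.4 profile covers the 2.4.x line): mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package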
Re: NetworkWordCount + Spark standalone
Thanks Akhil, Yes indeed this is why it works when using local[2], but I'm unclear on why it doesn't work when using standalone daemons? Is there a way to check what cores are being seen when running against standalone daemons? I'm running the master and worker on the same Ubuntu host. The Driver program is running from a Windows machine. On the Ubuntu host the command cat /proc/cpuinfo | grep processor | wc -l gives 2. On the Windows machine it is: NumberOfCores=2 NumberOfLogicalProcessors=4 On Wed, Mar 25, 2015 at 2:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Spark Streaming requires you to have a minimum of 2 cores, 1 for receiving your data and the other for processing. So when you say local[2] it basically initializes 2 threads on your local machine, 1 for receiving data from the network and the other for your word count processing. Thanks Best Regards On Wed, Mar 25, 2015 at 6:31 PM, James King jakwebin...@gmail.com wrote: I'm trying to run the Java NetworkWordCount example against a simple Spark standalone runtime of one master and one worker. But it doesn't seem to work: the text entered on the Netcat data server is not being picked up and printed to the Eclipse console output. However, if I use conf.setMaster("local[2]") it works, the correct text gets picked up and printed to the Eclipse console. Any ideas why, any pointers?
Re: Spark as a service
You're welcome. How did it go? *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Wed, Mar 25, 2015 at 7:53 AM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: Thank you On Tue, Mar 24, 2015 at 8:40 PM, Irfan Ahmad ir...@cloudphysics.com wrote: Also look at the spark-kernel and spark job server projects. Irfan On Mar 24, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote: Perhaps this project, https://github.com/calrissian/spark-jetty-server, could help with your requirements. On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: I don't think there's are general approach to that - the usecases are just to different. If you really need it, you probably will have to implement yourself in the driver of your application. PS: Make sure to use the reply to all button so that the mailing list is included in your reply. Otherwise only I will get your mail. Regards, Jeff 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com : Hi Jeffrey, Thanks. Yes, this resolves the SQL problem. My bad - I was looking for something which would work for Spark Streaming and other Spark jobs too, not just SQL. Regards, Ashish On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run a dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps, through a Web Service or something similar. Regards, Ashish
Write Parquet File with spark-streaming with Spark 1.3
Hi, I've succeeded in writing a Kafka stream to a Parquet file in Spark 1.2, but I can't make it work with Spark 1.3. In streaming I can't use saveAsParquetFile() because I can't add data to an existing Parquet file. I know that it's possible to stream data directly into Parquet; could you help me with a little sample of which API I need to use? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Write-Parquet-File-with-spark-streaming-with-Spark-1-3-tp8.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
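For reference, a rough sketch (untested, Spark 1.3 APIs) of one way this is often handled: convert each micro-batch to a DataFrame inside foreachRDD and save it with SaveMode.Append, which adds new part files under the same Parquet directory. The socket source, Record case class, comma-split parsing and output path below are placeholders for your Kafka stream and schema:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Record(word: String, count: Int)

object StreamToParquet {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("StreamToParquet"))
    val ssc = new StreamingContext(sc, Seconds(10))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Placeholder source; in your case this would be the Kafka DStream.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      if (rdd.take(1).nonEmpty) {
        // Parse each line into a case class and turn the batch into a DataFrame.
        val df = rdd.map(_.split(",")).map(a => Record(a(0), a(1).trim.toInt)).toDF()
        // SaveMode.Append writes new part files into the existing Parquet directory.
        df.save("hdfs:///path/to/out.parquet", "parquet", SaveMode.Append)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}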
Re: How do you write Dataframes to elasticsearch
Spark 1.3 is not supported by elasticsearch-hadoop yet but will be very soon: https://github.com/elastic/elasticsearch-hadoop/issues/400 However in the meantime you could use df.toRDD.saveToEs - though you may have to manipulate the Row object perhaps to extract fields, not sure if it will serialize directly to ES JSON... — Sent from Mailbox On Wed, Mar 25, 2015 at 2:07 PM, yamanoj manoj.per...@gmail.com wrote: It seems that elasticsearch-spark_2.10 currently not supporting spart 1.3. Could you tell me if there is an alternative way to save Dataframes to elasticsearch? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-write-Dataframes-to-elasticsearch-tp3.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
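To make the df.toRDD.saveToEs suggestion a bit more concrete, here is an untested sketch of one way to turn Rows into Maps that elasticsearch-hadoop can serialize; the es.nodes address, input path and the "spark/docs" index/type are made up, and it assumes the elasticsearch-spark jar is on the classpath:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark._ // adds saveToEs to RDDs

object DataFrameToEs {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("df-to-es").set("es.nodes", "localhost:9200")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.jsonFile("hdfs:///path/to/input.json") // any DataFrame
    val fieldNames = df.schema.fieldNames

    // Zip each Row with the schema's field names to get one Map per document.
    val docs = df.rdd.map(row => fieldNames.zip(row.toSeq).toMap)
    docs.saveToEs("spark/docs")
  }
}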
Spark Streaming - Minimizing batch interval
I've been given a feature requirement that means processing events at a latency lower than 0.25ms, meaning I would have to make sure that Spark Streaming gets new events from the messaging layer within that period of time. Has anyone achieved such numbers using a Spark cluster? Or would this even be possible, even assuming we don't use the write ahead logs... thanks in advance! Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Minimizing-batch-interval-tp7.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10 CODE -- import org.apache.spark.{SparkConf, SparkContext} case class KmerIntesect(kmer: String, kCount: Int, fileName: String) object preDefKmerIntersection { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName(preDefKmer-intersect) val sc = new SparkContext(sparkConf) import sqlContext.createSchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc) val bedFile = sc.textFile(s3n://a/b/c,40) val hgfasta = sc.textFile(hdfs://a/b/c,40) val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val joinRDD = bedPair.join(filtered) val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} ty.registerTempTable(KmerIntesect) ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet) } }
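The constructor cannot be instantiated error above is likely a symptom of compiling 1.2-era Spark SQL code against 1.3 (with 1.2 artifacts still being pulled in, as the eviction warnings show): import sqlContext.createSchemaRDD no longer exists, and an RDD of case classes has to be converted explicitly. A hedged sketch (not compiled against this project, and the Adam/H2O dependencies may still drag in 1.2 jars) of how the same pipeline could look on the 1.3 API:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

object preDefKmerIntersection {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // replaces the old createSchemaRDD implicit

    val bedFile = sc.textFile("s3n://a/b/c", 40)
    val hgfasta = sc.textFile("hdfs://a/b/c", 40)
    val hgPair = hgfasta.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
    val filtered = hgPair.filter(kv => kv._2 == 1)
    val bedPair = bedFile.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
    val joinRDD = bedPair.join(filtered)

    // In 1.3 the RDD of case classes is converted to a DataFrame with toDF()
    val ty = joinRDD.map { case (word, (file1Counts, file2Counts)) =>
      KmerIntesect(word, file1Counts, "xyz")
    }.toDF()

    ty.registerTempTable("KmerIntesect")
    ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
  }
}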
Re: Spark Streaming - Minimizing batch interval
I don't think it's feasible to set a batch interval of 0.25ms. Even at tens of ms the overhead of the framework is a large factor. Do you mean 0.25s = 250ms? Related thoughts, and I don't know if they apply to your case: If you mean, can you just read off the source that quickly? yes. Sometimes when people say I need very low latency streaming because I need to answer quickly they are really trying to design a synchronous API, and I don't think asynchronous streaming is the right architecture. Sometimes people really mean I need to process 400 items per ms on average, which is different and entirely possible. On Wed, Mar 25, 2015 at 2:53 PM, RodrigoB rodrigo.boav...@aspect.com wrote: I've been given a feature requirement that means processing events on a latency lower than 0.25ms. Meaning I would have to make sure that Spark streaming gets new events from the messaging layer within that period of time. Would anyone have achieve such numbers using a Spark cluster? Or would this be even possible, even assuming we don't use the write ahead logs... tnks in advance! Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Minimizing-batch-interval-tp7.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
What are the best options for quickly filtering a DataFrame on a single column?
I have a SparkSQL dataframe with a few billion rows that I need to quickly filter down to a few hundred thousand rows, using an operation like (syntax may not be correct) df = df[ df.filter(lambda x: x.key_col in approved_keys)] I was thinking about serializing the data using parquet and saving it to S3, however as I want to optimize for filtering speed I'm not sure this is the best option. -- Stuart Layton
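One option, sketched here in Scala and untested: keep the approved keys in a small DataFrame and do an inner join, so the filtering happens inside Spark SQL rather than in a per-row lambda. The approvedKeys list, key_col column and input path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class ApprovedKey(key_col: String)

object FilterByKeys {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("filter-by-keys"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sqlContext.parquetFile("s3n://bucket/big-table.parquet") // a few billion rows
    val approvedKeys: Seq[String] = Seq("k1", "k2", "k3")             // a few hundred thousand in practice
    val approved = sc.parallelize(approvedKeys).map(k => ApprovedKey(k)).toDF()

    // Inner join keeps only rows whose key_col appears in the approved set;
    // note the result carries key_col from both sides, so select what you need afterwards.
    val filtered = df.join(approved, df("key_col") === approved("key_col"))
    println(filtered.count())
  }
}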
Recovered state for updateStateByKey and incremental streams processing
I want to use the restore from checkpoint to continue from the last accumulated word counts and process new streams of data. This recovery process will keep an accurate state of the accumulated counters (calculated by updateStateByKey) after failure/recovery or a temporary shutdown/upgrade to new code. However, the recommendation seems to indicate you have to delete the checkpoint data if you upgrade to new code. How would this work if I change the word count accumulation logic and still want to continue to work from the last remembered state? An example would be that the word counters could be weighted in a logic that is used for streams that are coming from a later point in time. This is an example, but there are quite a few scenarios where one needs to continue from the previous state as remembered by updateStateByKey and apply new logic. Code snippets below. As in the example RecoverableNetworkWordCount, we should build setupInputStreamAndProcessWordCounts in the context of the retrieved checkpoint only. If setupInputStreamAndProcessWordCounts is called outside createContext, you will get the error [Receiver-0-1427260249292] is not unique!. def createContext(checkPointDir: String, host: String, port : Int) = { // If you do not see this printed, that means the StreamingContext has been loaded // from the new checkpoint println("Creating new context") val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount") // Create the context with a 1 second batch size val ssc = new StreamingContext(sparkConf, Seconds(1)) ssc.checkpoint(checkPointDir) setupInputStreamAndProcessWordCounts(ssc, host, port) ssc } and invoke in main as def main(args: Array[String]) { val checkPointDir = "./saved_state" if (args.length < 2) { System.err.println("Usage: SavedNetworkWordCount <hostname> <port>") System.exit(1) } val ssc = StreamingContext.getOrCreate(checkPointDir, () => { createContext(checkPointDir, args(0), args(1).toInt) }) //setupInputStreamAndProcessWordCounts(ssc, args(0), args(1).toInt) ssc.start() ssc.awaitTermination() } } def setupInputStreamAndProcessWordCounts(ssc: StreamingContext, hostname: String, port: Int) { // InputDStream has to be created inside createContext, else you get an error val lines = ssc.socketTextStream(hostname, port) val words = lines.flatMap(_.split(" ")) val wordDstream = words.map(x => (x, 1)) val updateFunc = (values: Seq[Int], state: Option[Int]) => { val currentCount = values.sum val previousCount = state.getOrElse(0) Some(currentCount + previousCount) } // Update and print the cumulative count using updateStateByKey val countsDstream = wordDstream.updateStateByKey[Int](updateFunc) countsDstream.print() // print or save to external system } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Recovered-state-for-updateStateByKey-and-incremental-streams-processing-tp9.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
python : Out of memory: Kill process
Hi Guys, I am running the following function with spark-submit and the OS is killing my process: def getRdd(self,date,provider): path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz' log2= self.sqlContext.jsonFile(path) log2.registerTempTable('log_test') log2.cache() out=self.sqlContext.sql("SELECT user, tax from log_test where provider = '"+provider+"' and country <> ''").map(lambda row: (row.user, row.tax)) print out return map((lambda (x,y): (x, list(y))), sorted(out.groupByKey(2000).collect())) The input dataset has 57 zip files (2 GB). The same process with a smaller dataset completed successfully. Any ideas to debug this are welcome. Regards Eduardo
Re: OOM for HiveFromSpark example
I am facing same issue, posted a new thread. Please respond. On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang zzh...@hortonworks.com wrote: Hi Folks, I am trying to run hive context in yarn-cluster mode, but met some error. Does anybody know what cause the issue. I use following cmd to build the distribution: ./make-distribution.sh -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 15/01/13 17:59:42 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done 15/01/13 17:59:42 INFO storage.BlockManagerMasterActor: Registering block manager cn122-10.l42scl.hortonworks.com:56157 with 1589.8 MB RAM, BlockManagerId(2, cn122-10.l42scl.hortonworks.com, 56157) 15/01/13 17:59:43 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING) 15/01/13 17:59:43 INFO parse.ParseDriver: Parse Completed 15/01/13 17:59:44 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/01/13 17:59:44 INFO metastore.ObjectStore: ObjectStore, initialize called 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/01/13 17:59:52 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/01/13 17:59:52 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: @ (64), after : . 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/01/13 18:00:00 INFO metastore.ObjectStore: Initialized ObjectStore 15/01/13 18:00:00 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added admin role in metastore 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added public role in metastore 15/01/13 18:00:01 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty 15/01/13 18:00:01 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 
15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:02 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager 15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:03 INFO log.PerfLogger: PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:03 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING) 15/01/13 18:00:03 INFO parse.ParseDriver: Parse Completed 15/01/13 18:00:03 INFO log.PerfLogger: /PERFLOG method=parse start=1421190003030 end=1421190003031 duration=1 from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:03 INFO log.PerfLogger: PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Starting Semantic Analysis 15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Creating table src position=27 15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=src 15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang ip=unknown-ip-addr cmd=get_table : db=default tbl=src 15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_database: default 15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang ip=unknown-ip-addr cmd=get_database: default 15/01/13 18:00:03 INFO ql.Driver: Semantic Analysis Completed 15/01/13 18:00:03 INFO log.PerfLogger: /PERFLOG method=semanticAnalyze start=1421190003031 end=1421190003406 duration=375 from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:03 INFO
Re: OOM for HiveFromSpark example
I solve this by increase the PermGen memory size in driver. -XX:MaxPermSize=512m Thanks. Zhan Zhang On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.commailto:deepuj...@gmail.com wrote: I am facing same issue, posted a new thread. Please respond. On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang zzh...@hortonworks.commailto:zzh...@hortonworks.com wrote: Hi Folks, I am trying to run hive context in yarn-cluster mode, but met some error. Does anybody know what cause the issue. I use following cmd to build the distribution: ./make-distribution.sh -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 15/01/13 17:59:42 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done 15/01/13 17:59:42 INFO storage.BlockManagerMasterActor: Registering block manager cn122-10.l42scl.hortonworks.com:56157http://cn122-10.l42scl.hortonworks.com:56157/ with 1589.8 MB RAM, BlockManagerId(2, cn122-10.l42scl.hortonworks.comhttp://cn122-10.l42scl.hortonworks.com/, 56157) 15/01/13 17:59:43 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING) 15/01/13 17:59:43 INFO parse.ParseDriver: Parse Completed 15/01/13 17:59:44 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/01/13 17:59:44 INFO metastore.ObjectStore: ObjectStore, initialize called 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/01/13 17:59:52 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/01/13 17:59:52 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: @ (64), after : . 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/01/13 18:00:00 INFO metastore.ObjectStore: Initialized ObjectStore 15/01/13 18:00:00 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added admin role in metastore 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added public role in metastore 15/01/13 18:00:01 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty 15/01/13 18:00:01 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 
15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:02 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager 15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:03 INFO log.PerfLogger: PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:03 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING) 15/01/13 18:00:03 INFO parse.ParseDriver: Parse Completed 15/01/13 18:00:03 INFO log.PerfLogger: /PERFLOG method=parse start=1421190003030 end=1421190003031 duration=1 from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:03 INFO log.PerfLogger: PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Starting Semantic Analysis 15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Creating table src position=27 15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=src 15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang ip=unknown-ip-addr cmd=get_table : db=default tbl=src 15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_database: default 15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang ip=unknown-ip-addr
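If it helps anyone else hitting this, a guess (not verified on this setup) at how Zhan's PermGen fix can be passed through spark-submit rather than by editing scripts: add --driver-java-options "-XX:MaxPermSize=512m" to the spark-submit command line, or set spark.driver.extraJavaOptions -XX:MaxPermSize=512m in spark-defaults.conf (and spark.executor.extraJavaOptions as well if the executors hit the same limit).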
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADA compatible with 1.2.X might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote: Even if H2o and ADA are dependent on 1.2.1 , it should be backword compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. same errror Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10 CODE -- import org.apache.spark.{SparkConf, SparkContext} case class KmerIntesect(kmer: String, kCount: Int, fileName: String) object preDefKmerIntersection { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName(preDefKmer-intersect) val sc = new SparkContext(sparkConf) import sqlContext.createSchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc) val bedFile = sc.textFile(s3n://a/b/c,40) val hgfasta = sc.textFile(hdfs://a/b/c,40) val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val joinRDD = bedPair.join(filtered) val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} ty.registerTempTable(KmerIntesect) ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet) } }
Unable to run Hive program from Spark Programming Guide (OutOfMemoryError)
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables I modified the Hive query but run into same error. ( http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables) val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql(CREATE TABLE IF NOT EXISTS src_spark (key INT, value STRING)) sqlContext.sql(LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src) // Queries are expressed in HiveQL sqlContext.sql(FROM src SELECT key, value).collect().foreach(println) Command ./bin/spark-submit -v --master yarn-cluster --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar --num-executors 3 --driver-memory 8g --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 Input -sh-4.1$ ls -l examples/src/main/resources/kv1.txt -rw-r--r-- 1 dvasthimal gid-dvasthimal 5812 Mar 5 17:31 examples/src/main/resources/kv1.txt -sh-4.1$ head examples/src/main/resources/kv1.txt 238val_238 86val_86 311val_311 27val_27 165val_165 409val_409 255val_255 278val_278 98val_98 484val_484 -sh-4.1$ Log /apache/hadoop/bin/yarn logs -applicationId application_1426715280024_82757 … … … 15/03/25 07:52:44 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty 15/03/25 07:52:44 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 
15/03/25 07:52:47 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src_spark (key INT, value STRING) 15/03/25 07:52:47 INFO parse.ParseDriver: Parse Completed 15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager 15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src_spark (key INT, value STRING) 15/03/25 07:52:48 INFO parse.ParseDriver: Parse Completed 15/03/25 07:52:48 INFO log.PerfLogger: /PERFLOG method=parse start=1427295168392 end=1427295168393 duration=1 from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO parse.SemanticAnalyzer: Starting Semantic Analysis 15/03/25 07:52:48 INFO parse.SemanticAnalyzer: Creating table src_spark position=27 15/03/25 07:52:48 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=src_spark 15/03/25 07:52:48 INFO HiveMetaStore.audit: ugi=dvasthimal ip=unknown-ip-addr cmd=get_table : db=default tbl=src_spark 15/03/25 07:52:48 INFO metastore.HiveMetaStore: 0: get_database: default 15/03/25 07:52:48 INFO HiveMetaStore.audit: ugi=dvasthimal ip=unknown-ip-addr cmd=get_database: default 15/03/25 07:52:48 INFO ql.Driver: Semantic Analysis Completed 15/03/25 07:52:48 INFO log.PerfLogger: /PERFLOG method=semanticAnalyze start=1427295168393 end=1427295168595 duration=202 from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null) 15/03/25 07:52:48 INFO log.PerfLogger: /PERFLOG method=compile start=1427295168352 end=1427295168607 duration=255 from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO ql.Driver: Starting command: CREATE TABLE IF NOT EXISTS src_spark (key INT, value STRING) 15/03/25 07:52:48 INFO log.PerfLogger: /PERFLOG method=TimeToSubmit start=1427295168349 end=1427295168625 duration=276 from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=task.DDL.Stage-0 from=org.apache.hadoop.hive.ql.Driver 15/03/25 07:52:51 INFO exec.DDLTask: Default to LazySimpleSerDe for table src_spark 15/03/25 07:52:52 INFO metastore.HiveMetaStore: 0: create_table: Table(tableName:src_spark, dbName:default, owner:dvasthimal, createTime:1427295171, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:int, comment:null), FieldSchema(name:value,
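One thing worth double-checking in the query above: the CREATE statement makes a table called src_spark, while the LOAD and the SELECT still reference src. That mismatch does not by itself explain the OutOfMemoryError, which is more likely a driver/AM memory setting, but a hedged sketch with the table name used consistently (following the 1.3 programming guide's HiveContext API) would look like:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-example"))
    val sqlContext = new HiveContext(sc)

    sqlContext.sql("CREATE TABLE IF NOT EXISTS src_spark (key INT, value STRING)")
    sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src_spark")

    // Queries are expressed in HiveQL
    sqlContext.sql("FROM src_spark SELECT key, value").collect().foreach(println)
    sc.stop()
  }
}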
Re: OutOfMemory : Java heap space error
I am facing same issue, posted a new thread. Please respond. On Wed, Jul 9, 2014 at 1:56 AM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: Hi, My code was running properly but then it suddenly gave this error. Can you just put some light on it. ### 0 KB, free: 38.7 MB) 14/07/09 01:46:12 INFO BlockManagerMaster: Updated info of block rdd_2212_4 14/07/09 01:46:13 INFO PythonRDD: Times: total = 1486, boot = 698, init = 626, finish = 162 Exception in thread stdin writer for python 14/07/09 01:46:14 INFO MemoryStore: ensureFreeSpace(61480) called with cur Mem=270794224, maxMem=311387750 java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.init(Unknown Source) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:62) 14/07/09 01:46:15 INFO MemoryStore: Block rdd_2212_0 stored as values to memory (estimated size 60.0 KB, free 38.7 MB) Exception in thread stdin writer for python java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.init(Unknown Source) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:62) 14/07/09 01:46:18 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2212_0 in memory on shawn-PC:51451 (size: 60. 0 KB, free: 38.7 MB) PySpark worker failed with exception: Traceback (most recent call last): File F:\spark-0.9.1\spark-0.9.1\bin\..\/python/pyspark/worker.py, line 50, in main split_index = read_int(infile) File F:\spark-0.9.1\spark-0.9.1\python\pyspark\serializers.py, line 328, in read_int raise EOFError EOFError 14/07/09 01:46:25 INFO BlockManagerMaster: Updated info of block rdd_2212_0 Exception in thread stdin writer for python java.lang.OutOfMemoryError: Java heap space PySpark worker failed with exception: Traceback (most recent call last): File F:\spark-0.9.1\spark-0.9.1\bin\..\/python/pyspark/worker.py, line 50, in main split_index = read_int(infile) File F:\spark-0.9.1\spark-0.9.1\python\pyspark\serializers.py, line 328, in read_int raise EOFError EOFError Exception in thread Executor task launch worker-3 java.lang.OutOfMemoryError: Java heap space Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread spark-akka.actor.default-dispa tcher-15 Exception in thread Executor task launch worker-1 Exception in thread Executor task launch worker-2 java.lang.OutOfM emoryError: Java heap space java.lang.OutOfMemoryError: Java heap space Exception in thread Executor task launch worker-0 Exception in thread Executor task launch worker-5 java.lang.OutOfM emoryError: Java heap space java.lang.OutOfMemoryError: Java heap space 14/07/09 01:46:52 WARN BlockManagerMaster: Error sending message to BlockManagerMaster in 1 attempts akka.pattern.AskTimeoutException: Recipient[Actor[akka://spark/user/BlockManagerMaster#920823400]] had already been term inated. 
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134) at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:161) at org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52) at org.apache.spark.storage.BlockManager.org $apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97) at org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135) at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) at akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) 14/07/09 01:46:56 WARN BlockManagerMaster: Error sending message to BlockManagerMaster in 2 attempts akka.pattern.AskTimeoutException: Recipient[Actor[akka://spark/user/BlockManagerMaster#920823400]] had already been term inated. at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134) at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:161) at org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52) at org.apache.spark.storage.BlockManager.org $apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97) at org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135) at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) at akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) 14/07/09
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10 CODE -- import org.apache.spark.{SparkConf, SparkContext} case class KmerIntesect(kmer: String, kCount: Int, fileName: String) object preDefKmerIntersection { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName(preDefKmer-intersect) val sc = new SparkContext(sparkConf) import sqlContext.createSchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc) val bedFile = sc.textFile(s3n://a/b/c,40) val hgfasta = sc.textFile(hdfs://a/b/c,40) val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val joinRDD = bedPair.join(filtered) val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} ty.registerTempTable(KmerIntesect) ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet) } }
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
Even if H2o and ADA are dependent on 1.2.1 , it should be backword compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. same errror Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10 CODE -- import org.apache.spark.{SparkConf, SparkContext} case class KmerIntesect(kmer: String, kCount: Int, fileName: String) object preDefKmerIntersection { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName(preDefKmer-intersect) val sc = new SparkContext(sparkConf) import sqlContext.createSchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc) val bedFile = sc.textFile(s3n://a/b/c,40) val hgfasta = sc.textFile(hdfs://a/b/c,40) val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val joinRDD = bedPair.join(filtered) val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} ty.registerTempTable(KmerIntesect) ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet) } }
RE: Date and decimal datatype not working
Thanks. This library is only available with Spark 1.3. I am using version 1.2.1. Before I upgrade to 1.3, I want to try what can be done in 1.2.1. So I am using following: val MyDataset = sqlContext.sql(my select query”) MyDataset.map(t = t(0)+|+t(1)+|+t(2)+|+t(3)+|+t(4)+|+t(5)).saveAsTextFile(/my_destination_path) But it is giving following error: 15/03/24 17:05:51 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID 106) java.lang.NumberFormatException: For input string: at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:453) at java.lang.Long.parseLong(Long.java:483) at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:230) is there something wrong with the TSTAMP field which is Long datatype? Thanks Regards --- Ananda Basak Ph: 425-213-7092 From: Yin Huai [mailto:yh...@databricks.com] Sent: Monday, March 23, 2015 8:55 PM To: BASAK, ANANDA Cc: user@spark.apache.org Subject: Re: Date and decimal datatype not working To store to csv file, you can use Spark-CSVhttps://github.com/databricks/spark-csv library. On Mon, Mar 23, 2015 at 5:35 PM, BASAK, ANANDA ab9...@att.commailto:ab9...@att.com wrote: Thanks. This worked well as per your suggestions. I had to run following: val TABLE_A = sc.textFile(/Myhome/SPARK/files/table_a_file.txt).map(_.split(|)).map(p = ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), BigDecimal(p(4)), BigDecimal(p(5)), BigDecimal(p(6 Now I am stuck at another step. I have run a SQL query, where I am Selecting from all the fields with some where clause , TSTAMP filtered with date range and order by TSTAMP clause. That is running fine. Then I am trying to store the output in a CSV file. I am using saveAsTextFile(“filename”) function. But it is giving error. Can you please help me to write a proper syntax to store output in a CSV file? Thanks Regards --- Ananda Basak Ph: 425-213-7092tel:425-213-7092 From: BASAK, ANANDA Sent: Tuesday, March 17, 2015 3:08 PM To: Yin Huai Cc: user@spark.apache.orgmailto:user@spark.apache.org Subject: RE: Date and decimal datatype not working Ok, thanks for the suggestions. Let me try and will confirm all. Regards Ananda From: Yin Huai [mailto:yh...@databricks.commailto:yh...@databricks.com] Sent: Tuesday, March 17, 2015 3:04 PM To: BASAK, ANANDA Cc: user@spark.apache.orgmailto:user@spark.apache.org Subject: Re: Date and decimal datatype not working p(0) is a String. So, you need to explicitly convert it to a Long. e.g. p(0).trim.toLong. You also need to do it for p(2). For those BigDecimals value, you need to create BigDecimal objects from your String values. On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA ab9...@att.commailto:ab9...@att.com wrote: Hi All, I am very new in Spark world. Just started some test coding from last week. I am using spark-1.2.1-bin-hadoop2.4 and scala coding. I am having issues while using Date and decimal data types. Following is my code that I am simply running on scala prompt. I am trying to define a table and point that to my flat file containing raw data (pipe delimited format). Once that is done, I will run some SQL queries and put the output data in to another flat file with pipe delimited format. 
***
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
// Define row and table
case class ROW_A(
  TSTAMP: Long,
  USIDAN: String,
  SECNT: Int,
  SECT: String,
  BLOCK_NUM: BigDecimal,
  BLOCK_DEN: BigDecimal,
  BLOCK_PCT: BigDecimal)

val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
TABLE_A.registerTempTable("TABLE_A")
***
The second-to-last command is giving an error like the following: <console>:17: error: type mismatch; found: String, required: Long. Looks like the content from my flat file is always considered as String and not as Date or decimal. How can I make Spark take them as Date or decimal types? Regards Ananda
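For what it's worth, the NumberFormatException quoted earlier in this thread ("For input string: ...") is what p(0).trim.toLong throws when a field is empty, and because RDDs are lazy it only surfaces once the query is actually executed. A minimal sketch of a defensive parser that drops malformed rows, assuming the ROW_A layout above (illustrative only, not the poster's code):

import scala.util.Try

case class ROW_A(TSTAMP: Long, USIDAN: String, SECNT: Int, SECT: String,
                 BLOCK_NUM: BigDecimal, BLOCK_DEN: BigDecimal, BLOCK_PCT: BigDecimal)

def parseRow(line: String): Option[ROW_A] = {
  val p = line.split("\\|", -1)            // -1 keeps trailing empty fields
  if (p.length < 7) None
  else for {
    ts  <- Try(p(0).trim.toLong).toOption   // empty or non-numeric fields yield None
    cnt <- Try(p(2).trim.toInt).toOption
    num <- Try(BigDecimal(p(4).trim)).toOption
    den <- Try(BigDecimal(p(5).trim)).toOption
    pct <- Try(BigDecimal(p(6).trim)).toOption
  } yield ROW_A(ts, p(1), cnt, p(3), num, den, pct)
}

// val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").flatMap(parseRow)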
Re: Total size of serialized results is bigger than spark.driver.maxResultSize
As you noted, you can change the spark.driver.maxResultSize value in your Spark Configurations (https://spark.apache.org/docs/1.2.0/configuration.html). Please reference the Spark Properties section noting that you can modify these properties via the spark-defaults.conf or via SparkConf(). HTH! On Wed, Mar 25, 2015 at 8:01 AM Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Hi I ran a spark job and got the following error. Can anybody tell me how to work around this problem? For example how can I increase spark.driver.maxResultSize? Thanks. org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 128 tasks (1029.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobA ndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:12 03) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:12 02) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler .scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler .scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(D AGScheduler.scala:1420) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala :1375) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala :393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 15/03/25 10:48:38 WARN TaskSetManager: Lost task 128.0 in stage 199.0 (TID 6324, INT1-CAS01.pcc.lexi snexis.com): TaskKilled (killed intentionally) Ningjun
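A minimal sketch of setting that property programmatically before the SparkContext is created; the 2g value is only an example, and the same key can equally go in spark-defaults.conf or be passed with --conf on spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("large-result-job")
  .set("spark.driver.maxResultSize", "2g")   // default is 1g; 0 means unlimited
val sc = new SparkContext(conf)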
Re: java.lang.OutOfMemoryError: unable to create new native thread
I have a YARN cluster where the max memory allowed is 16GB. I set 12G for my driver, however i see OutOFMemory error even for this program http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables . What do you suggest ? On Wed, Mar 25, 2015 at 8:23 AM, Thomas Gerber thomas.ger...@radius.com wrote: So, 1. I reduced my -XX:ThreadStackSize to 5m (instead of 10m - default is 1m), which is still OK for my need. 2. I reduced the executor memory to 44GB for a 60GB machine (instead of 49GB). This seems to have helped. Thanks to Matthew and Sean. Thomas On Tue, Mar 24, 2015 at 3:49 PM, Matt Silvey matt.sil...@videoamp.com wrote: My memory is hazy on this but aren't there hidden limitations to Linux-based threads? I ran into some issues a couple of years ago where, and here is the fuzzy part, the kernel wants to reserve virtual memory per thread equal to the stack size. When the total amount of reserved memory (not necessarily resident memory) exceeds the memory of the system it throws an OOM. I'm looking for material to back this up. Sorry for the initial vague response. Matthew On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber thomas.ger...@radius.com wrote: Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for you help, Thomas -- Deepak
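A hedged sketch of the two knobs described in this thread, expressed as SparkConf settings. The values are the ones quoted above and are illustrative, not recommendations; -Xss is used here as the per-thread stack-size flag, and driver-side memory and JVM options would normally be passed to spark-submit (--driver-memory, --driver-java-options) rather than set in SparkConf:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "44g")               // leave headroom on a 60GB node
  .set("spark.executor.extraJavaOptions", "-Xss5m")  // ~5MB thread stacks instead of 10m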
Re: column expression in left outer join for DataFrame
Unfortunately you are now hitting a bug (that is fixed in master and will be released in 1.3.1 hopefully next week). However, even with that your query is still ambiguous and you will need to use aliases: val df_1 = df.filter( df(event) === 0) . select(country, cnt).as(a) val df_2 = df.filter( df(event) === 3) . select(country, cnt).as(b) val both = df_2.join(df_1, $a.country === $b.country), left_outer) On Tue, Mar 24, 2015 at 11:57 PM, S Krishna skrishna...@gmail.com wrote: Hi, Thanks for your response. I modified my code as per your suggestion, but now I am getting a runtime error. Here's my code: val df_1 = df.filter( df(event) === 0) . select(country, cnt) val df_2 = df.filter( df(event) === 3) . select(country, cnt) df_1.show() //produces the following output : // countrycnt // tw 3000 // uk 2000 // us 1000 df_2.show() //produces the following output : // countrycnt // tw 25 // uk 200 // us 95 val both = df_2.join(df_1, df_2(country)===df_1(country), left_outer) I am getting the following error when executing the join statement: java.util.NoSuchElementException: next on empty iterator. This error seems to be originating at DataFrame.join (line 133 in DataFrame.scala). The show() results show that both dataframes do have columns named country and that they are non-empty. I also tried the simpler join ( i.e. df_2.join(df_1) ) and got the same error stated above. I would like to know what is wrong with the join statement above. thanks On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust mich...@databricks.com wrote: You need to use `===`, so that you are constructing a column expression instead of evaluating the standard scala equality method. Calling methods to access columns (i.e. df.county is only supported in python). val join_df = df1.join( df2, df1(country) === df2(country), left_outer) On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote: Hi, I am trying to port some code that was working in Spark 1.2.0 on the latest version, Spark 1.3.0. This code involves a left outer join between two SchemaRDDs which I am now trying to change to a left outer join between 2 DataFrames. I followed the example for left outer join of DataFrame at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html Here's my code, where df1 and df2 are the 2 dataframes I am joining on the country field: val join_df = df1.join( df2, df1.country == df2.country, left_outer) But I got a compilation error that value country is not a member of sql.DataFrame I also tried the following: val join_df = df1.join( df2, df1(country) == df2(country), left_outer) I got a compilation error that it is a Boolean whereas a Column is required. So what is the correct Column expression I need to provide for joining the 2 dataframes on a specific field ? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
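A self-contained sketch of the aliased join suggested above, with the quoting spelled out, assuming Spark 1.3.1+ where the join bug is fixed; the data and names only mirror the thread and are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Event(country: String, event: Int, cnt: Int)

object AliasJoinExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("alias-join"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(
      Event("tw", 0, 3000), Event("uk", 0, 2000), Event("us", 0, 1000),
      Event("tw", 3, 25),   Event("uk", 3, 200),  Event("us", 3, 95)
    )).toDF()

    val df_1 = df.filter(df("event") === 0).select("country", "cnt").as("a")
    val df_2 = df.filter(df("event") === 3).select("country", "cnt").as("b")

    // The aliases disambiguate the two country columns in the join condition
    val both = df_2.join(df_1, $"b.country" === $"a.country", "left_outer")
    both.show()
    sc.stop()
  }
}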
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
Ah I see now you are trying to use a spark 1.2 cluster - you will need to be running spark 1.3 on your EC2 cluster in order to run programs built against spark 1.3. You will need to terminate and restart your cluster with spark 1.3 — Sent from Mailbox On Wed, Mar 25, 2015 at 6:39 PM, roni roni.epi...@gmail.com wrote: Even if H2o and ADA are dependent on 1.2.1 , it should be backword compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. same errror Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10 CODE -- import org.apache.spark.{SparkConf, SparkContext} case class KmerIntesect(kmer: String, kCount: Int, fileName: String) object preDefKmerIntersection { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName(preDefKmer-intersect) val sc = new SparkContext(sparkConf) import sqlContext.createSchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc) val bedFile = sc.textFile(s3n://a/b/c,40) val hgfasta = sc.textFile(hdfs://a/b/c,40) val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val joinRDD = bedPair.join(filtered) val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} ty.registerTempTable(KmerIntesect) ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet) } }
[no subject]
Hi, I have an RDD of pairs of strings like below: (A,B) (B,C) (C,D) (A,D) (E,F) (B,F) I need to transform/filter this into an RDD of pairs that does not repeat a string once it has been used. So something like (A,B) (C,D) (E,F); (B,C) is out because B has already been used in (A,B), (A,D) is out because A (and D) has been used, etc. I was thinking of the option of using a shared variable to keep track of what has already been used, but that may only work for a single partition and would not scale for a larger dataset. Is there any other efficient way to accomplish this? -- Thanks Regards Himanish
Re: OutOfMemoryError when using DataFrame created by Spark SQL
You should also try increasing the perm gen size: -XX:MaxPermSize=512m On Wed, Mar 25, 2015 at 2:37 AM, Ted Yu yuzhih...@gmail.com wrote: Can you try giving the Spark driver more heap? Cheers On Mar 25, 2015, at 2:14 AM, Todd Leo sliznmail...@gmail.com wrote: Hi, I am using *Spark SQL* to query my *Hive cluster*, following the Spark SQL and DataFrame Guide https://spark.apache.org/docs/latest/sql-programming-guide.html step by step. However, my HiveQL via sqlContext.sql() fails and java.lang.OutOfMemoryError is raised. The expected result of the query is small (it has a limit 1000 clause). My code is shown below:
scala> import sqlContext.implicits._
scala> val df = sqlContext.sql("select * from some_table where logdate=2015-03-24 limit 1000")
and the error msg: [ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27] [ActorSystem(sparkDriver)] Uncaught fatal error from thread [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: GC overhead limit exceeded The master heap memory is set by -Xms512m -Xmx512m, while workers are set by -Xms4096M -Xmx4096M, which I presume is sufficient for this trivial query. Additionally, after restarting the spark-shell and re-running the limit 5 query, the df object is returned and can be printed by df.show(), but other APIs fail with OutOfMemoryError, namely df.count(), df.select("some_field").show() and so forth. I understand that the RDD can be collected to the master so that further transformations can be applied, and given that DataFrame has “richer optimizations under the hood” and is convenient for an R/Julia user, I really hope this error can be tackled and that DataFrame is robust enough to depend on. Thanks in advance! REGARDS, Todd -- View this message in context: OutOfMemoryError when using DataFrame created by Spark SQL http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-when-using-DataFrame-created-by-Spark-SQL-tp22219.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Re:
What would it do with the following dataset? (A, B) (A, C) (B, D) On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I have a RDD of pairs of strings like below : (A,B) (B,C) (C,D) (A,D) (E,F) (B,F) I need to transform/filter this into a RDD of pairs that does not repeat a string once it has been used once. So something like , (A,B) (C,D) (E,F) (B,C) is out because B has already ben used in (A,B), (A,D) is out because A (and D) has been used etc. I was thinking of a option of using a shared variable to keep track of what has already been used but that may only work for a single partition and would not scale for larger dataset. Is there any other efficient way to accomplish this ? -- Thanks Regards Himanish
Re: What are the best options for quickly filtering a DataFrame on a single column?
The only way to do it in Python currently is to use the string-based filter API (where you pass us an expression as a string, and we parse it using our SQL parser).
from pyspark.sql import Row
from pyspark.sql.functions import *
df = sc.parallelize([Row(name="test")]).toDF()
df.filter("name in ('a', 'b')").collect()
Out[1]: []
df.filter("name in ('test')").collect()
Out[2]: [Row(name=u'test')]
In general you want to avoid lambda functions whenever you can do the same thing with a dataframe expression. This is because your lambda function is a black box that we cannot optimize (though you should certainly use them for the advanced stuff that expressions can't handle). I opened SPARK-6536 https://issues.apache.org/jira/browse/SPARK-6536 to provide a nicer interface for this. On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton stuart.lay...@gmail.com wrote: I have a SparkSQL dataframe with a few billion rows that I need to quickly filter down to a few hundred thousand rows, using an operation like (syntax may not be correct) df = df[ df.filter(lambda x: x.key_col in approved_keys)] I was thinking about serializing the data using parquet and saving it to S3, however as I want to optimize for filtering speed I'm not sure this is the best option. -- Stuart Layton
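For readers on the Scala side, both styles exist there as well; a minimal sketch, where the column name and sample data are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("filter-styles"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(Seq("test", "a", "c")).map(Tuple1.apply).toDF("name")

// String expression handed to the SQL parser (mirrors the Python example above)
df.filter("name in ('a', 'b')").show()
// Equivalent Column-expression form
df.filter($"name" === "test").show()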
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
Thanks Dean and Nick. So, I removed the ADAM and H2o from my SBT as I was not using them. I got the code to compile - only for fail while running with - SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21 Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD$ at preDefKmerIntersection$.main(kmerIntersetion.scala:26) This line is where I do a JOIN operation. val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) * val joinRDD = bedPair.join(filtered) * Any idea whats going on? I have data on the EC2 so I am avoiding creating the new cluster , but just upgrading and changing the code to use 1.3 and Spark SQL Thanks Roni On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com wrote: For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADA compatible with 1.2.X might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote: Even if H2o and ADA are dependent on 1.2.1 , it should be backword compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. same errror Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. 
along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10 CODE -- import org.apache.spark.{SparkConf, SparkContext} case class KmerIntesect(kmer: String, kCount: Int, fileName: String) object preDefKmerIntersection { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName(preDefKmer-intersect) val sc = new SparkContext(sparkConf) import sqlContext.createSchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc) val bedFile = sc.textFile(s3n://a/b/c,40) val hgfasta = sc.textFile(hdfs://a/b/c,40) val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val joinRDD = bedPair.join(filtered) val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} ty.registerTempTable(KmerIntesect) ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet) } }
Re:
It will only give (A,B). I am generating the pairs from combinations of the strings A, B, C and D, so the pairs (ignoring order) would be (A,B),(A,C),(A,D),(B,C),(B,D),(C,D). On successful filtering using the original condition it will transform to (A,B) and (C,D). On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld nkronenfeld@uncharted.software wrote: What would it do with the following dataset? (A, B) (A, C) (B, D) On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I have a RDD of pairs of strings like below : (A,B) (B,C) (C,D) (A,D) (E,F) (B,F) I need to transform/filter this into a RDD of pairs that does not repeat a string once it has been used once. So something like , (A,B) (C,D) (E,F) (B,C) is out because B has already ben used in (A,B), (A,D) is out because A (and D) has been used etc. I was thinking of a option of using a shared variable to keep track of what has already been used but that may only work for a single partition and would not scale for larger dataset. Is there any other efficient way to accomplish this ? -- Thanks Regards Himanish -- Thanks Regards Himanish
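Because each kept pair consumes both of its strings, the result depends on the order in which the pairs are examined, so the filter is inherently sequential. A hedged sketch of that greedy interpretation, collecting to the driver (only reasonable when the pair list fits in driver memory; a fully distributed version would need a different formulation):

import org.apache.spark.{SparkConf, SparkContext}

object GreedyPairFilter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("greedy-pairs"))
    val pairs = sc.parallelize(Seq(
      ("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("E", "F"), ("B", "F")))

    // Walk the pairs in a fixed order, keeping a pair only if neither string was seen yet
    val kept = pairs.collect().foldLeft((Set.empty[String], Vector.empty[(String, String)])) {
      case ((used, acc), (l, r)) =>
        if (used(l) || used(r)) (used, acc)        // one side already consumed: drop the pair
        else (used + l + r, acc :+ (l -> r))       // keep the pair, mark both strings as used
    }._2

    kept.foreach(println)                          // (A,B) (C,D) (E,F)
    sc.stop()
  }
}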
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
Weird. Are you running using SBT console? It should have the spark-core jar on the classpath. Similarly, spark-shell or spark-submit should work, but be sure you're using the same version of Spark when running as when compiling. Also, you might need to add spark-sql to your SBT dependencies, but that shouldn't be this issue. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote: Thanks Dean and Nick. So, I removed the ADAM and H2o from my SBT as I was not using them. I got the code to compile - only for fail while running with - SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21 Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD$ at preDefKmerIntersection$.main(kmerIntersetion.scala:26) This line is where I do a JOIN operation. val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) * val joinRDD = bedPair.join(filtered) * Any idea whats going on? I have data on the EC2 so I am avoiding creating the new cluster , but just upgrading and changing the code to use 1.3 and Spark SQL Thanks Roni On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com wrote: For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADA compatible with 1.2.X might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote: Even if H2o and ADA are dependent on 1.2.1 , it should be backword compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. same errror Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. 
along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10 CODE -- import org.apache.spark.{SparkConf, SparkContext} case class KmerIntesect(kmer: String, kCount: Int, fileName: String) object preDefKmerIntersection { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName(preDefKmer-intersect) val sc = new SparkContext(sparkConf) import sqlContext.createSchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc) val bedFile = sc.textFile(s3n://a/b/c,40) val hgfasta = sc.textFile(hdfs://a/b/c,40) val hgPair =
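The NoClassDefFoundError a few messages up is what typically shows up when the jar was compiled against one Spark version but run against another. A hedged build.sbt sketch of that version alignment: compile against the same Spark version the cluster runs and mark it "provided" so spark-submit supplies the jars at runtime (artifact list trimmed to what this thread uses; adjust versions to match the cluster):

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.3.0" % "provided",  // needed for SQLContext/DataFrames
  "org.apache.hadoop" % "hadoop-client" % "2.6.0"
)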
Re: Can LBFGS be used on streaming data?
Hello Jeremy, Sorry for the delayed reply! First issue was resolved, I believe it was just production and consumption rate problem. Regarding the second question, I am streaming the data from the file and there are about 38k records. I am sending the streams in the same sequence as I am reading from the file, but I am getting different weights each time may be something to do with how the DStreams are being processed. Can you suggest me some solution for this case? my requirement is that my program must generate the same weights for both static and streaming data? Thank you for your help! Best Regards, Arunkumar On Thu, Mar 19, 2015 at 9:25 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Regarding the first question, can you say more about how you are loading your data? And what is the size of the data set? And is that the only error you see, and do you only see it in the streaming version? For the second question, there are a couple reasons the weights might slightly differ, it depends on exactly how you set up the comparison. When you split it into 5, were those the same 5 chunks of data you used for the streaming case? And were they presented to the optimizer in the same order? Difference in either could produce small differences in the resulting weights, but that doesn’t mean it’s doing anything wrong. - jeremyfreeman.net @thefreemanlab On Mar 17, 2015, at 6:19 PM, EcoMotto Inc. ecomot...@gmail.com wrote: Hello Jeremy, Thank you for your reply. When I am running this code on the local machine on a streaming data, it keeps giving me this error: *WARN TaskSetManager: Lost task 2.0 in stage 211.0 (TID 4138, localhost): java.io.FileNotFoundException: /tmp/spark-local-20150316165742-9ac0/27/shuffle_102_2_0.data (No such file or directory) * And when I execute the same code on a static data after randomly splitting it into 5 sets, it gives me a little bit different weights (difference is in decimals). I am still trying to analyse why would this be happening. Any inputs, on why would this be happening? Best Regards, Arunkumar On Tue, Mar 17, 2015 at 11:32 AM, Jeremy Freeman freeman.jer...@gmail.com wrote: Hi Arunkumar, That looks like it should work. Logically, it’s similar to the implementation used by StreamingLinearRegression and StreamingLogisticRegression, see this class: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala which exposes the kind of operation your describing (for any linear method). The nice thing about the gradient-based methods is that they can use existing MLLib optimization routines in this fairly direct way. Other methods (such as KMeans) require a bit more reengineering. — Jeremy - jeremyfreeman.net @thefreemanlab On Mar 16, 2015, at 6:19 PM, EcoMotto Inc. ecomot...@gmail.com wrote: Hello, I am new to spark streaming API. I wanted to ask if I can apply LBFGS (with LeastSquaresGradient) on streaming data? Currently I am using forecahRDD for parsing through DStream and I am generating a model based on each RDD. Am I doing anything logically wrong here? Thank you. 
Sample Code:
val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
var initialWeights = Vectors.dense(Array.fill(numFeatures)(scala.util.Random.nextDouble()))
var isFirst = true
var model = new LinearRegressionModel(null, 1.0)
parsedData.foreachRDD { rdd =>
  if (isFirst) {
    val weights = algorithm.optimize(rdd, initialWeights)
    val w = weights.toArray
    val intercept = w.head
    model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
    isFirst = false
  } else {
    var ab = ArrayBuffer[Double]()
    ab.insert(0, model.intercept)
    ab.appendAll(model.weights.toArray)
    print("Intercept = " + model.intercept + " :: modelWeights = " + model.weights)
    initialWeights = Vectors.dense(ab.toArray)
    print("Initial Weights: " + initialWeights)
    val weights = algorithm.optimize(rdd, initialWeights)
    val w = weights.toArray
    val intercept = w.head
    model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
  }
}
Best Regards, Arunkumar
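As an aside, MLlib also ships a streaming linear regression that performs the per-batch update internally; a hedged sketch of that alternative (it uses SGD rather than LBFGS, so it is not a drop-in replacement for the code above, and parsedData is assumed here to be a DStream of LabeledPoint):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val numFeatures = 10  // illustrative
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(parsedData)   // updates the model on every batch
model.latestModel()          // current weights after each batch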
Re: Can a DataFrame be saved to s3 directly using Parquet?
Until then you can try sql(SET spark.sql.parquet.useDataSourceApi=false) On Wed, Mar 25, 2015 at 12:15 PM, Michael Armbrust mich...@databricks.com wrote: This will be fixed in Spark 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 and is fixed in master/branch-1.3 if you want to compile from source On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton stuart.lay...@gmail.com wrote: I'm trying to save a dataframe to s3 as a parquet file but I'm getting Wrong FS errors df.saveAsParquetFile(parquetFile) 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called with curMem=82744, maxMem=278302556 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 45.6 KB, free 265.3 MB) 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(7078) called with curMem=129389, maxMem=278302556 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 6.9 KB, free 265.3 MB) 15/03/25 18:56:10 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on ip-172-31-1-219.ec2.internal:58280 (size: 6.9 KB, free: 265.4 MB) 15/03/25 18:56:10 INFO storage.BlockManagerMaster: Updated info of block broadcast_5_piece0 15/03/25 18:56:10 INFO spark.SparkContext: Created broadcast 5 from textFile at JSONRelation.scala:98 Traceback (most recent call last): File stdin, line 1, in module File /root/spark/python/pyspark/sql/dataframe.py, line 121, in saveAsParquetFile self._jdf.saveAsParquetFile(path) File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o22.saveAsParquetFile. : java.lang.IllegalArgumentException: Wrong FS: s3n://com.my.bucket/spark-testing/, expected: hdfs:// ec2-52-0-159-113.compute-1.amazonaws.com:9000 Is it possible to save a dataframe to s3 directly using parquet? -- Stuart Layton
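Spelled out with the quoting intact, the workaround plus the save might look like the sketch below (sqlContext, df and the s3n path are placeholders; the SET simply switches parquet writes back to the pre-1.3 code path until the 1.3.1 fix lands):

sqlContext.sql("SET spark.sql.parquet.useDataSourceApi=false")
df.saveAsParquetFile("s3n://com.my.bucket/spark-testing/")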
Re: Spark shell never leaves ACCEPTED state in YARN CDH5
That probably means there are not enough free resources in your cluster to run the AM for the Spark job. Check your RM's web UI to see the resources you have available. On Wed, Mar 25, 2015 at 12:08 PM, Khandeshi, Ami ami.khande...@fmr.com.invalid wrote: I am seeing the same behavior. I have enough resources….. How do I resolve it? Thanks, Ami -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: What are the best options for quickly filtering a DataFrame on a single column?
My example is a totally reasonable way to do it, it just requires constructing strings In many cases you can also do it with column objects df[df.name == test].collect() Out[15]: [Row(name=u'test')] You should check out: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column and http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions On Wed, Mar 25, 2015 at 11:39 AM, Stuart Layton stuart.lay...@gmail.com wrote: Thanks for the response, I was using IN as an example of the type of operation I need to do. Is there another way to do this that lines up more naturally with the way things are supposed to be done in SparkSQL? On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust mich...@databricks.com wrote: The only way to do in using python currently is to use the string based filter API (where you pass us an expression as a string, and we parse it using our SQL parser). from pyspark.sql import Row from pyspark.sql.functions import * df = sc.parallelize([Row(name=test)]).toDF() df.filter(name in ('a', 'b')).collect() Out[1]: [] df.filter(name in ('test')).collect() Out[2]: [Row(name=u'test')] In general you want to avoid lambda functions whenever you can do the same thing a dataframe expression. This is because your lambda function is a black box that we cannot optimize (though you should certainly use them for the advanced stuff that expressions can't handle). I opened SPARK-6536 https://issues.apache.org/jira/browse/SPARK-6536 to provide a nicer interface for this. On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton stuart.lay...@gmail.com wrote: I have a SparkSQL dataframe with a a few billion rows that I need to quickly filter down to a few hundred thousand rows, using an operation like (syntax may not be correct) df = df[ df.filter(lambda x: x.key_col in approved_keys)] I was thinking about serializing the data using parquet and saving it to S3, however as I want to optimize for filtering speed I'm not sure this is the best option. -- Stuart Layton -- Stuart Layton
Re: column expression in left outer join for DataFrame
Hi, Thanks for your response. I am not clear about why the query is ambiguous. val both = df_2.join(df_1, df_2(country)===df_1(country), left_outer) I thought df_2(country)===df_1(country) indicates that the country field in the 2 dataframes should match and df_2(country) is the equivalent of df_2.country in SQL, while df_1(country) is the equivalent of df_1.country in SQL. So I am not sure why it is ambiguous. In Spark 1.2.0 I have used the same logic using SparkSQL and Tables ( e.g. WHERE tab1.country = tab2.country) and had no problems getting the correct result. thanks On Wed, Mar 25, 2015 at 11:05 AM, Michael Armbrust mich...@databricks.com wrote: Unfortunately you are now hitting a bug (that is fixed in master and will be released in 1.3.1 hopefully next week). However, even with that your query is still ambiguous and you will need to use aliases: val df_1 = df.filter( df(event) === 0) . select(country, cnt).as(a) val df_2 = df.filter( df(event) === 3) . select(country, cnt).as(b) val both = df_2.join(df_1, $a.country === $b.country), left_outer) On Tue, Mar 24, 2015 at 11:57 PM, S Krishna skrishna...@gmail.com wrote: Hi, Thanks for your response. I modified my code as per your suggestion, but now I am getting a runtime error. Here's my code: val df_1 = df.filter( df(event) === 0) . select(country, cnt) val df_2 = df.filter( df(event) === 3) . select(country, cnt) df_1.show() //produces the following output : // countrycnt // tw 3000 // uk 2000 // us 1000 df_2.show() //produces the following output : // countrycnt // tw 25 // uk 200 // us 95 val both = df_2.join(df_1, df_2(country)===df_1(country), left_outer) I am getting the following error when executing the join statement: java.util.NoSuchElementException: next on empty iterator. This error seems to be originating at DataFrame.join (line 133 in DataFrame.scala). The show() results show that both dataframes do have columns named country and that they are non-empty. I also tried the simpler join ( i.e. df_2.join(df_1) ) and got the same error stated above. I would like to know what is wrong with the join statement above. thanks On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust mich...@databricks.com wrote: You need to use `===`, so that you are constructing a column expression instead of evaluating the standard scala equality method. Calling methods to access columns (i.e. df.county is only supported in python). val join_df = df1.join( df2, df1(country) === df2(country), left_outer) On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote: Hi, I am trying to port some code that was working in Spark 1.2.0 on the latest version, Spark 1.3.0. This code involves a left outer join between two SchemaRDDs which I am now trying to change to a left outer join between 2 DataFrames. I followed the example for left outer join of DataFrame at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html Here's my code, where df1 and df2 are the 2 dataframes I am joining on the country field: val join_df = df1.join( df2, df1.country == df2.country, left_outer) But I got a compilation error that value country is not a member of sql.DataFrame I also tried the following: val join_df = df1.join( df2, df1(country) == df2(country), left_outer) I got a compilation error that it is a Boolean whereas a Column is required. So what is the correct Column expression I need to provide for joining the 2 dataframes on a specific field ? 
thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: python : Out of memory: Kill process
Hi Davies, I am running 1.1.0. Now I'm following this thread, which recommends setting the batchSize parameter to 1: http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html If this does not work I will install 1.2.1 or 1.3. Regards On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu dav...@databricks.com wrote: What's the version of Spark you are running? There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3, [1] https://issues.apache.org/jira/browse/SPARK-6055 On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Guys, I am running the following function with spark-submit and the OS is killing my process : def getRdd(self,date,provider): path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz' log2 = self.sqlContext.jsonFile(path) log2.registerTempTable('log_test') log2.cache() You only visit the table once, cache does not help here. out = self.sqlContext.sql("SELECT user, tax from log_test where provider = '"+provider+"' and country <> ''").map(lambda row: (row.user, row.tax)) print out return map((lambda (x,y): (x, list(y))), sorted(out.groupByKey(2000).collect())) 100 partitions (or less) will be enough for 2G dataset. The input dataset has 57 zip files (2 GB). The same process with a smaller dataset completed successfully. Any ideas to debug are welcome. Regards Eduardo
Spark shell never leaves ACCEPTED state in YARN CDH5
I am seeing the same behavior. I have enough resources. How do I resolve it? Thanks, Ami
Re: Can a DataFrame be saved to s3 directly using Parquet?
This will be fixed in Spark 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 and is fixed in master/branch-1.3 if you want to compile from source On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton stuart.lay...@gmail.com wrote: I'm trying to save a dataframe to s3 as a parquet file but I'm getting Wrong FS errors df.saveAsParquetFile(parquetFile) 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called with curMem=82744, maxMem=278302556 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 45.6 KB, free 265.3 MB) 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(7078) called with curMem=129389, maxMem=278302556 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 6.9 KB, free 265.3 MB) 15/03/25 18:56:10 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on ip-172-31-1-219.ec2.internal:58280 (size: 6.9 KB, free: 265.4 MB) 15/03/25 18:56:10 INFO storage.BlockManagerMaster: Updated info of block broadcast_5_piece0 15/03/25 18:56:10 INFO spark.SparkContext: Created broadcast 5 from textFile at JSONRelation.scala:98 Traceback (most recent call last): File stdin, line 1, in module File /root/spark/python/pyspark/sql/dataframe.py, line 121, in saveAsParquetFile self._jdf.saveAsParquetFile(path) File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o22.saveAsParquetFile. : java.lang.IllegalArgumentException: Wrong FS: s3n://com.my.bucket/spark-testing/, expected: hdfs:// ec2-52-0-159-113.compute-1.amazonaws.com:9000 Is it possible to save a dataframe to s3 directly using parquet? -- Stuart Layton
writing DStream RDDs to the same file
Hi, Is there a way to write all RDDs in a DStream to the same file? I tried this and got an empty file. I think it's because the file is not closed, i.e. ESMinibatchFunctions.writer.close() executes before the stream is created. Here's my code: myStream.foreachRDD(rdd => { rdd.foreach(x => { ESMinibatchFunctions.writer.append(rdd.collect()(0).toString() + " the data ") }) //localRdd = localRdd.union(rdd) //localArray = localArray ++ rdd.collect() } ) ESMinibatchFunctions.writer.close() object ESMinibatchFunctions { val writer = new PrintWriter("c:/delme/exxx.txt") }
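For reference, a sketch of one way to get every batch appended to a single file: do the writing on the driver inside foreachRDD (not inside rdd.foreach, which runs on the executors), and only close the writer after the streaming context has stopped. This is only a sketch; myStream and ssc are assumed to exist as in the post above, the output path is made up, and collect() is only reasonable for small batches.

import java.io.{FileWriter, PrintWriter}

val outPath = "/tmp/stream-output.txt"    // placeholder path on the driver machine
myStream.foreachRDD { rdd =>
  val batch = rdd.collect()               // pulls the batch to the driver; small batches only
  if (batch.nonEmpty) {
    val writer = new PrintWriter(new FileWriter(outPath, true))   // true = append
    try batch.foreach(x => writer.println(x.toString)) finally writer.close()
  }
}
ssc.start()
ssc.awaitTermination()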
Re: python : Out of memory: Kill process
What's the version of Spark you are running? There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3, [1] https://issues.apache.org/jira/browse/SPARK-6055 On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Guys, I running the following function with spark-submmit and de SO is killing my process : def getRdd(self,date,provider): path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz' log2= self.sqlContext.jsonFile(path) log2.registerTempTable('log_test') log2.cache() You only visit the table once, cache does not help here. out=self.sqlContext.sql(SELECT user, tax from log_test where provider = '+provider+'and country '').map(lambda row: (row.user, row.tax)) print out1 return map((lambda (x,y): (x, list(y))), sorted(out.groupByKey(2000).collect())) 100 partitions (or less) will be enough for 2G dataset. The input dataset has 57 zip files (2 GB) The same process with a smaller dataset completed successfully Any ideas to debug is welcome. Regards Eduardo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: What are the best options for quickly filtering a DataFrame on a single column?
Thanks for the response, I was using IN as an example of the type of operation I need to do. Is there another way to do this that lines up more naturally with the way things are supposed to be done in SparkSQL? On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust mich...@databricks.com wrote: The only way to do IN using Python currently is to use the string-based filter API (where you pass us an expression as a string, and we parse it using our SQL parser). from pyspark.sql import Row from pyspark.sql.functions import * df = sc.parallelize([Row(name="test")]).toDF() df.filter("name in ('a', 'b')").collect() Out[1]: [] df.filter("name in ('test')").collect() Out[2]: [Row(name=u'test')] In general you want to avoid lambda functions whenever you can do the same thing with a dataframe expression. This is because your lambda function is a black box that we cannot optimize (though you should certainly use them for the advanced stuff that expressions can't handle). I opened SPARK-6536 https://issues.apache.org/jira/browse/SPARK-6536 to provide a nicer interface for this. On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton stuart.lay...@gmail.com wrote: I have a SparkSQL dataframe with a few billion rows that I need to quickly filter down to a few hundred thousand rows, using an operation like (syntax may not be correct) df = df[ df.filter(lambda x: x.key_col in approved_keys)] I was thinking about serializing the data using parquet and saving it to S3, however as I want to optimize for filtering speed I'm not sure this is the best option. -- Stuart Layton
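The same string-based filter also works on the Scala side. A sketch, assuming a DataFrame df with the key_col column from the original question and a small in-memory set of approved keys (for a very large key set, a join against a DataFrame of keys would be the more natural approach):

val approvedKeys = Seq("k1", "k2", "k3")                    // placeholder keys
val inList = approvedKeys.map(k => s"'$k'").mkString(", ")
val filtered = df.filter(s"key_col in ($inList)")           // parsed by the SQL parser
filtered.show()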
Re: java.lang.OutOfMemoryError: unable to create new native thread
This is a different kind of error. Thomas' OOM error was specific to the kernel refusing to create another thread/process for his application. Matthew On Wed, Mar 25, 2015 at 10:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have a YARN cluster where the max memory allowed is 16GB. I set 12G for my driver, however i see OutOFMemory error even for this program http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables . What do you suggest ? On Wed, Mar 25, 2015 at 8:23 AM, Thomas Gerber thomas.ger...@radius.com wrote: So, 1. I reduced my -XX:ThreadStackSize to 5m (instead of 10m - default is 1m), which is still OK for my need. 2. I reduced the executor memory to 44GB for a 60GB machine (instead of 49GB). This seems to have helped. Thanks to Matthew and Sean. Thomas On Tue, Mar 24, 2015 at 3:49 PM, Matt Silvey matt.sil...@videoamp.com wrote: My memory is hazy on this but aren't there hidden limitations to Linux-based threads? I ran into some issues a couple of years ago where, and here is the fuzzy part, the kernel wants to reserve virtual memory per thread equal to the stack size. When the total amount of reserved memory (not necessarily resident memory) exceeds the memory of the system it throws an OOM. I'm looking for material to back this up. Sorry for the initial vague response. Matthew On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber thomas.ger...@radius.com wrote: Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for you help, Thomas -- Deepak
Can a DataFrame be saved to s3 directly using Parquet?
I'm trying to save a dataframe to s3 as a parquet file but I'm getting Wrong FS errors df.saveAsParquetFile(parquetFile) 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called with curMem=82744, maxMem=278302556 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 45.6 KB, free 265.3 MB) 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(7078) called with curMem=129389, maxMem=278302556 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 6.9 KB, free 265.3 MB) 15/03/25 18:56:10 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on ip-172-31-1-219.ec2.internal:58280 (size: 6.9 KB, free: 265.4 MB) 15/03/25 18:56:10 INFO storage.BlockManagerMaster: Updated info of block broadcast_5_piece0 15/03/25 18:56:10 INFO spark.SparkContext: Created broadcast 5 from textFile at JSONRelation.scala:98 Traceback (most recent call last): File stdin, line 1, in module File /root/spark/python/pyspark/sql/dataframe.py, line 121, in saveAsParquetFile self._jdf.saveAsParquetFile(path) File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o22.saveAsParquetFile. : java.lang.IllegalArgumentException: Wrong FS: s3n://com.my.bucket/spark-testing/, expected: hdfs:// ec2-52-0-159-113.compute-1.amazonaws.com:9000 Is it possible to save a dataframe to s3 directly using parquet? -- Stuart Layton
Re: python : Out of memory: Kill process
With batchSize = 1, I think it will become even worse. I'd suggest to go with 1.3, have a taste for the new DataFrame API. On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Davies, I running 1.1.0. Now I'm following this thread that recommend use batchsize parameter = 1 http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html if this does not work I will install 1.2.1 or 1.3 Regards On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu dav...@databricks.com wrote: What's the version of Spark you are running? There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3, [1] https://issues.apache.org/jira/browse/SPARK-6055 On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: Hi Guys, I running the following function with spark-submmit and de SO is killing my process : def getRdd(self,date,provider): path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz' log2= self.sqlContext.jsonFile(path) log2.registerTempTable('log_test') log2.cache() You only visit the table once, cache does not help here. out=self.sqlContext.sql(SELECT user, tax from log_test where provider = '+provider+'and country '').map(lambda row: (row.user, row.tax)) print out1 return map((lambda (x,y): (x, list(y))), sorted(out.groupByKey(2000).collect())) 100 partitions (or less) will be enough for 2G dataset. The input dataset has 57 zip files (2 GB) The same process with a smaller dataset completed successfully Any ideas to debug is welcome. Regards Eduardo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
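For what it's worth, a rough Scala sketch of the same pipeline with the two fixes Davies suggests (no cache(), and far fewer partitions in groupByKey). The path, provider value, and the exact WHERE clause are placeholders reconstructed from the post above:

val path = s"s3n://$AWS_BUCKET/$date/*.log.gz"   // AWS_BUCKET, date, provider assumed
val log = sqlContext.jsonFile(path)              // read only once, so no cache()
log.registerTempTable("log_test")
val pairs = sqlContext
  .sql(s"SELECT user, tax FROM log_test WHERE provider = '$provider' AND country <> ''")
  .map(row => (row.getString(0), row.get(1)))
// ~100 partitions is plenty for a ~2 GB input
val grouped = pairs.groupByKey(100).mapValues(_.toList).collect().sortBy(_._1)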
Re: column expression in left outer join for DataFrame
That's a good question. In this particular example, it is really only internal implementation details that make it ambiguous. However, fixing this was a very large change so we have deferred it to Spark 1.4 and instead print a warning now when you construct trivially equal expressions. I can try to explain a little bit about why solving this generally is (mostly) impossible. Consider the following: val df = sqlContext.load(...) val df1 = df val df2 = df df1.join(df2, df1("a") === df2("a")) Compared with SELECT * FROM df df1 JOIN df df2 WHERE df1.a = df2.a In the first example, the assigning of df to df1 and df2 is completely transparent to the catalyst optimizer as it is happening in Scala code. This means that df1("a") and df2("a") are completely indistinguishable to us (at least without crazy macro magic). In contrast, the aliasing is visible to the optimizer when we are doing it in SQL instead of Scala and thus we can differentiate. In your case you are doing transformations, and we could assign new unique ids each time a transformation is done. However, we don't do this today, and it's a pretty big change. There is a JIRA for this: SPARK-6231 https://issues.apache.org/jira/browse/SPARK-6231 On Wed, Mar 25, 2015 at 11:47 AM, S Krishna skrishna...@gmail.com wrote: Hi, Thanks for your response. I am not clear about why the query is ambiguous. val both = df_2.join(df_1, df_2(country)===df_1(country), left_outer) I thought df_2(country)===df_1(country) indicates that the country field in the 2 dataframes should match and df_2(country) is the equivalent of df_2.country in SQL, while df_1(country) is the equivalent of df_1.country in SQL. So I am not sure why it is ambiguous. In Spark 1.2.0 I have used the same logic using SparkSQL and Tables ( e.g. WHERE tab1.country = tab2.country) and had no problems getting the correct result. thanks On Wed, Mar 25, 2015 at 11:05 AM, Michael Armbrust mich...@databricks.com wrote: Unfortunately you are now hitting a bug (that is fixed in master and will be released in 1.3.1 hopefully next week). However, even with that your query is still ambiguous and you will need to use aliases: val df_1 = df.filter( df(event) === 0) . select(country, cnt).as(a) val df_2 = df.filter( df(event) === 3) . select(country, cnt).as(b) val both = df_2.join(df_1, $a.country === $b.country), left_outer) On Tue, Mar 24, 2015 at 11:57 PM, S Krishna skrishna...@gmail.com wrote: Hi, Thanks for your response. I modified my code as per your suggestion, but now I am getting a runtime error. Here's my code: val df_1 = df.filter( df(event) === 0) . select(country, cnt) val df_2 = df.filter( df(event) === 3) . select(country, cnt) df_1.show() //produces the following output : // countrycnt // tw 3000 // uk 2000 // us 1000 df_2.show() //produces the following output : // countrycnt // tw 25 // uk 200 // us 95 val both = df_2.join(df_1, df_2(country)===df_1(country), left_outer) I am getting the following error when executing the join statement: java.util.NoSuchElementException: next on empty iterator. This error seems to be originating at DataFrame.join (line 133 in DataFrame.scala). The show() results show that both dataframes do have columns named country and that they are non-empty. I also tried the simpler join ( i.e. df_2.join(df_1) ) and got the same error stated above. I would like to know what is wrong with the join statement above.
thanks On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust mich...@databricks.com wrote: You need to use `===`, so that you are constructing a column expression instead of evaluating the standard scala equality method. Calling methods to access columns (i.e. df.county is only supported in python). val join_df = df1.join( df2, df1(country) === df2(country), left_outer) On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote: Hi, I am trying to port some code that was working in Spark 1.2.0 on the latest version, Spark 1.3.0. This code involves a left outer join between two SchemaRDDs which I am now trying to change to a left outer join between 2 DataFrames. I followed the example for left outer join of DataFrame at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html Here's my code, where df1 and df2 are the 2 dataframes I am joining on the country field: val join_df = df1.join( df2, df1.country == df2.country, left_outer) But I got a compilation error that value country is not a member of sql.DataFrame I also tried the following: val join_df = df1.join( df2, df1(country) == df2(country), left_outer) I got a compilation error that it is a Boolean whereas a Column is required. So what is the correct
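Pulling Michael's alias suggestion together, with the string literals spelled out, here is a sketch of the join that should disambiguate the two sides once the 1.3.1 join bug fix is in (it assumes a DataFrame df with event, country and cnt columns, as in the thread):

import sqlContext.implicits._   // for the $"..." column syntax

val df_1 = df.filter(df("event") === 0).select("country", "cnt").as("a")
val df_2 = df.filter(df("event") === 3).select("country", "cnt").as("b")
val both = df_2.join(df_1, $"b.country" === $"a.country", "left_outer")
both.show()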
Re: newbie quesiton - spark with mesos
I think the problem is the use the loopback address: export SPARK_LOCAL_IP=127.0.0.1 In the stack trace from the slave, you see this: ... Reason: Connection refused: localhost/127.0.0.1:51849 akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/), Path(/user/MapOutputTracker)] It's trying to connect to an Akka actor on itself, using the loopback address. Try changing SPARK_LOCAL_IP to the publicly routable IP address. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Mon, Mar 23, 2015 at 7:37 PM, Anirudha Jadhav anirudh...@gmail.com wrote: My bad there, I was using the correct link for docs. The spark shell runs correctly, the framework is registered fine on mesos. is there some setting i am missing: this is my spark-env.sh export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz export SPARK_LOCAL_IP=127.0.0.1 here is what i see on the slave node. less 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr WARNING: Logging before InitGoogleLogging() is written to STDERR I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI ' http://100.125.5.93/sparkx.tgz' I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading ' http://100.125.5.93/sparkx.tgz' to '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz' into '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56' Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT] I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1 I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave 20150226-160708-78932-5050-8971-S0 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu) 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started 15/03/24 02:30:37 INFO Remoting: Starting remoting 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkexecu...@mesos-si2.dny1.bcpc.bloomberg.com:50542] 15/03/24 02:30:38 INFO Utils: Successfully started service 'sparkExecutor' on port 50542. 
15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: localhost/127.0.0.1:51849 akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/), Path(/user/MapOutputTracker)] at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at
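A sketch of the change Dean is suggesting, in conf/spark-env.sh on the machine running the driver (the address below is only a placeholder; it should be an address the Mesos slaves can route to, not the loopback):

export SPARK_LOCAL_IP=10.0.0.12   # placeholder: use this host's routable IP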
Re: Date and decimal datatype not working
Recall that the input isn't actually read until to do something that forces evaluation, like call saveAsTextFile. You didn't show the whole stack trace here, but it probably occurred while parsing an input line where one of your long fields is actually an empty string. Because this is such a common problem, I usually define a parse method that converts input text to the desired schema. It catches parse exceptions like this and reports the bad line at least. If you can return a default long in this case, say 0, that makes it easier to return something. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 11:48 AM, BASAK, ANANDA ab9...@att.com wrote: Thanks. This library is only available with Spark 1.3. I am using version 1.2.1. Before I upgrade to 1.3, I want to try what can be done in 1.2.1. So I am using following: val MyDataset = sqlContext.sql(my select query”) MyDataset.map(t = t(0)+|+t(1)+|+t(2)+|+t(3)+|+t(4)+|+t(5)).saveAsTextFile(/my_destination_path) But it is giving following error: 15/03/24 17:05:51 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID 106) java.lang.NumberFormatException: For input string: at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:453) at java.lang.Long.parseLong(Long.java:483) at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:230) is there something wrong with the TSTAMP field which is Long datatype? Thanks Regards --- Ananda Basak Ph: 425-213-7092 *From:* Yin Huai [mailto:yh...@databricks.com] *Sent:* Monday, March 23, 2015 8:55 PM *To:* BASAK, ANANDA *Cc:* user@spark.apache.org *Subject:* Re: Date and decimal datatype not working To store to csv file, you can use Spark-CSV https://github.com/databricks/spark-csv library. On Mon, Mar 23, 2015 at 5:35 PM, BASAK, ANANDA ab9...@att.com wrote: Thanks. This worked well as per your suggestions. I had to run following: val TABLE_A = sc.textFile(/Myhome/SPARK/files/table_a_file.txt).map(_.split(|)).map(p = ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), BigDecimal(p(4)), BigDecimal(p(5)), BigDecimal(p(6 Now I am stuck at another step. I have run a SQL query, where I am Selecting from all the fields with some where clause , TSTAMP filtered with date range and order by TSTAMP clause. That is running fine. Then I am trying to store the output in a CSV file. I am using saveAsTextFile(“filename”) function. But it is giving error. Can you please help me to write a proper syntax to store output in a CSV file? Thanks Regards --- Ananda Basak Ph: 425-213-7092 *From:* BASAK, ANANDA *Sent:* Tuesday, March 17, 2015 3:08 PM *To:* Yin Huai *Cc:* user@spark.apache.org *Subject:* RE: Date and decimal datatype not working Ok, thanks for the suggestions. Let me try and will confirm all. Regards Ananda *From:* Yin Huai [mailto:yh...@databricks.com] *Sent:* Tuesday, March 17, 2015 3:04 PM *To:* BASAK, ANANDA *Cc:* user@spark.apache.org *Subject:* Re: Date and decimal datatype not working p(0) is a String. So, you need to explicitly convert it to a Long. e.g. p(0).trim.toLong. You also need to do it for p(2). For those BigDecimals value, you need to create BigDecimal objects from your String values. On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA ab9...@att.com wrote: Hi All, I am very new in Spark world. 
Just started some test coding from last week. I am using spark-1.2.1-bin-hadoop2.4 and scala coding. I am having issues while using Date and decimal data types. Following is my code that I am simply running on scala prompt. I am trying to define a table and point that to my flat file containing raw data (pipe delimited format). Once that is done, I will run some SQL queries and put the output data in to another flat file with pipe delimited format. *** val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD // Define row and table case class ROW_A( TSTAMP: Long, USIDAN: String, SECNT:Int, SECT: String, BLOCK_NUM:BigDecimal, BLOCK_DEN:BigDecimal, BLOCK_PCT:BigDecimal) val TABLE_A = sc.textFile(/Myhome/SPARK/files/table_a_file.txt).map(_.split(|)).map(p = ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6))) TABLE_A.registerTempTable(TABLE_A) *** The second last command is giving error, like following: console:17: error: type mismatch; found : String
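A sketch of the kind of defensive parse method Dean describes, built around the ROW_A case class from this thread. The default values and the exact field handling are assumptions, and logging of bad lines is omitted for brevity; note also that split takes a regex, so the pipe has to be escaped:

def safeLong(s: String): Long = try s.trim.toLong catch { case _: NumberFormatException => 0L }
def safeInt(s: String): Int = try s.trim.toInt catch { case _: NumberFormatException => 0 }
def safeDecimal(s: String): BigDecimal = try BigDecimal(s.trim) catch { case _: NumberFormatException => BigDecimal(0) }

def parse(line: String): ROW_A = {
  val p = line.split("\\|")   // "|" is a regex metacharacter, so escape it
  ROW_A(safeLong(p(0)), p(1), safeInt(p(2)), p(3), safeDecimal(p(4)), safeDecimal(p(5)), safeDecimal(p(6)))
}

val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(parse)
TABLE_A.registerTempTable("TABLE_A")   // works via the createSchemaRDD implicit imported in the original post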
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
Yes, that's the problem. The RDD class exists in both binary jar files, but the signatures probably don't match. The bottom line, as always for tools like this, is that you can't mix versions. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 3:13 PM, roni roni.epi...@gmail.com wrote: My cluster is still on spark 1.2 and in SBT I am using 1.3. So probably it is compiling with 1.3 but running with 1.2 ? On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com wrote: Weird. Are you running using SBT console? It should have the spark-core jar on the classpath. Similarly, spark-shell or spark-submit should work, but be sure you're using the same version of Spark when running as when compiling. Also, you might need to add spark-sql to your SBT dependencies, but that shouldn't be this issue. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote: Thanks Dean and Nick. So, I removed the ADAM and H2o from my SBT as I was not using them. I got the code to compile - only for fail while running with - SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21 Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD$ at preDefKmerIntersection$.main(kmerIntersetion.scala:26) This line is where I do a JOIN operation. val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) * val joinRDD = bedPair.join(filtered) * Any idea whats going on? I have data on the EC2 so I am avoiding creating the new cluster , but just upgrading and changing the code to use 1.3 and Spark SQL Thanks Roni On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com wrote: For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADA compatible with 1.2.X might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote: Even if H2o and ADA are dependent on 1.2.1 , it should be backword compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. same errror Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? 
Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/ ; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core %
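Since the quoted build definition above is cut off and has lost its quoting, here is a sketch of what a consistent Spark 1.3 build.sbt could look like once the cluster itself is also on 1.3 (the artifact list is trimmed to the essentials, and marking Spark as provided assumes you submit against the cluster's own jars with spark-submit):

name := "kmer-intersect"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.3.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.6.0"
)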
Exception Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration
Hi,I am running spark in mesos and getting this error. Can anyone help me resolve this?Thanks15/03/25 21:05:00 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exceptionjava.lang.reflect.InvocationTargetExceptionat sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203) at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203) at scala.Option.foreach(Option.scala:236) at org.apache.spark.util.FileLogger.flush(FileLogger.scala:203)at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:90) at org.apache.spark.scheduler.EventLoggingListener.onUnpersistRDD(EventLoggingListener.scala:121) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:83) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:66) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)Caused by: java.io.IOException: Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT. 
(Nodes: current=[10.250.100.81:50010, 10.250.100.82:50010], original=[10.250.100.81:50010, 10.250.100.82:50010])at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:792) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:852) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:958) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:755) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:424)15/03/25 21:05:00 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer()15/03/25 21:05:00 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer(142731690 ms)15/03/25 21:05:00 INFO scheduler.JobGenerator: Checkpointing graph for time 142731750 ms15/03/25 21:05:00 INFO streaming.DStreamGraph: Updating checkpoint data for time 142731750 ms15/03/25 21:05:00 INFO streaming.DStreamGraph: Updated checkpoint data for time 142731750 ms -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Exception-Failed-to-add-a-datanode-User-may-turn-off-this-feature-by-setting-dfs-client-block-write-n-tp22231.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
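The exception text itself points at the HDFS client setting to relax. A sketch of one way to pass it from the Spark side (whether weakening the replace-datanode policy is appropriate for a cluster with only two datanodes is a separate question, so treat this as an assumption to verify against your HDFS setup):

sc.hadoopConfiguration.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER")

// or at submission time, so it reaches the application's Hadoop configuration:
//   --conf spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.policy=NEVER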
trouble with jdbc df in python
If I run the following: db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") I get the error: py4j.protocol.Py4JJavaError: An error occurred while calling o28.load. : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does not exist It seems to think I'm trying to load a file called jdbc? I'm setting the Postgres driver in the SPARK_CLASSPATH as well (that doesn't seem to be the problem).
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
My cluster is still on spark 1.2 and in SBT I am using 1.3. So probably it is compiling with 1.3 but running with 1.2 ? On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com wrote: Weird. Are you running using SBT console? It should have the spark-core jar on the classpath. Similarly, spark-shell or spark-submit should work, but be sure you're using the same version of Spark when running as when compiling. Also, you might need to add spark-sql to your SBT dependencies, but that shouldn't be this issue. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote: Thanks Dean and Nick. So, I removed the ADAM and H2o from my SBT as I was not using them. I got the code to compile - only for fail while running with - SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21 Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD$ at preDefKmerIntersection$.main(kmerIntersetion.scala:26) This line is where I do a JOIN operation. val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) * val joinRDD = bedPair.join(filtered) * Any idea whats going on? I have data on the EC2 so I am avoiding creating the new cluster , but just upgrading and changing the code to use 1.3 and Spark SQL Thanks Roni On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com wrote: For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADA compatible with 1.2.X might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote: Even if H2o and ADA are dependent on 1.2.1 , it should be backword compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. same errror Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. 
along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/ ; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10 CODE -- import org.apache.spark.{SparkConf, SparkContext} case class KmerIntesect(kmer: String, kCount: Int, fileName: String) object preDefKmerIntersection { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName(preDefKmer-intersect) val sc = new SparkContext(sparkConf) import
Re: Can LBFGS be used on streaming data?
Hi Arunkumar, I think L-BFGS will not work since L-BFGS algorithm assumes that the objective function will be always the same (i.e., the data is the same) for entire optimization process to construct the approximated Hessian matrix. In the streaming case, the data will be changing, so it will cause problem for the algorithm. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Mon, Mar 16, 2015 at 3:19 PM, EcoMotto Inc. ecomot...@gmail.com wrote: Hello, I am new to spark streaming API. I wanted to ask if I can apply LBFGS (with LeastSquaresGradient) on streaming data? Currently I am using forecahRDD for parsing through DStream and I am generating a model based on each RDD. Am I doing anything logically wrong here? Thank you. Sample Code: val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater()) var initialWeights = Vectors.dense(Array.fill(numFeatures)(scala.util.Random.nextDouble())) var isFirst = true var model = new LinearRegressionModel(null,1.0) parsedData.foreachRDD{rdd = if(isFirst) { val weights = algorithm.optimize(rdd, initialWeights) val w = weights.toArray val intercept = w.head model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept) isFirst = false }else{ var ab = ArrayBuffer[Double]() ab.insert(0, model.intercept) ab.appendAll( model.weights.toArray) print(Intercept = +model.intercept+ :: modelWeights = +model.weights) initialWeights = Vectors.dense(ab.toArray) print(Initial Weights: + initialWeights) val weights = algorithm.optimize(rdd, initialWeights) val w = weights.toArray val intercept = w.head model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept) } Best Regards, Arunkumar - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
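As a possible alternative for the streaming case (not something suggested in this thread, so treat it as a pointer rather than a recommendation): MLlib ships a streaming linear regression based on SGD that updates its model on every batch instead of assuming a fixed objective. A minimal sketch, where trainingStream is an assumed DStream[LabeledPoint] and numFeatures is a placeholder:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val numFeatures = 10   // placeholder
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
  .setStepSize(0.1)
  .setNumIterations(20)

model.trainOn(trainingStream)   // DStream[LabeledPoint]
model.predictOnValues(trainingStream.map(lp => (lp.label, lp.features))).print()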
Re: trouble with jdbc df in python
Try: db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo elliottco...@gmail.com wrote: If I run the following: db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") I get the error: py4j.protocol.Py4JJavaError: An error occurred while calling o28.load. : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does not exist It seems to think I'm trying to load a file called jdbc? I'm setting the Postgres driver in the SPARK_CLASSPATH as well (that doesn't seem to be the problem).
Re: How to specify the port for AM Actor ...
There is no configuration for it now. Best Regards, Shixiong Zhu 2015-03-26 7:13 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: There may be firewall rules limiting the ports between host running spark and the hadoop cluster. In that case, not all ports are allowed. Can it be a range of ports that can be specified ? On Wed, Mar 25, 2015 at 4:06 PM, Shixiong Zhu zsxw...@gmail.com wrote: It's a random port to avoid port conflicts, since multiple AMs can run in the same machine. Why do you need a fixed port? Best Regards, Shixiong Zhu 2015-03-26 6:49 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: Spark 1.3, Hadoop 2.5, Kerbeors When running spark-shell in yarn client mode, it shows following message with a random port every time (44071 in example below). Is there a way to specify that port to a specific port ? It does not seem to be part of ports specified in http://spark.apache.org/docs/latest/configuration.html spark.xxx.port ... Thanks, 15/03/25 22:27:10 INFO Client: Application report for application_1427316153428_0014 (state: ACCEPTED) 15/03/25 22:27:10 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@xyz :44071/user/YarnAM#-1989273896]
Re: trouble with jdbc df in python
Thanks for following up. I'll fix the docs. On Wed, Mar 25, 2015 at 4:04 PM, elliott cordo elliottco...@gmail.com wrote: Thanks!.. the below worked: db = sqlCtx.load(source="jdbc", url="jdbc:postgresql://localhost/x?user=x&password=x", dbtable="mstr.d_customer") Note that https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations needs to be updated. On Wed, Mar 25, 2015 at 6:12 PM, Michael Armbrust mich...@databricks.com wrote: Try: db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo elliottco...@gmail.com wrote: If I run the following: db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") I get the error: py4j.protocol.Py4JJavaError: An error occurred while calling o28.load. : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does not exist It seems to think I'm trying to load a file called jdbc? I'm setting the Postgres driver in the SPARK_CLASSPATH as well (that doesn't seem to be the problem).
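For anyone doing the same thing from Scala, the equivalent load call looks roughly like this (a sketch: the URL and table are placeholders, and the Postgres JDBC driver jar still has to be on the classpath):

val customers = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:postgresql://localhost/x?user=x&password=x",
  "dbtable" -> "mstr.d_customer"))
customers.show()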
Re: filter expression in API document for DataFrame
Yeah sorry, this is already fixed but we need to republish the docs. I'll add that both of the following do work: people.filter("age > 30") people.filter(people("age") > 30) On Tue, Mar 24, 2015 at 7:11 PM, SK skrishna...@gmail.com wrote: The following statement appears in the Scala API example at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame : people.filter("age" > 30). I tried this example and it gave a compilation error. I think this needs to be changed to people.filter(people("age") > 30). Also, it would be good to add some examples for the new equality operator for columns (e.g. (people("age") === 30) ). The programming guide does not have an example for this in the DataFrame Operations section and it was not very obvious that we need to be using a different equality operator for columns. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filter-expression-in-API-document-for-DataFrame-tp22213.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
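For quick reference, the working forms side by side (a sketch assuming a DataFrame named people with an age column, as in the docs):

people.filter("age > 30")            // string expression, parsed by the SQL parser
people.filter(people("age") > 30)    // Column expression
people.filter(people("age") === 30)  // equality on a Column uses ===, not ==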
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
Is there any way that I can install the new one and remove previous version. I installed spark 1.3 on my EC2 master and set teh spark home to the new one. But when I start teh spark-shell I get - java.lang.UnsatisfiedLinkError: org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V at org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method) Is There no way to upgrade without creating new cluster? Thanks Roni On Wed, Mar 25, 2015 at 1:18 PM, Dean Wampler deanwamp...@gmail.com wrote: Yes, that's the problem. The RDD class exists in both binary jar files, but the signatures probably don't match. The bottom line, as always for tools like this, is that you can't mix versions. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 3:13 PM, roni roni.epi...@gmail.com wrote: My cluster is still on spark 1.2 and in SBT I am using 1.3. So probably it is compiling with 1.3 but running with 1.2 ? On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com wrote: Weird. Are you running using SBT console? It should have the spark-core jar on the classpath. Similarly, spark-shell or spark-submit should work, but be sure you're using the same version of Spark when running as when compiling. Also, you might need to add spark-sql to your SBT dependencies, but that shouldn't be this issue. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote: Thanks Dean and Nick. So, I removed the ADAM and H2o from my SBT as I was not using them. I got the code to compile - only for fail while running with - SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21 Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD$ at preDefKmerIntersection$.main(kmerIntersetion.scala:26) This line is where I do a JOIN operation. val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) * val joinRDD = bedPair.join(filtered) * Any idea whats going on? I have data on the EC2 so I am avoiding creating the new cluster , but just upgrading and changing the code to use 1.3 and Spark SQL Thanks Roni On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com wrote: For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADA compatible with 1.2.X might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote: Even if H2o and ADA are dependent on 1.2.1 , it should be backword compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. 
same errror Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error]
Re: Spark ML Pipeline inaccessible types
Thanks Peter, I ended up doing something similar. I however consider both the approaches you mentioned bad practices, which is why I was looking for a solution directly supported by the current code. I can work with that now, but it does not seem to be the proper solution. Regards, Martin -- Original message -- From: Peter Rudenko petro.rude...@gmail.com To: zapletal-mar...@email.cz, Sean Owen so...@cloudera.com Date: 25. 3. 2015 13:28:38 Subject: Re: Spark ML Pipeline inaccessible types Hi Martin, here are 2 possibilities to overcome this: 1) Put your logic into the org.apache.spark package in your project - then everything would be accessible. 2) Dirty trick: object SparkVector extends HashingTF { val VectorUDT: DataType = outputDataType } then you can do like this: StructType(vectorTypeColumn, SparkVector.VectorUDT, false) Thanks, Peter Rudenko On 2015-03-25 13:14, zapletal-mar...@email.cz wrote: Sean, thanks for your response. I am familiar with NoSuchMethodException in general, but I think it is not the case this time. The code actually attempts to get the parameter by name using val m = this.getClass.getMethodName(paramName). This may be a bug, but it is only a side effect caused by the real problem I am facing. My issue is that VectorUDT is not accessible by user code and therefore it is not possible to use a custom ML pipeline with the existing Predictors (see the last two paragraphs in my first email). Best Regards, Martin -- Original message -- From: Sean Owen so...@cloudera.com To: zapletal-mar...@email.cz Date: 25. 3. 2015 11:05:54 Subject: Re: Spark ML Pipeline inaccessible types NoSuchMethodError in general means that your runtime and compile-time environments are different. I think you need to first make sure you don't have mismatching versions of Spark. On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote: Hi, I have started implementing a machine learning pipeline using Spark 1.3.0 and the new pipelining API and DataFrames. I got to a point where I have my training data set prepared using a sequence of Transformers, but I am struggling to actually train a model and use it for predictions. I am getting a java.lang.NoSuchMethodException: org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName() exception thrown at the checkInputColumn method in the Params trait when using a Predictor (LinearRegression in my case, but that should not matter). This looks like a bug - the exception is thrown when executing getParam(colName) when the require(actualDataType.equals(datatype), ...) requirement is not met, so the expected requirement failed exception is not thrown and is hidden by the unexpected NoSuchMethodException instead. I can raise a bug if this really is an issue and I am not using something incorrectly. The problem I am facing however is that the Predictor expects features to have the VectorUDT type as defined in the Predictor class (protected def featuresDataType: DataType = new VectorUDT). But since this type is private[spark] my Transformer can not prepare features with this type, which then correctly results in the exception above when I use a different type.
Is there a way to define a custom Pipeline that would be able to use the existing Predictors without having to bypass the access modifiers or reimplement something, or is the pipelining API not yet expected to be used in this way? Thanks, Martin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
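A sketch of Peter's first suggestion (declaring your own code inside an org.apache.spark subpackage so that private[spark] types such as VectorUDT become visible). The package, object and column names here are made up, and as Martin notes this is a workaround rather than a proper solution:

package org.apache.spark.ml.workaround

import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types.{StructField, StructType}

object FeaturesSchema {
  // add a features column carrying the VectorUDT type the Predictors expect
  def withFeatures(schema: StructType, featuresCol: String): StructType =
    StructType(schema.fields :+ StructField(featuresCol, new VectorUDT, nullable = false))
}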
Cross-compatibility of YARN shuffle service
Hi everyone, I am considering moving from Spark-Standalone to YARN. The context is that there are multiple Spark applications that are using different versions of Spark that all want to use the same YARN cluster. My question is: if I use a single Spark YARN shuffle service jar on the Node Manager, will the service work properly with all of the Spark applications, regardless of the specific versions of the applications? Or, is it the case that, if I want to use the external shuffle service, I need to have all of my applications using the same version of Spark? Thanks, -Matt Cheah