Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
Hi,

Thanks for your response. I modified my code as per your suggestion, but
now I am getting a runtime error. Here's my code:

val df_1 = df.filter(df("event") === 0)
  .select("country", "cnt")

val df_2 = df.filter(df("event") === 3)
  .select("country", "cnt")

df_1.show()
//produces the following output :
// country  cnt
//   tw   3000
//   uk   2000
//   us   1000

df_2.show()
//produces the following output :
// country  cnt
//   tw   25
//   uk   200
//   us   95

val both = df_2.join(df_1, df_2("country") === df_1("country"), "left_outer")

I am getting the following error when executing the join statement:

java.util.NoSuchElementException: next on empty iterator.

This error seems to be originating at DataFrame.join (line 133 in
DataFrame.scala).

The show() results show that both dataframes do have columns named
country and that they are non-empty. I also tried the simpler join ( i.e.
df_2.join(df_1) ) and got the same error stated above.

I would like to know what is wrong with the join statement above.

thanks

On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust mich...@databricks.com
wrote:

 You need to use `===`, so that you are constructing a column expression
 instead of evaluating the standard Scala equality method. Calling methods
 to access columns (i.e. df.country) is only supported in Python.

 val join_df = df1.join(df2, df1("country") === df2("country"),
 "left_outer")
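
A minimal, self-contained sketch of that working join, using toy data shaped
like the example above (assumes Spark 1.3 and sqlContext.implicits._):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext `sc`
import sqlContext.implicits._

val df_1 = Seq(("tw", 3000), ("uk", 2000), ("us", 1000)).toDF("country", "cnt")
val df_2 = Seq(("tw", 25), ("uk", 200), ("us", 95)).toDF("country", "cnt")

// === builds a Column expression; the join type is passed as a string
val both = df_2.join(df_1, df_2("country") === df_1("country"), "left_outer")
both.show()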

 On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote:

 Hi,

 I am trying to port some code that was working in Spark 1.2.0 on the
 latest
 version, Spark 1.3.0. This code involves a left outer join between two
 SchemaRDDs which I am now trying to change to a left outer join between 2
 DataFrames. I followed the example  for left outer join of DataFrame at

 https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

 Here's my code, where df1 and df2 are the 2 dataframes I am joining on the
 country field:

  val join_df =  df1.join( df2,  df1.country == df2.country, left_outer)

 But I got a compilation error that value  country is not a member of
 sql.DataFrame

 I  also tried the following:
  val join_df =  df1.join( df2, df1(country) == df2(country),
 left_outer)

 I got a compilation error that it is a Boolean whereas a Column is
 required.

 So what is the correct Column expression I need to provide for joining
 the 2
 dataframes on a specific field ?

 thanks








 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark Application Hung

2015-03-25 Thread Akhil Das
In production, I'd suggest having a high-availability cluster with a minimum
of 3 nodes (data nodes in your case).

Now let's examine your scenario:

- When you suddenly bring down one of the nodes which has 2 executors
running on it, what happens is that the node (DN2) will be holding your job's
shuffle data or computed data needed for the next stages (this is the
same effect as deleting your Spark local/work dir from DN1). The absence
of this node will lead to fetch failures, as you are seeing in the logs. But
eventually it will keep trying for some time, and I believe it will
recompute your whole pipeline on DN1.



Thanks
Best Regards

On Wed, Mar 25, 2015 at 12:11 AM, Ashish Rawat ashish.ra...@guavus.com
wrote:

  Hi,

  We are observing a hung Spark application when one of the YARN datanodes
 (running multiple Spark executors) goes down.

  *Setup details*:

- Spark: 1.2.1
- Hadoop: 2.4.0
- Spark Application Mode: yarn-client
- 2 datanodes (DN1, DN2)
- 6 spark executors (initially 3 executors on both DN1 and DN2, after
rebooting DN2, changes to 4 executors on DN1 and 2 executors on DN2)

 *Scenario*:

  When one of the datanodes (DN2) is brought down, the application hangs,
 with the Spark driver continuously showing the following warning:

  15/03/24 12:39:26 WARN TaskSetManager: Lost task 5.0 in stage 232.0 (TID
 37941, DN1): FetchFailed(null, shuffleId=155, mapId=-1, reduceId=5, message=
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
 location for shuffle 155
 at
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
 at
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 at
 org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
 at
 org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
 at
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
 at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
 Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
 Source)
 at java.lang.Thread.run(Unknown Source)


  When DN2 is brought down, one executor gets launched on DN1. When DN2 is
 brought back up after 15 mins, 2 executors get launched on it.
 All the executors (including the ones launched after DN2 comes back) keep
 showing the following errors:

  15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Don't have map
 outputs for shuffle 155, fetching them
 

Spark Performance -Hive or Hbase?

2015-03-25 Thread Siddharth Ubale
Hi,

We have started R&D on Apache Spark to use its features such as Spark SQL and
Spark Streaming. I have two pain points; can any of you address them? They
are as follows:

1.   Does Spark allow us to fetch updated items after an RDD has been mapped
and a schema has been applied? Or do we have to perform the RDD mapping and
apply the schema every time we run the query? In this case I am using HBase
tables to map the RDD.

2.   Does Spark SQL provide better performance when used with Hive or with HBase?


Thanks,
Siddharth Ubale,
Synchronized Communications
#43, Velankani Tech Park, Block No. II,
3rd Floor, Electronic City Phase I,
Bangalore – 560 100
Tel : +91 80 3202 4060
Web: www.syncoms.comhttp://www.syncoms.com/
London|Bangalore|Orlando

we innovate, plan, execute, and transform the business



Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Akhil Das
It says:
Tried to associate with unreachable remote address
[akka.tcp://sparkDriver@localhost:51849].
Address is now gated for 5000 ms, all messages to this address will be
delivered to dead letters. Reason: Connection refused: localhost/
127.0.0.1:51849

I'd suggest changing this property:
export SPARK_LOCAL_IP=127.0.0.1

Point it to your network address, e.g. 192.168.1.10.
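
For example, a minimal spark-env.sh change along those lines (the address below
is only a placeholder for the slave's routable IP):

# spark-env.sh -- placeholder address, adjust to your network
export SPARK_LOCAL_IP=192.168.1.10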

Thanks
Best Regards

On Tue, Mar 24, 2015 at 11:18 PM, Anirudha Jadhav aniru...@nyu.edu wrote:

 is there some setting i am missing:
 this is my spark-env.sh

 export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
 export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz
 export SPARK_LOCAL_IP=127.0.0.1



 here is what i see on the slave node.
 
 less
 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr
 

 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI '
 http://100.125.5.93/sparkx.tgz'
 I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading '
 http://100.125.5.93/sparkx.tgz' to
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
 I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
 into
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56'
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers
 for [TERM, HUP, INT]
 I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1
 I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave
 20150226-160708-78932-5050-8971-S0
 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as
 executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus
 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu
 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu
 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ubuntu); users
 with modify permissions: Set(ubuntu)
 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started
 15/03/24 02:30:37 INFO Remoting: Starting remoting
 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkExecutor@mesos-si2:50542]
 15/03/24 02:30:38 INFO Utils: Successfully started service 'sparkExecutor'
 on port 50542.
 15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker:
 akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker
 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now
 gated for 5000 ms, all messages to this address will be delivered to dead
 letters. Reason: Connection refused: localhost/127.0.0.1:51849
 akka.actor.ActorNotFound: Actor not found for:
 ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/),
 Path(/user/MapOutputTracker)]
 at
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
 at
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 at
 akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
 at
 akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
 at
 akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
 at
 akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
 at
 scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
 at
 scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
 at 

Re: Weird exception in Spark job

2015-03-25 Thread Akhil Das
As it says, you have a jar conflict:

java.lang.NoSuchMethodError:
org.jboss.netty.channel.socket.nio.NioWorkerPool.init(Ljava/util/concurrent/Executor;I)V

Verify your classpath and check the Netty versions.
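
If the job is built with Maven, one quick way to see which Netty versions end
up on the classpath (assuming a Maven build; sbt has an equivalent
dependency-graph facility):

mvn dependency:tree -Dincludes=io.netty,org.jboss.netty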


Thanks
Best Regards

On Tue, Mar 24, 2015 at 11:07 PM, nitinkak001 nitinkak...@gmail.com wrote:

 Any Ideas on this?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Weird-exception-in-Spark-job-tp22195p22204.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
Hi Sparkers,

I am trying to load data into Spark with the following command:

sqlContext.sql("LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt' INTO TABLE src")

Getting the exception below:

Server IPC version 9 cannot communicate with client version 4

Note: I am using Hadoop 2.2, Spark 1.2 and Hive 0.13.


Re: Optimal solution for getting the header from CSV with Spark

2015-03-25 Thread Felix C
The spark-csv package can handle the header row, and the code is at the link below.
It can also use the header to infer field names in the schema.

https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvRelation.scala

--- Original Message ---

From: Dean Wampler deanwamp...@gmail.com
Sent: March 24, 2015 9:19 AM
To: Sean Owen so...@cloudera.com
Cc: Spico Florin spicoflo...@gmail.com, user user@spark.apache.org
Subject: Re: Optimal solution for getting the header from CSV with Spark

Good point. There's no guarantee that you'll get the actual first
partition. One reason why I wouldn't allow a CSV header line in a real data
file, if I could avoid it.

Back to Spark, a safer approach is RDD.foreachPartition, which takes a
function expecting an iterator. You'll only need to grab the first element
(being careful that the partition isn't empty!) and then determine which of
those first lines has the header info.
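
A rough sketch of that idea, using mapPartitions rather than foreachPartition
so the result comes back as a value (the header-matching predicate is an
assumption and depends on what your header looks like):

import org.apache.spark.rdd.RDD

def findHeader(data: RDD[String], looksLikeHeader: String => Boolean): Option[String] = {
  // at most the first line of each partition; empty partitions contribute nothing
  val firstLines = data.mapPartitions(iter => iter.take(1))
  firstLines.filter(looksLikeHeader).take(1).headOption
}

// usage, assuming the header can be recognised by a known column name:
// val header = findHeader(lines, _.startsWith("country"))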

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Tue, Mar 24, 2015 at 11:12 AM, Sean Owen so...@cloudera.com wrote:

 I think this works in practice, but I don't know that the first block
 of the file is guaranteed to be in the first partition? certainly
 later down the pipeline that won't be true but presumably this is
 happening right after reading the file.

 I've always just written some filter that would only match the header,
 which assumes that this is possible to distinguish, but usually is.

 On Tue, Mar 24, 2015 at 2:41 PM, Dean Wampler deanwamp...@gmail.com
 wrote:
  Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
  read the whole file, use data.take(1), which is simpler.
 
  From the Rdd.take documentation, it works by first scanning one
 partition,
  and using the results from that partition to estimate the number of
  additional partitions needed to satisfy the limit. In this case, it will
  trivially stop at the first.
 
 
  Dean Wampler, Ph.D.
  Author: Programming Scala, 2nd Edition (O'Reilly)
  Typesafe
  @deanwampler
  http://polyglotprogramming.com
 
  On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com
 wrote:
 
  Hello!
 
  I would like to know what is the optimal solution for getting the header
  with from a CSV file with Spark? My aproach was:
 
  def getHeader(data: RDD[String]): String = {
  data.zipWithIndex().filter(_._2==0).map(x => x._1).take(1).mkString() }
 
  Thanks.
 
 



Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-25 Thread Akhil Das
It means you already have 4 applications running on ports 4040, 4041, 4042 and
4043, and that's why this one was able to run on 4044.

You can run netstat -pnat | grep 404 and see which processes are
listening.
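
The warning itself is harmless, since Spark simply retries the next port. If
you would rather pin the UI port per application, one option (the port number
is just an example):

spark-submit --conf spark.ui.port=4050 <your usual arguments>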

Thanks
Best Regards

On Wed, Mar 25, 2015 at 1:13 AM, , Roy rp...@njit.edu wrote:

 I get the following message each time I run a Spark job


1. 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address
already in use


 full trace is here

 http://pastebin.com/xSvRN01f

 how do I fix this ?

 I am on CDH 5.3.1

 thanks
 roy





Re: Optimal solution for getting the header from CSV with Spark

2015-03-25 Thread Spico Florin
Hello!
  Thanks for your responses. I was afraid that, due to partitioning, I would
lose the guarantee that the first element is the header. I observe that
rdd.first calls rdd.take(1) behind the scenes.
Best regards,
  Florin

On Tue, Mar 24, 2015 at 4:41 PM, Dean Wampler deanwamp...@gmail.com wrote:

 Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
 read the whole file, use data.take(1), which is simpler.

 From the Rdd.take documentation, it works by first scanning one partition,
 and using the results from that partition to estimate the number of
 additional partitions needed to satisfy the limit. In this case, it will
 trivially stop at the first.


 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com
 wrote:

 Hello!

 I would like to know what is the optimal solution for getting the header
 with from a CSV file with Spark? My aproach was:

 def getHeader(data: RDD[String]): String = {
  data.zipWithIndex().filter(_._2==0).map(x => x._1).take(1).mkString() }

 Thanks.





Explanation streaming-cep-engine with example

2015-03-25 Thread Dhimant
Hi,
Can someone explain how the Spark Streaming CEP engine works?
How can I use it, with a sample example?

http://spark-packages.org/package/Stratio/streaming-cep-engine



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Explanation-streaming-cep-engine-with-example-tp22218.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Saisai Shao
Looks like you have to build Spark against the matching Hadoop version,
otherwise you will hit the exception you mentioned. You could follow this doc:
http://spark.apache.org/docs/latest/building-spark.html

2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com:

 Hi Sparkers,

 I am trying to load data in spark with the following command

 *sqlContext.sql(LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt
   ' INTO TABLE src);*

 *Getting exception below*


 *Server IPC version 9 cannot communicate with client version 4*

 NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13







Spark Maven Test error

2015-03-25 Thread zzcclp
I use the following commands to run a unit test:
./make-distribution.sh --tgz --skip-java-test -Pscala-2.10 -Phadoop-2.3
-Phive -Phive-thriftserver -Pyarn -Dyarn.version=2.3.0-cdh5.1.2
-Dhadoop.version=2.3.0-cdh5.1.2

mvn -Pscala-2.10 -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn
-Dyarn.version=2.3.0-cdh5.1.2 -Dhadoop.version=2.3.0-cdh5.1.2 -pl
core,launcher,network/common,network/shuffle,sql/core,sql/catalyst,sql/hive
-DwildcardSuites=org.apache.spark.sql.JoinSuite test

Error occur:
---

 T E S T S

---

Running org.apache.spark.network.ProtocolSuite

log4j:WARN No appenders could be found for logger
(io.netty.util.internal.logging.InternalLoggerFactory).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.

Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.251 sec -
in org.apache.spark.network.ProtocolSuite

Running org.apache.spark.network.RpcIntegrationSuite

Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.54 sec -
in org.apache.spark.network.RpcIntegrationSuite

Running org.apache.spark.network.ChunkFetchIntegrationSuite

Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.34 sec -
in org.apache.spark.network.ChunkFetchIntegrationSuite

Running org.apache.spark.network.sasl.SparkSaslSuite

Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.103 sec -
in org.apache.spark.network.sasl.SparkSaslSuite

Running org.apache.spark.network.TransportClientFactorySuite

Tests run: 5, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 11.42 sec
 FAILURE! - in org.apache.spark.network.TransportClientFactorySuite

reuseClientsUpToConfigVariable(org.apache.spark.network.TransportClientFactorySuite)
 
Time elapsed: 2.332 sec   ERROR!

java.lang.IllegalStateException: failed to create a child event loop

at sun.nio.ch.IOUtil.makePipe(Native Method)

at
sun.nio.ch.EPollSelectorImpl.init(EPollSelectorImpl.java:65)

at
sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)

at
io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:126)

at
io.netty.channel.nio.NioEventLoop.init(NioEventLoop.java:120)

at
io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87)

at
io.netty.util.concurrent.MultithreadEventExecutorGroup.init(MultithreadEventExecutorGroup.java:64)

at
io.netty.channel.MultithreadEventLoopGroup.init(MultithreadEventLoopGroup.java:49)

at
io.netty.channel.nio.NioEventLoopGroup.init(NioEventLoopGroup.java:61)

at
io.netty.channel.nio.NioEventLoopGroup.init(NioEventLoopGroup.java:52)

at
org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56)

at
org.apache.spark.network.client.TransportClientFactory.init(TransportClientFactory.java:104)

at
org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:76)

at
org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:80)

at
org.apache.spark.network.TransportClientFactorySuite.testClientReuse(TransportClientFactorySuite.java:86)

at
org.apache.spark.network.TransportClientFactorySuite.reuseClientsUpToConfigVariable(TransportClientFactorySuite.java:131)

 

reuseClientsUpToConfigVariableConcurrent(org.apache.spark.network.TransportClientFactorySuite)
 
Time elapsed: 2.279 sec   ERROR!

java.lang.IllegalStateException: failed to create a child event loop

at sun.nio.ch.IOUtil.makePipe(Native Method)

at
sun.nio.ch.EPollSelectorImpl.init(EPollSelectorImpl.java:65)

at
sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)

at
io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:126)

at
io.netty.channel.nio.NioEventLoop.init(NioEventLoop.java:120)

at
io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87)

at
io.netty.util.concurrent.MultithreadEventExecutorGroup.init(MultithreadEventExecutorGroup.java:64)

at
io.netty.channel.MultithreadEventLoopGroup.init(MultithreadEventLoopGroup.java:49)

at
io.netty.channel.nio.NioEventLoopGroup.init(NioEventLoopGroup.java:61)

at
io.netty.channel.nio.NioEventLoopGroup.init(NioEventLoopGroup.java:52)

at
org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56)

at
org.apache.spark.network.client.TransportClientFactory.init(TransportClientFactory.java:104)

at
org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:76)

at
org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:80)


Exception in thread main java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$PriorityProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;

2015-03-25 Thread Canoe
I compile spark-1.3.0 on Hadoop 2.3.0-cdh5.1.0 with protoc 2.5.0. But when I
try to run the examples, it throws:

Exception in thread main java.lang.VerifyError: class
org.apache.hadoop.yarn.proto.YarnProtos$PriorityProto overrides final method
getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2532)
at java.lang.Class.getConstructor0(Class.java:2842)
at java.lang.Class.getConstructor(Class.java:1718)
at org.apache.hadoop.yarn.factories.impl.pb.RecordFactoryPBImpl.newRecordInstance(RecordFactoryPBImpl.java:62)
at org.apache.hadoop.yarn.util.Records.newRecord(Records.java:36)
at org.apache.hadoop.yarn.api.records.Priority.newInstance(Priority.java:39)
at org.apache.hadoop.yarn.api.records.Priority.(Priority.java:34)
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.(YarnSparkHadoopUtil.scala:101)
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.(YarnSparkHadoopUtil.scala)
at org.apache.spark.deploy.yarn.ClientArguments.(ClientArguments.scala:38)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:646)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Hey, guys. It may be caused by a mismatched protobuf version. Can anyone tell
me what I should do to avoid this problem? I did not find any information
about this. Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Exception-in-thread-main-java-lang-VerifyError-class-org-apache-hadoop-yarn-proto-YarnProtos-Priorit-tp22217.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
Is your Spark compiled against Hadoop 2.2? If not, download the Spark 1.2
binary built for Hadoop 2.2 from https://spark.apache.org/downloads.html

Thanks
Best Regards

On Wed, Mar 25, 2015 at 12:52 PM, sandeep vura sandeepv...@gmail.com
wrote:

 Hi Sparkers,

 I am trying to load data in spark with the following command

 *sqlContext.sql(LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt
   ' INTO TABLE src);*

 *Getting exception below*


 *Server IPC version 9 cannot communicate with client version 4*

 NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13







Serialization Problem in Spark Program

2015-03-25 Thread donhoff_h
Hi, experts

I wrote a very simple Spark program to test the Kryo serialization function.
The code is as follows:

object TestKryoSerialization {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryo.registrationRequired", "true")  // I use this statement to force checking registration.
    conf.registerKryoClasses(Array(classOf[MyObject]))

    val sc = new SparkContext(conf)
    val rdd = sc.textFile("hdfs://dhao.hrb:8020/user/spark/tstfiles/charset/A_utf8.txt")
    val objs = rdd.map(new MyObject(_, 1)).collect()
    for (x <- objs) {
      x.printMyObject
    }
  }
}

The class MyObject is also a very simple class, which is only used to test the
serialization function:
class MyObject {
  var myStr: String = ""
  var myInt: Int = 0
  def this(inStr: String, inInt: Int) {
    this()
    this.myStr = inStr
    this.myInt = inInt
  }
  def printMyObject {
    println("MyString is : " + myStr + "\tMyInt is : " + myInt)
  }
}

But when I ran the application, it reported the following error:
java.lang.IllegalArgumentException: Class is not registered: 
dhao.test.Serialization.MyObject[]
Note: To register this class use: 
kryo.register(dhao.test.Serialization.MyObject[].class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I don't understand what causes this problem. I have used
conf.registerKryoClasses to register my class. Could anyone help me? Thanks.

By the way, the spark version is 1.3.0.
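
The error message itself points at the array type (MyObject[]). A minimal
sketch of registering both the class and its array form, assuming the package
dhao.test.Serialization shown in the stack trace:

import org.apache.spark.SparkConf
import dhao.test.Serialization.MyObject  // package name taken from the error message

val conf = new SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrationRequired", "true")
// collect() serializes an Array[MyObject], so the array class needs registering too
conf.registerKryoClasses(Array(classOf[MyObject], classOf[Array[MyObject]]))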

OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread SLiZn Liu
Hi,

I am using Spark SQL to query my Hive cluster, following the Spark SQL
and DataFrame Guide
(https://spark.apache.org/docs/latest/sql-programming-guide.html) step by
step. However, my HiveQL via sqlContext.sql() fails and
java.lang.OutOfMemoryError is raised. The expected result of this query is
small (I added a limit 1000 clause). My code is shown below:

scala> import sqlContext.implicits._
scala> val df = sqlContext.sql("select * from some_table where logdate=2015-03-24 limit 1000")

and the error msg:

[ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27]
[ActorSystem(sparkDriver)] Uncaught fatal error from thread
[sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded

The master heap memory is set by -Xms512m -Xmx512m, while the workers are set
by -Xms4096M -Xmx4096M, which I presume is sufficient for this trivial query.

Additionally, after restarting the spark-shell and re-running the limit 5
query, the df object is returned and can be printed by df.show(), but other
APIs fail with OutOfMemoryError, namely df.count(),
df.select("some_field").show() and so forth.

I understand that the RDD can be collected to the master so that further
transformations can be applied. Given that DataFrame has "richer optimizations
under the hood" and a familiar convention for an R/Julia user, I really hope
this error can be tackled, and that DataFrame is robust enough to depend on.

Thanks in advance!

REGARDS,
Todd
​



Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
Where do I export MAVEN_OPTS - in spark-env.sh or hadoop-env.sh?

I am running the command below in the spark/yarn directory, where the pom.xml
file is available:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package

Please correct me if I am wrong.




On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com
wrote:

 Looks like you have to build Spark with related Hadoop version, otherwise
 you will meet exception as mentioned. you could follow this doc:
 http://spark.apache.org/docs/latest/building-spark.html

 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com:

 Hi Sparkers,

 I am trying to load data in spark with the following command

 *sqlContext.sql(LOAD DATA LOCAL INPATH
 '/home/spark12/sandeep/sandeep.txt   ' INTO TABLE src);*

 *Getting exception below*


 *Server IPC version 9 cannot communicate with client version 4*

 NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13








Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Ted Yu
Can you try giving the Spark driver more heap?

Cheers
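
For example, when launching the shell (the 4g figure is only an illustration):

spark-shell --driver-memory 4g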



 On Mar 25, 2015, at 2:14 AM, Todd Leo sliznmail...@gmail.com wrote:
 
 Hi,
 
 I am using Spark SQL to query on my Hive cluster, following Spark SQL and 
 DataFrame Guide step by step. However, my HiveQL via sqlContext.sql() fails 
 and java.lang.OutOfMemoryError was raised. The expected result of such query 
 is considered to be small (by adding limit 1000 clause). My code is shown 
 below:
 
 scala import sqlContext.implicits._  

 scala val df = sqlContext.sql(select * from some_table where 
 logdate=2015-03-24 limit 1000)
 and the error msg:
 
 [ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27] 
 [ActorSystem(sparkDriver)] Uncaught fatal error from thread 
 [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 the master heap memory is set by -Xms512m -Xmx512m, while workers set by 
 -Xms4096M -Xmx4096M, which I presume sufficient for this trivial query.
 
 Additionally, after restarted the spark-shell and re-run the limit 5 query , 
 the df object is returned and can be printed by df.show(), but other APIs 
 fails on OutOfMemoryError, namely, df.count(), df.select(some_field).show() 
 and so forth.
 
 I understand that the RDD can be collected to master hence further 
 transmutations can be applied, as DataFrame has “richer optimizations under 
 the hood” and the convention from an R/julia user, I really hope this error 
 is able to be tackled, and DataFrame is robust enough to depend.
 
 Thanks in advance!
 
 REGARDS,
 Todd
 
 
 View this message in context: OutOfMemoryError when using DataFrame created 
 by Spark SQL
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread sachin Singh
Hi,
When I submit a Spark job in cluster mode I get the error below in the
hadoop-yarn log. Does anyone have any idea? Please suggest.

2015-03-25 23:35:22,467 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1427124496008_0028 State change from FINAL_SAVING to FAILED
2015-03-25 23:35:22,467 WARN
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs
OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE
DESCRIPTION=App failed with state: FAILED   PERMISSIONS=Application
application_1427124496008_0028 failed 2 times due to AM Container for
appattempt_1427124496008_0028_02 exited with  exitCode: 13 due to:
Exception from container-launch.
Container id: container_1427124496008_0028_02_01
Exit code: 13
Stack trace: ExitCodeException exitCode=13: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 13
.Failing this attempt.. Failing the application.
APPID=application_1427124496008_0028



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/issue-while-submitting-Spark-Job-as-master-yarn-cluster-tp0.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
Just run:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.2 -DskipTests clean package

Thanks
Best Regards

On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com wrote:

 Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh

 I am running the below command in spark/yarn directory where pom.xml file
 is available

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package

 Please correct me if i am wrong.




 On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Looks like you have to build Spark with related Hadoop version, otherwise
 you will meet exception as mentioned. you could follow this doc:
 http://spark.apache.org/docs/latest/building-spark.html

 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com:

 Hi Sparkers,

 I am trying to load data in spark with the following command

 *sqlContext.sql(LOAD DATA LOCAL INPATH
 '/home/spark12/sandeep/sandeep.txt   ' INTO TABLE src);*

 *Getting exception below*


 *Server IPC version 9 cannot communicate with client version 4*

 NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13









Spark-sql query got exception.Help

2015-03-25 Thread 李铖
It is OK when I query data from a small HDFS file, but if the HDFS file is
152 MB I get this exception. I tried
sc.setSystemProperty("spark.kryoserializer.buffer.mb", "256"), but the error
remains.

```
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 39135
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at


```
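
A sketch of setting the Kryo buffer limits on the SparkConf before the context
is created (property names as used in Spark 1.2/1.3; treat them as an
assumption, since the exact version isn't stated):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64")       // initial buffer size
  .set("spark.kryoserializer.buffer.max.mb", "256")  // ceiling the buffer may grow to
val sc = new SparkContext(conf)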


How do you write Dataframes to elasticsearch

2015-03-25 Thread yamanoj
It seems that elasticsearch-spark_2.10 currently does not support Spark 1.3.
Could you tell me if there is an alternative way to save DataFrames to
Elasticsearch?
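
One possible workaround, sketched under the assumption that the RDD-level
saveToEs API of elasticsearch-hadoop (org.elasticsearch.spark) still works with
your Spark version; the index/type name here is hypothetical:

import org.elasticsearch.spark._

val cols = df.columns                                    // column names, captured on the driver
val docs = df.rdd.map(row => cols.zip(row.toSeq).toMap)  // one Map[String, Any] per row
docs.saveToEs("myindex/mytype")                          // hypothetical index/type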



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-write-Dataframes-to-elasticsearch-tp3.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
Oh, in that case you should mention 2.4. If you don't want to compile Spark,
you can download the precompiled version from the Downloads page
(https://spark.apache.org/downloads.html):
http://d3kbcqa49mib13.cloudfront.net/spark-1.2.0-bin-hadoop2.4.tgz

Thanks
Best Regards

On Wed, Mar 25, 2015 at 5:40 PM, sandeep vura sandeepv...@gmail.com wrote:

 I am using Hadoop 2.4 - should I mention -Dhadoop.version=2.2?

 $ hadoop version
 Hadoop 2.4.1
 Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1604318
 Compiled by jenkins on 2014-06-21T05:43Z
 Compiled with protoc 2.5.0
 From source with checksum bb7ac0a3c73dc131f4844b873c74b630
 This command was run using
 /home/hadoop24/hadoop-2.4.1/share/hadoop/common/hadoop-common-2.4.1.jar




 On Wed, Mar 25, 2015 at 5:38 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 -Dhadoop.version=2.2


 Thanks
 Best Regards

 On Wed, Mar 25, 2015 at 5:34 PM, sandeep vura sandeepv...@gmail.com
 wrote:

 Build failed with following errors.

 I have executed the below following command.

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package


 [INFO]
 
 [INFO] BUILD FAILURE
 [INFO]
 
 [INFO] Total time: 2:11:59.461s
 [INFO] Finished at: Wed Mar 25 17:22:29 IST 2015
 [INFO] Final Memory: 30M/440M
 [INFO]
 
 [ERROR] Failed to execute goal on project spark-core_2.10: Could not
 resolve dep
endencies for project
 org.apache.spark:spark-core_2.10:jar:1.2.1: Could not find

 artifact org.apache.hadoop:hadoop-client:jar:VERSION in central (
 https://repo1.
maven.org/maven2) - [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the
 -e swit
ch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions,
 please rea
d the following articles:
 [ERROR] [Help 1]
 http://cwiki.apache.org/confluence/display/MAVEN/DependencyReso

lutionException
 [ERROR]
 [ERROR] After correcting the problems, you can resume the build with the
 command
 [ERROR]   mvn goals -rf :spark-core_2.10


 On Wed, Mar 25, 2015 at 3:38 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Just run :

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.2 -DskipTests clean package


 ​

 Thanks
 Best Regards

 On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com
 wrote:

 Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh

 I am running the below command in spark/yarn directory where pom.xml
 file is available

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package

 Please correct me if i am wrong.




 On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Looks like you have to build Spark with related Hadoop version,
 otherwise you will meet exception as mentioned. you could follow this 
 doc:
 http://spark.apache.org/docs/latest/building-spark.html

 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com:

 Hi Sparkers,

 I am trying to load data in spark with the following command

 *sqlContext.sql(LOAD DATA LOCAL INPATH
 '/home/spark12/sandeep/sandeep.txt   ' INTO TABLE src);*

 *Getting exception below*


 *Server IPC version 9 cannot communicate with client version 4*

 NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13













Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Sean Owen
Of course, VERSION is supposed to be replaced by a real Hadoop version!

On Wed, Mar 25, 2015 at 12:04 PM, sandeep vura sandeepv...@gmail.com wrote:
 Build failed with following errors.

 I have executed the below following command.

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package


 [INFO]
 
 [INFO] BUILD FAILURE
 [INFO]
 
 [INFO] Total time: 2:11:59.461s
 [INFO] Finished at: Wed Mar 25 17:22:29 IST 2015
 [INFO] Final Memory: 30M/440M
 [INFO]
 
 [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve
 dep
 endencies for project org.apache.spark:spark-core_2.10:jar:1.2.1: Could not
 find
 artifact org.apache.hadoop:hadoop-client:jar:VERSION in central
 (https://repo1.
 maven.org/maven2) - [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
 swit
 ch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions, please
 rea
 d the following articles:
 [ERROR] [Help 1]
 http://cwiki.apache.org/confluence/display/MAVEN/DependencyReso
 lutionException
 [ERROR]
 [ERROR] After correcting the problems, you can resume the build with the
 command
 [ERROR]   mvn goals -rf :spark-core_2.10


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Anirudha Jadhav
Is there a way to have this dynamically pick the local IP?

Static assignment does not work because the workers are dynamically allocated
on Mesos.

On Wed, Mar 25, 2015 at 3:04 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 It says:
 ried to associate with unreachable remote address 
 [akka.tcp://sparkDriver@localhost:51849].
 Address is now gated for 5000 ms, all messages to this address will be
 delivered to dead letters. Reason: Connection refused: localhost/
 127.0.0.1:51849

 I'd suggest you changing this property:
 export SPARK_LOCAL_IP=127.0.0.1

 Point it to your network address like 192.168.1.10

 Thanks
 Best Regards

 On Tue, Mar 24, 2015 at 11:18 PM, Anirudha Jadhav aniru...@nyu.edu
 wrote:

 is there some setting i am missing:
 this is my spark-env.sh

 export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
 export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz
 export SPARK_LOCAL_IP=127.0.0.1



 here is what i see on the slave node.
 
 less
 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr
 

 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI '
 http://100.125.5.93/sparkx.tgz'
 I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading '
 http://100.125.5.93/sparkx.tgz' to
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
 I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
 into
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56'
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers
 for [TERM, HUP, INT]
 I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1
 I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave
 20150226-160708-78932-5050-8971-S0
 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as
 executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus
 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu
 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu
 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ubuntu); users
 with modify permissions: Set(ubuntu)
 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started
 15/03/24 02:30:37 INFO Remoting: Starting remoting
 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkExecutor@mesos-si2:50542]
 15/03/24 02:30:38 INFO Utils: Successfully started service
 'sparkExecutor' on port 50542.
 15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker:
 akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker
 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now
 gated for 5000 ms, all messages to this address will be delivered to dead
 letters. Reason: Connection refused: localhost/127.0.0.1:51849
 akka.actor.ActorNotFound: Actor not found for:
 ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/),
 Path(/user/MapOutputTracker)]
 at
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
 at
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 at
 akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
 at
 akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
 at
 akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
 at
 

Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Akhil Das
Remove SPARK_LOCAL_IP then?

Thanks
Best Regards

On Wed, Mar 25, 2015 at 6:45 PM, Anirudha Jadhav aniru...@nyu.edu wrote:

 is there a way to have this dynamically pick the local IP.

 static assignment does not work cos the workers  are dynamically allocated
 on mesos

 On Wed, Mar 25, 2015 at 3:04 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 It says:
 ried to associate with unreachable remote address 
 [akka.tcp://sparkDriver@localhost:51849].
 Address is now gated for 5000 ms, all messages to this address will be
 delivered to dead letters. Reason: Connection refused: localhost/
 127.0.0.1:51849

 I'd suggest you changing this property:
 export SPARK_LOCAL_IP=127.0.0.1

 Point it to your network address like 192.168.1.10

 Thanks
 Best Regards

 On Tue, Mar 24, 2015 at 11:18 PM, Anirudha Jadhav aniru...@nyu.edu
 wrote:

 is there some setting i am missing:
 this is my spark-env.sh

 export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
 export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz
 export SPARK_LOCAL_IP=127.0.0.1



 here is what i see on the slave node.
 
 less
 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr
 

 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI '
 http://100.125.5.93/sparkx.tgz'
 I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading '
 http://100.125.5.93/sparkx.tgz' to
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
 I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
 into
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56'
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers
 for [TERM, HUP, INT]
 I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1
 I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave
 20150226-160708-78932-5050-8971-S0
 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as
 executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus
 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu
 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu
 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ubuntu); users
 with modify permissions: Set(ubuntu)
 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started
 15/03/24 02:30:37 INFO Remoting: Starting remoting
 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on
 addresses :[akka.tcp://sparkExecutor@mesos-si2:50542]
 15/03/24 02:30:38 INFO Utils: Successfully started service
 'sparkExecutor' on port 50542.
 15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker:
 akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker
 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now
 gated for 5000 ms, all messages to this address will be delivered to dead
 letters. Reason: Connection refused: localhost/127.0.0.1:51849
 akka.actor.ActorNotFound: Actor not found for:
 ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/),
 Path(/user/MapOutputTracker)]
 at
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
 at
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 at
 akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
 at
 akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
 at
 

JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread ankur.jain
Hi,
I am trying to run a Spark on YARN program provided by Spark in the examples
directory using Amazon Kinesis on EMR cluster : 
I am using Spark 1.3.0 and EMR AMI: 3.5.0 

I've setup the Credentials 
export AWS_ACCESS_KEY_ID=XX
export AWS_SECRET_KEY=XXX

*A) This is the Kinesis Word Count Producer which ran Successfully : *
run-example org.apache.spark.examples.streaming.KinesisWordCountProducerASL
mySparkStream https://kinesis.us-east-1.amazonaws.com 1 5

*B) This one is the Normal Consumer using Spark Streaming which also ran
Successfully: *
run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASL
mySparkStream https://kinesis.us-east-1.amazonaws.com

*C) And this is the YARN based program which is NOT WORKING: *
run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN
mySparkStream https://kinesis.us-east-1.amazonaws.com\
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
15/03/25 11:52:45 INFO spark.SparkContext: Running Spark version 1.3.0
15/03/25 11:52:45 WARN spark.SparkConf: 
SPARK_CLASSPATH was detected (set to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar').
This is deprecated in Spark 1.0+.
Please instead use:
•   ./spark-submit with --driver-class-path to augment the driver classpath
•   spark.executor.extraClassPath to augment the executor classpath
15/03/25 11:52:45 WARN spark.SparkConf: Setting
'spark.executor.extraClassPath' to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
as a work-around.
15/03/25 11:52:45 WARN spark.SparkConf: Setting
'spark.driver.extraClassPath' to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
as a work-around.
15/03/25 11:52:46 INFO spark.SecurityManager: Changing view acls to: hadoop
15/03/25 11:52:46 INFO spark.SecurityManager: Changing modify acls to:
hadoop
15/03/25 11:52:46 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(hadoop); users with modify permissions: Set(hadoop)
15/03/25 11:52:47 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/03/25 11:52:48 INFO Remoting: Starting remoting
15/03/25 11:52:48 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@ip-10-80-175-92.ec2.internal:59504]
15/03/25 11:52:48 INFO util.Utils: Successfully started service
'sparkDriver' on port 59504.
15/03/25 11:52:48 INFO spark.SparkEnv: Registering MapOutputTracker
15/03/25 11:52:48 INFO spark.SparkEnv: Registering BlockManagerMaster
15/03/25 11:52:48 INFO storage.DiskBlockManager: Created local directory at
/mnt/spark/spark-120befbc-6dae-4751-b41f-dbf7b3d97616/blockmgr-d339d180-36f5-465f-bda3-cecccb23b1d3
15/03/25 11:52:48 INFO storage.MemoryStore: MemoryStore started with
capacity 265.4 MB
15/03/25 11:52:48 INFO spark.HttpFileServer: HTTP File server directory is
/mnt/spark/spark-85e88478-3dad-4fcf-a43a-efd15166bef3/httpd-6115870a-0d90-44df-aa7c-a6bd1a47e107
15/03/25 11:52:48 INFO spark.HttpServer: Starting HTTP Server
15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/25 11:52:49 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:44879
15/03/25 11:52:49 INFO util.Utils: Successfully started service 'HTTP file
server' on port 44879.
15/03/25 11:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/25 11:52:49 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
15/03/25 11:52:49 INFO util.Utils: Successfully started service 'SparkUI' on
port 4040.
15/03/25 11:52:49 INFO ui.SparkUI: Started SparkUI at
http://ip-10-80-175-92.ec2.internal:4040
15/03/25 11:52:50 INFO spark.SparkContext: Added JAR
file:/home/hadoop/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar at
http://10.80.175.92:44879/jars/spark-examples-1.3.0-hadoop2.4.0.jar with
timestamp 1427284370358
15/03/25 11:52:50 INFO cluster.YarnClusterScheduler: Created
YarnClusterScheduler
15/03/25 11:52:51 ERROR cluster.YarnClusterSchedulerBackend: Application ID
is not set.
15/03/25 11:52:51 INFO netty.NettyBlockTransferService: Server created on
49982
15/03/25 11:52:51 INFO storage.BlockManagerMaster: Trying to register
BlockManager
15/03/25 11:52:51 INFO storage.BlockManagerMasterActor: Registering block
manager ip-10-80-175-92.ec2.internal:49982 with 265.4 MB RAM,
BlockManagerId(<driver>, ip-10-80-175-92.ec2.internal, 49982)
15/03/25 11:52:51 INFO storage.BlockManagerMaster: Registered BlockManager
*Exception in thread "main" java.lang.NullPointerException*
*at

Re: NetworkWordCount + Spark standalone

2015-03-25 Thread Akhil Das
You can open the Master UI running on port 8080 of your Ubuntu machine and,
after submitting the job, you can see how many cores are being used etc.
from the UI.

Thanks
Best Regards

On Wed, Mar 25, 2015 at 6:50 PM, James King jakwebin...@gmail.com wrote:

 Thanks Akhil,

 Yes, indeed this is why it works when using local[2], but I'm unclear why
 it doesn't work when using the standalone daemons.

 Is there a way to check what cores are being seen when running against the
 standalone daemons?

 I'm running the master and worker on the same Ubuntu host. The driver program
 is running from a Windows machine.

 On the Ubuntu host the command cat /proc/cpuinfo | grep processor | wc -l
 gives 2.

 On Windows machine it is:
 NumberOfCores=2
 NumberOfLogicalProcessors=4


 On Wed, Mar 25, 2015 at 2:06 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Spark Streaming requires you to have a minimum of 2 cores, 1 for receiving
 your data and the other for processing. So when you say local[2], it
 basically initializes 2 threads on your local machine, 1 for receiving data
 from the network and the other for your word count processing.

 Thanks
 Best Regards

 On Wed, Mar 25, 2015 at 6:31 PM, James King jakwebin...@gmail.com
 wrote:

 I'm trying to run the Java NetworkWordCount example against a simple
 Spark standalone runtime of one master and one worker.

 But it doesn't seem to work; the text entered on the Netcat data server
 is not being picked up and printed to the Eclipse console output.

 However, if I use conf.setMaster("local[2]") it works, the correct text
 gets picked up and printed to the Eclipse console.

 Any ideas why, any pointers?






Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
The build failed with the following errors.

I executed the following command.

* mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean
package*


[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 2:11:59.461s
[INFO] Finished at: Wed Mar 25 17:22:29 IST 2015
[INFO] Final Memory: 30M/440M
[INFO]

[ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve
dependencies for project org.apache.spark:spark-core_2.10:jar:1.2.1: Could not
find artifact org.apache.hadoop:hadoop-client:jar:VERSION in central
(https://repo1.maven.org/maven2) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read
the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :spark-core_2.10


On Wed, Mar 25, 2015 at 3:38 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Just run :

 mvn -Pyarn -Phadoop-2.4 -D*hadoop.version=2.2* -DskipTests clean package


 ​

 Thanks
 Best Regards

 On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com
 wrote:

 Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh

 I am running the below command in spark/yarn directory where pom.xml file
 is available

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package

 Please correct me if i am wrong.




 On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Looks like you have to build Spark with related Hadoop version,
 otherwise you will meet exception as mentioned. you could follow this doc:
 http://spark.apache.org/docs/latest/building-spark.html

 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com:

 Hi Sparkers,

 I am trying to load data in spark with the following command

 *sqlContext.sql("LOAD DATA LOCAL INPATH
 '/home/spark12/sandeep/sandeep.txt' INTO TABLE src");*

 *Getting exception below*


 *Server IPC version 9 cannot communicate with client version 4*

 NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13










Re: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread Arush Kharbanda
Did you build for Kinesis using the profile *-Pkinesis-asl*?

On Wed, Mar 25, 2015 at 7:18 PM, ankur.jain ankur.j...@yash.com wrote:

 Hi,
 I am trying to run a Spark on YARN program provided by Spark in the
 examples
 directory using Amazon Kinesis on EMR cluster :
 I am using Spark 1.3.0 and EMR AMI: 3.5.0

 I've setup the Credentials
 export AWS_ACCESS_KEY_ID=XX
 export AWS_SECRET_KEY=XXX

 *A) This is the Kinesis Word Count Producer which ran Successfully : *
 run-example org.apache.spark.examples.streaming.KinesisWordCountProducerASL
 mySparkStream https://kinesis.us-east-1.amazonaws.com 1 5

 *B) This one is the Normal Consumer using Spark Streaming which also ran
 Successfully: *
 run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASL
 mySparkStream https://kinesis.us-east-1.amazonaws.com

 *C) And this is the YARN based program which is NOT WORKING: *
 run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN
 mySparkStream https://kinesis.us-east-1.amazonaws.com\
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 15/03/25 11:52:45 INFO spark.SparkContext: Running Spark version 1.3.0
 15/03/25 11:52:45 WARN spark.SparkConf:
 SPARK_CLASSPATH was detected (set to

 '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar').
 This is deprecated in Spark 1.0+.
 Please instead use:
 •   ./spark-submit with --driver-class-path to augment the driver
 classpath
 •   spark.executor.extraClassPath to augment the executor classpath
 15/03/25 11:52:45 WARN spark.SparkConf: Setting
 'spark.executor.extraClassPath' to

 '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
 as a work-around.
 15/03/25 11:52:45 WARN spark.SparkConf: Setting
 'spark.driver.extraClassPath' to

 '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
 as a work-around.
 15/03/25 11:52:46 INFO spark.SecurityManager: Changing view acls to: hadoop
 15/03/25 11:52:46 INFO spark.SecurityManager: Changing modify acls to:
 hadoop
 15/03/25 11:52:46 INFO spark.SecurityManager: SecurityManager:
 authentication disabled; ui acls disabled; users with view permissions:
 Set(hadoop); users with modify permissions: Set(hadoop)
 15/03/25 11:52:47 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/03/25 11:52:48 INFO Remoting: Starting remoting
 15/03/25 11:52:48 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@ip-10-80-175-92.ec2.internal:59504]
 15/03/25 11:52:48 INFO util.Utils: Successfully started service
 'sparkDriver' on port 59504.
 15/03/25 11:52:48 INFO spark.SparkEnv: Registering MapOutputTracker
 15/03/25 11:52:48 INFO spark.SparkEnv: Registering BlockManagerMaster
 15/03/25 11:52:48 INFO storage.DiskBlockManager: Created local directory at

 /mnt/spark/spark-120befbc-6dae-4751-b41f-dbf7b3d97616/blockmgr-d339d180-36f5-465f-bda3-cecccb23b1d3
 15/03/25 11:52:48 INFO storage.MemoryStore: MemoryStore started with
 capacity 265.4 MB
 15/03/25 11:52:48 INFO spark.HttpFileServer: HTTP File server directory is

 /mnt/spark/spark-85e88478-3dad-4fcf-a43a-efd15166bef3/httpd-6115870a-0d90-44df-aa7c-a6bd1a47e107
 15/03/25 11:52:48 INFO spark.HttpServer: Starting HTTP Server
 15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/03/25 11:52:49 INFO server.AbstractConnector: Started
 SocketConnector@0.0.0.0:44879
 15/03/25 11:52:49 INFO util.Utils: Successfully started service 'HTTP file
 server' on port 44879.
 15/03/25 11:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
 15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/03/25 11:52:49 INFO server.AbstractConnector: Started
 SelectChannelConnector@0.0.0.0:4040
 15/03/25 11:52:49 INFO util.Utils: Successfully started service 'SparkUI'
 on
 port 4040.
 15/03/25 11:52:49 INFO ui.SparkUI: Started SparkUI at
 http://ip-10-80-175-92.ec2.internal:4040
 15/03/25 11:52:50 INFO spark.SparkContext: Added JAR
 file:/home/hadoop/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar at
 http://10.80.175.92:44879/jars/spark-examples-1.3.0-hadoop2.4.0.jar with
 timestamp 1427284370358
 15/03/25 11:52:50 INFO cluster.YarnClusterScheduler: Created
 YarnClusterScheduler
 15/03/25 11:52:51 ERROR cluster.YarnClusterSchedulerBackend: Application ID
 is not set.
 15/03/25 11:52:51 INFO netty.NettyBlockTransferService: Server created on
 49982
 15/03/25 11:52:51 INFO storage.BlockManagerMaster: Trying to register
 BlockManager
 15/03/25 11:52:51 INFO storage.BlockManagerMasterActor: Registering block
 manager ip-10-80-175-92.ec2.internal:49982 with 

Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-25 Thread , Roy
Yes I do have other application already running.

Thanks for your explanation.



On Wed, Mar 25, 2015 at 2:49 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 It means you are already having 4 applications running on 4040, 4041,
 4042, 4043. And that's why it was able to run on 4044.

 You can do a netstat -pnat | grep 404 and see what processes are
 running.
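
 A side note, not part of the original reply: if the automatic retry on 4041, 4042, ...
 is not what you want, the UI port can also be pinned per application. spark.ui.port is
 a standard property; a minimal Scala sketch (the application name is hypothetical):

   import org.apache.spark.{SparkConf, SparkContext}

   val conf = new SparkConf()
     .setAppName("MyApp")           // hypothetical name
     .set("spark.ui.port", "4050")  // pick a known free port instead of the 4040 default
   val sc = new SparkContext(conf)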

 Thanks
 Best Regards

 On Wed, Mar 25, 2015 at 1:13 AM, , Roy rp...@njit.edu wrote:

 I get following message for each time I run spark job


1. 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address
already in use


 full trace is here

 http://pastebin.com/xSvRN01f

 how do I fix this ?

 I am on CDH 5.3.1

 thanks
 roy






Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
-D*hadoop.version=2.2*


Thanks
Best Regards

On Wed, Mar 25, 2015 at 5:34 PM, sandeep vura sandeepv...@gmail.com wrote:

 The build failed with the following errors.

 I executed the following command.

 * mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean
 package*


 [INFO]
 
 [INFO] BUILD FAILURE
 [INFO]
 
 [INFO] Total time: 2:11:59.461s
 [INFO] Finished at: Wed Mar 25 17:22:29 IST 2015
 [INFO] Final Memory: 30M/440M
 [INFO]
 
 [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve
 dependencies for project org.apache.spark:spark-core_2.10:jar:1.2.1: Could not
 find artifact org.apache.hadoop:hadoop-client:jar:VERSION in central
 (https://repo1.maven.org/maven2) -> [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions, please read
 the following articles:
 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
 [ERROR]
 [ERROR] After correcting the problems, you can resume the build with the command
 [ERROR]   mvn <goals> -rf :spark-core_2.10


 On Wed, Mar 25, 2015 at 3:38 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Just run :

 mvn -Pyarn -Phadoop-2.4 -D*hadoop.version=2.2* -DskipTests clean package


 ​

 Thanks
 Best Regards

 On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com
 wrote:

 Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh

 I am running the below command in spark/yarn directory where pom.xml
 file is available

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package

 Please correct me if i am wrong.




 On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Looks like you have to build Spark with related Hadoop version,
 otherwise you will meet exception as mentioned. you could follow this doc:
 http://spark.apache.org/docs/latest/building-spark.html

 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com:

 Hi Sparkers,

 I am trying to load data in spark with the following command

 *sqlContext.sql("LOAD DATA LOCAL INPATH
 '/home/spark12/sandeep/sandeep.txt' INTO TABLE src");*

 *Getting exception below*


 *Server IPC version 9 cannot communicate with client version 4*

 NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13











Re: 1.3 Hadoop File System problem

2015-03-25 Thread Jim Carroll
Thanks Patrick and Michael for your responses.

For anyone else that runs across this problem prior to 1.3.1 being released,
I've been pointed to this Jira ticket that's scheduled for 1.3.1:

https://issues.apache.org/jira/browse/SPARK-6351

Thanks again.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207p5.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: EC2 Having script run at startup

2015-03-25 Thread rahulkumar-aws
You can use AWS user-data feature.
Try this, if it helps:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html



-
Software Developer
SigmoidAnalytics, Bangalore

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/EC2-Having-script-run-at-startup-tp22197p4.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



NetworkWordCount + Spark standalone

2015-03-25 Thread James King
I'm trying to run the Java NetworkWordCount example against a simple Spark
standalone runtime of one master and one worker.

But it doesn't seem to work; the text entered on the Netcat data server is
not being picked up and printed to the Eclipse console output.

However, if I use conf.setMaster("local[2]") it works, the correct text gets
picked up and printed to the Eclipse console.

Any ideas why, any pointers?


Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread Peter Rudenko

Hi Martin, here are 2 possibilities to overcome this:

1) Put your logic into org.apache.spark package in your project - then 
everything would be accessible.

2) Dirty trick:

object SparkVector extends HashingTF { val VectorUDT: DataType = outputDataType }


Then you can do something like this:

StructType(Seq(StructField("vectorTypeColumn", SparkVector.VectorUDT, false)))

Thanks,
Peter Rudenko

On 2015-03-25 13:14, zapletal-mar...@email.cz wrote:


Sean,

thanks for your response. I am familiar with NoSuchMethodException
in general, but I think it is not the case this time. The code
actually attempts to get the parameter by name using val m =
this.getClass.getMethodName(paramName).


This may be a bug, but it is only a side effect caused by the real 
problem I am facing. My issue is that VectorUDT is not accessible by 
user code and therefore it is not possible to use custom ML pipeline 
with the existing Predictors (see the last two paragraphs in my first 
email).


Best Regards,
Martin

-- Original message --
From: Sean Owen so...@cloudera.com
To: zapletal-mar...@email.cz
Date: 25. 3. 2015 11:05:54
Subject: Re: Spark ML Pipeline inaccessible types


NoSuchMethodError in general means that your runtime and compile-time
environments are different. I think you need to first make sure you
don't have mismatching versions of Spark.

On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote:
 Hi,

 I have started implementing a machine learning pipeline using
Spark 1.3.0
 and the new pipelining API and DataFrames. I got to a point
where I have my
 training data set prepared using a sequence of Transformers, but
I am
 struggling to actually train a model and use it for predictions.

 I am getting a java.lang.NoSuchMethodException:

org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
 exception thrown at checkInputColumn method in Params trait when
using a
 Predictor (LinearRegression in my case, but that should not
matter). This
 looks like a bug - the exception is thrown when executing
getParam(colName)
 when the require(actualDataType.equals(datatype), ...)
requirement is not
 met so the expected requirement failed exception is not thrown
and is hidden
 by the unexpected NoSuchMethodException instead. I can raise a
bug if this
 really is an issue and I am not using something incorrectly.

 The problem I am facing however is that the Predictor expects
features to
 have VectorUDT type as defined in Predictor class (protected def
 featuresDataType: DataType = new VectorUDT). But since this type is
 private[spark] my Transformer can not prepare features with this
type which
 then correctly results in the exception above when I use a
different type.

 Is there a way to define a custom Pipeline that would be able to
use the
 existing Predictors without having to bypass the access modifiers or
 reimplement something or is the pipelining API not yet expected
to be used
 in this way?

 Thanks,
 Martin



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


​


Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Sandy Ryza
Hi Sachin,

It appears that the application master is failing.  To figure out what's
wrong you need to get the logs for the application master.
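
For reference, a sketch of pulling those logs with the YARN CLI (the application id is
taken from the quoted log below; substitute your own):

yarn logs -applicationId application_1427124496008_0028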

-Sandy

On Wed, Mar 25, 2015 at 7:05 AM, Sachin Singh sachin.sha...@gmail.com
wrote:

 The OS I am using is Linux.
 When I run simply with master yarn, it runs fine.

 Regards
 Sachin

 On Wed, Mar 25, 2015 at 4:25 PM, Xi Shen davidshe...@gmail.com wrote:

 What is your environment? I remember I had similar error when running
 spark-shell --master yarn-client in Windows environment.


 On Wed, Mar 25, 2015 at 9:07 PM sachin Singh sachin.sha...@gmail.com
 wrote:

 Hi ,
 when I am submitting a Spark job in cluster mode I am getting the error below in the
 hadoop-yarn log.
 Does someone have any idea? Please suggest.

 2015-03-25 23:35:22,467 INFO
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
 application_1427124496008_0028 State change from FINAL_SAVING to FAILED
 2015-03-25 23:35:22,467 WARN
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs
 OPERATION=Application Finished - Failed TARGET=RMAppManager
  RESULT=FAILURE
 DESCRIPTION=App failed with state: FAILED   PERMISSIONS=Application
 application_1427124496008_0028 failed 2 times due to AM Container for
 appattempt_1427124496008_0028_02 exited with  exitCode: 13 due to:
 Exception from container-launch.
 Container id: container_1427124496008_0028_02_01
 Exit code: 13
 Stack trace: ExitCodeException exitCode=13:
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
 at org.apache.hadoop.util.Shell.run(Shell.java:455)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(
 Shell.java:702)
 at
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.
 launchContainer(DefaultContainerExecutor.java:197)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.
 launcher.ContainerLaunch.call(ContainerLaunch.java:299)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.
 launcher.ContainerLaunch.call(ContainerLaunch.java:81)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(
 ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(
 ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 Container exited with a non-zero exit code 13
 .Failing this attempt.. Failing the application.
 APPID=application_1427124496008_0028



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/issue-while-submitting-Spark-Job-as-
 master-yarn-cluster-tp0.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark-sql query got exception.Help

2015-03-25 Thread Cheng Lian
Oh, just noticed that you were calling sc.setSystemProperty. Actually
you need to set this property in SparkConf or in spark-defaults.conf.
And there are two configurations related to Kryo buffer size:


 * spark.kryoserializer.buffer.mb, which is the initial size, and
 * spark.kryoserializer.buffer.max.mb, which is the max buffer size.

Make sure the 2nd one is larger (it seems that Kryo doesn’t check for it).
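
A minimal Scala sketch of setting both properties up front (the values are only
examples; the property names are the two listed above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "8")        // initial buffer size
  .set("spark.kryoserializer.buffer.max.mb", "256")  // max buffer size; keep this one larger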

Cheng

On 3/25/15 7:31 PM, 李铖 wrote:


Here is the full track

15/03/25 17:48:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 
1, cloud1): com.esotericsoftware.kryo.KryoException: Buffer overflow. 
Available: 0, required: 39135

at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)

at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)

at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

2015-03-25 19:05 GMT+08:00 Cheng Lian lian.cs@gmail.com:


Could you please provide the full stack trace?


On 3/25/15 6:26 PM, 李铖 wrote:

It is OK when I query data from a small HDFS file.
But if the HDFS file is 152 MB, I get this exception.
I tried this code: sc.setSystemProperty("spark.kryoserializer.buffer.mb", "256").
The error persists.

```
com.esotericsoftware.kryo.KryoException: Buffer overflow.
Available: 0, required: 39135
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at

com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at


```




​


foreachRDD execution

2015-03-25 Thread Luis Ángel Vicente Sánchez
I have a simple and probably dumb question about foreachRDD.

We are using spark streaming + cassandra to compute concurrent users every
5min. Our batch size is 10secs and our block interval is 2.5secs.

At the end of the workflow we are using foreachRDD to join the data in the RDD
with existing data in Cassandra, update the counters and then save it back
to Cassandra.

To the best of my understanding, in this scenario, spark streaming produces
one RDD every 10secs and foreachRDD executes them sequentially, that is,
foreachRDD would never run in parallel.

Am I right?

Regards,

Luis


Re: NetworkWordCount + Spark standalone

2015-03-25 Thread Akhil Das
Spark Streaming requires you to have a minimum of 2 cores, 1 for receiving
your data and the other for processing. So when you say local[2], it
basically initializes 2 threads on your local machine, 1 for receiving data
from the network and the other for your word count processing.
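
Purely as an illustration (the host name and core count below are assumptions, not
taken from this thread), pointing the same program at a standalone master instead of
local[2] would look roughly like:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("JavaNetworkWordCount")
  .setMaster("spark://ubuntu-host:7077")  // the master URL shown on the standalone UI
  .set("spark.cores.max", "2")            // streaming needs at least 2 cores in total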

Thanks
Best Regards

On Wed, Mar 25, 2015 at 6:31 PM, James King jakwebin...@gmail.com wrote:

 I'm trying to run the Java NetworkWordCount example against a simple Spark
 standalone runtime of one master and one worker.

 But it doesn't seem to work; the text entered on the Netcat data server is
 not being picked up and printed to the Eclipse console output.

 However, if I use conf.setMaster("local[2]") it works, the correct text
 gets picked up and printed to the Eclipse console.

 Any ideas why, any pointers?



Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
*I am using Hadoop 2.4; should I mention -Dhadoop.version=2.2?*

*$ hadoop version*
*Hadoop 2.4.1*
*Subversion http://svn.apache.org/repos/asf/hadoop/common
http://svn.apache.org/repos/asf/hadoop/common -r 1604318*
*Compiled by jenkins on 2014-06-21T05:43Z*
*Compiled with protoc 2.5.0*
*From source with checksum bb7ac0a3c73dc131f4844b873c74b630*
*This command was run using
/home/hadoop24/hadoop-2.4.1/share/hadoop/common/hadoop-common-2.4.1.jar*
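
If the intention is to build against that Hadoop 2.4.1 install, a likely command (an
assumption based on the build documentation linked earlier in this thread, not something
confirmed here) is to replace the literal VERSION placeholder with the real version:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package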




On Wed, Mar 25, 2015 at 5:38 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 -D*hadoop.version=2.2*


 Thanks
 Best Regards

 On Wed, Mar 25, 2015 at 5:34 PM, sandeep vura sandeepv...@gmail.com
 wrote:

 The build failed with the following errors.

 I executed the following command.

 * mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean
 package*


 [INFO]
 
 [INFO] BUILD FAILURE
 [INFO]
 
 [INFO] Total time: 2:11:59.461s
 [INFO] Finished at: Wed Mar 25 17:22:29 IST 2015
 [INFO] Final Memory: 30M/440M
 [INFO]
 
 [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve
 dependencies for project org.apache.spark:spark-core_2.10:jar:1.2.1: Could not
 find artifact org.apache.hadoop:hadoop-client:jar:VERSION in central
 (https://repo1.maven.org/maven2) -> [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions, please read
 the following articles:
 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
 [ERROR]
 [ERROR] After correcting the problems, you can resume the build with the command
 [ERROR]   mvn <goals> -rf :spark-core_2.10


 On Wed, Mar 25, 2015 at 3:38 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Just run :

 mvn -Pyarn -Phadoop-2.4 -D*hadoop.version=2.2* -DskipTests clean package


 ​

 Thanks
 Best Regards

 On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura sandeepv...@gmail.com
 wrote:

 Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh

 I am running the below command in spark/yarn directory where pom.xml
 file is available

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package

 Please correct me if i am wrong.




 On Wed, Mar 25, 2015 at 12:55 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Looks like you have to build Spark with related Hadoop version,
 otherwise you will meet exception as mentioned. you could follow this doc:
 http://spark.apache.org/docs/latest/building-spark.html

 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com:

 Hi Sparkers,

 I am trying to load data in spark with the following command

 *sqlContext.sql("LOAD DATA LOCAL INPATH
 '/home/spark12/sandeep/sandeep.txt' INTO TABLE src");*

 *Getting exception below*


 *Server IPC version 9 cannot communicate with client version 4*

 NOte : i am using Hadoop 2.2 version and spark 1.2 and hive 0.13












Re: NetworkWordCount + Spark standalone

2015-03-25 Thread James King
Thanks Akhil,

Yes, indeed this is why it works when using local[2], but I'm unclear why
it doesn't work when using the standalone daemons.

Is there a way to check what cores are being seen when running against the
standalone daemons?

I'm running the master and worker on the same Ubuntu host. The driver program
is running from a Windows machine.

On the Ubuntu host the command cat /proc/cpuinfo | grep processor | wc -l
gives 2.

On Windows machine it is:
NumberOfCores=2
NumberOfLogicalProcessors=4


On Wed, Mar 25, 2015 at 2:06 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Spark Streaming requires you to have a minimum of 2 cores, 1 for receiving
 your data and the other for processing. So when you say local[2], it
 basically initializes 2 threads on your local machine, 1 for receiving data
 from the network and the other for your word count processing.

 Thanks
 Best Regards

 On Wed, Mar 25, 2015 at 6:31 PM, James King jakwebin...@gmail.com wrote:

 I'm trying to run the Java NetworkWordCount example against a simple
 Spark standalone runtime of one master and one worker.

 But it doesn't seem to work; the text entered on the Netcat data server
 is not being picked up and printed to the Eclipse console output.

 However, if I use conf.setMaster("local[2]") it works, the correct text
 gets picked up and printed to the Eclipse console.

 Any ideas why, any pointers?





Re: Spark as a service

2015-03-25 Thread Irfan Ahmad
You're welcome. How did it go?


*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Wed, Mar 25, 2015 at 7:53 AM, Ashish Mukherjee 
ashish.mukher...@gmail.com wrote:

 Thank you

 On Tue, Mar 24, 2015 at 8:40 PM, Irfan Ahmad ir...@cloudphysics.com
 wrote:

 Also look at the spark-kernel and spark job server projects.

 Irfan
 On Mar 24, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 Perhaps this project, https://github.com/calrissian/spark-jetty-server,
 could help with your requirements.

 On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele 
 jeffrey.jed...@gmail.com wrote:

 I don't think there's a general approach to that - the use cases are
 just too different. If you really need it, you will probably have to
 implement it yourself in the driver of your application.

 PS: Make sure to use the reply to all button so that the mailing list
 is included in your reply. Otherwise only I will get your mail.

 Regards,
 Jeff

 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com
 :

 Hi Jeffrey,

 Thanks. Yes, this resolves the SQL problem. My bad - I was looking for
 something which would work for Spark Streaming and other Spark jobs too,
 not just SQL.

 Regards,
 Ashish

 On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele 
 jeffrey.jed...@gmail.com wrote:

 Hi Ashish,
 this might be what you're looking for:


 https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server

 Regards,
 Jeff

 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee 
 ashish.mukher...@gmail.com:

 Hello,

 As of now, if I have to execute a Spark job, I need to create a jar
 and deploy it.  If I need to run a dynamically formed SQL from a Web
 application, is there any way of using SparkSQL in this manner? Perhaps,
 through a Web Service or something similar.

 Regards,
 Ashish









Write Parquet File with spark-streaming with Spark 1.3

2015-03-25 Thread richiesgr
Hi

I've succeeded in writing a Kafka stream to a Parquet file in Spark 1.2, but I can't
make it work with Spark 1.3.
In streaming I can't use saveAsParquetFile() because I can't add data to
an existing Parquet file.

I know that it's possible to stream data directly into Parquet.
Could you help me by providing a little sample of which API I need to use?
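
A rough sketch of one possible append route in 1.3, assuming the DataFrame
save(path, source, mode) overload with SaveMode.Append works as expected for the
Parquet source (stream, sqlContext and the record type are placeholders):

import org.apache.spark.sql.SaveMode

stream.foreachRDD { rdd =>
  val df = sqlContext.createDataFrame(rdd)  // rdd of case-class records
  df.save("hdfs:///data/events.parquet", "parquet", SaveMode.Append)
}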

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Write-Parquet-File-with-spark-streaming-with-Spark-1-3-tp8.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How do you write Dataframes to elasticsearch

2015-03-25 Thread Nick Pentreath
Spark 1.3 is not supported by elasticsearch-hadoop yet but will be very soon: 
https://github.com/elastic/elasticsearch-hadoop/issues/400





However in the meantime you could use df.toRDD.saveToEs - though you may have 
to manipulate the Row object perhaps to extract fields, not sure if it will 
serialize directly to ES JSON...
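
A rough sketch of that RDD route (the index/type name and the Row-to-Map conversion are
assumptions, not tested against Spark 1.3):

import org.elasticsearch.spark._

val fields = df.columns
val asMaps = df.rdd.map(row => fields.zip(row.toSeq).toMap)  // Row -> Map[String, Any]
asMaps.saveToEs("myindex/mytype")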



—
Sent from Mailbox

On Wed, Mar 25, 2015 at 2:07 PM, yamanoj manoj.per...@gmail.com wrote:

 It seems that elasticsearch-spark_2.10 does not currently support Spark 1.3.
 Could you tell me if there is an alternative way to save DataFrames to
 Elasticsearch?
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-write-Dataframes-to-elasticsearch-tp3.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

Spark Streaming - Minimizing batch interval

2015-03-25 Thread RodrigoB
I've been given a feature requirement that means processing events at a
latency lower than 0.25ms.

Meaning I would have to make sure that Spark Streaming gets new events from
the messaging layer within that period of time. Has anyone achieved
such numbers using a Spark cluster? Or would this even be possible, even
assuming we don't use the write-ahead logs...

tnks in advance!

Rod



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Minimizing-batch-interval-tp7.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
I have an EC2 cluster created using Spark version 1.2.1.
And I have an SBT project.
Now I want to upgrade to Spark 1.3 and use the new features.
Below are the issues.
Sorry for the long post.
Appreciate your help.
Thanks
-Roni

Question - Do I have to create a new cluster using spark 1.3?

Here is what I did -

In my SBT file I changed to -
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

But then I started getting compilation error. along with
Here are some of the libraries that were evicted:
[warn]  * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0
[warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0
[warn] Run 'evicted' to see detailed eviction warnings

 constructor cannot be instantiated to expected type;
[error]  found   : (T1, T2)
[error]  required: org.apache.spark.sql.catalyst.expressions.Row
[error] val ty = joinRDD.map{case(word,
(file1Counts, file2Counts)) => KmerIntesect(word, file1Counts, "xyz")}
[error]  ^

Here is my SBT and code --
SBT -

version := 1.0

scalaVersion := 2.10.4

resolvers += Sonatype OSS Snapshots at 
https://oss.sonatype.org/content/repositories/snapshots;;
resolvers += Maven Repo1 at https://repo1.maven.org/maven2;;
resolvers += Maven Repo at 
https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;;

/* Dependencies - %% appends Scala version to artifactId */
libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0
libraryDependencies += org.apache.spark %% spark-core % 1.3.0
libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0
libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10


CODE --
import org.apache.spark.{SparkConf, SparkContext}
case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

object preDefKmerIntersection {
  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
    val sc = new SparkContext(sparkConf)
    import sqlContext.createSchemaRDD
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val bedFile = sc.textFile("s3n://a/b/c", 40)
    val hgfasta = sc.textFile("hdfs://a/b/c", 40)
    val hgPair = hgfasta.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
    val filtered = hgPair.filter(kv => kv._2 == 1)
    val bedPair = bedFile.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
    val joinRDD = bedPair.join(filtered)
    val ty = joinRDD.map { case (word, (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts, "xyz") }
    ty.registerTempTable("KmerIntesect")
    ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
  }
}
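
For what it's worth, a sketch of what the tail of that main method might look like on
the 1.3 API (untested; in 1.3 the createSchemaRDD import is replaced by the SQLContext
implicits plus an explicit toDF()):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val ty = joinRDD.map { case (word, (file1Counts, file2Counts)) =>
  KmerIntesect(word, file1Counts, "xyz")
}.toDF()
ty.registerTempTable("KmerIntesect")
ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")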


Re: Spark Streaming - Minimizing batch interval

2015-03-25 Thread Sean Owen
I don't think it's feasible to set a batch interval of 0.25ms. Even at
tens of ms the overhead of the framework is a large factor. Do you
mean 0.25s = 250ms?

Related thoughts, and I don't know if they apply to your case:

If you mean, can you just read off the source that quickly? yes.

Sometimes when people say I need very low latency streaming because I
need to answer quickly they are really trying to design a synchronous
API, and I don't think asynchronous streaming is the right
architecture.

Sometimes people really mean I need to process 400 items per ms on
average, which is different and entirely possible.
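
For concreteness, a 250ms batch interval itself is easy to express (sketch only; conf is
an existing SparkConf, and this says nothing about whether the rest of the pipeline
keeps up):

import org.apache.spark.streaming.{Milliseconds, StreamingContext}

val ssc = new StreamingContext(conf, Milliseconds(250))  // 250 ms batches; 0.25 ms is not practical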



On Wed, Mar 25, 2015 at 2:53 PM, RodrigoB rodrigo.boav...@aspect.com wrote:
 I've been given a feature requirement that means processing events at a
 latency lower than 0.25ms.

 Meaning I would have to make sure that Spark Streaming gets new events from
 the messaging layer within that period of time. Has anyone achieved
 such numbers using a Spark cluster? Or would this even be possible, even
 assuming we don't use the write-ahead logs...

 tnks in advance!

 Rod



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Minimizing-batch-interval-tp7.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
I have a SparkSQL DataFrame with a few billion rows that I need to
quickly filter down to a few hundred thousand rows, using an operation like
(syntax may not be correct)

df = df[ df.filter(lambda x: x.key_col in approved_keys)]

I was thinking about serializing the data using parquet and saving it to
S3, however as I want to optimize for filtering speed I'm not sure this is
the best option.
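
One sketch of doing the filter on the Spark side (names are hypothetical, and it assumes
the approved keys fit comfortably in driver memory) is to turn the small key set into
its own DataFrame and inner-join on the key column:

import sqlContext.implicits._

val keysDF = sc.parallelize(approvedKeys.toSeq.map(Tuple1(_))).toDF("key_col")
val filtered = df.join(keysDF, df("key_col") === keysDF("key_col"))  // keeps only approved keys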

-- 
Stuart Layton


Recovered state for updateStateByKey and incremental streams processing

2015-03-25 Thread Ravi Reddy
I want to use the restore-from-checkpoint mechanism to continue from the last accumulated
word counts and process new streams of data. This recovery process keeps
accurate state of the accumulated counters (calculated by
updateStateByKey) after failure/recovery or a temporary shutdown/upgrade to new
code.

However, the recommendation seems to indicate you have to delete the
checkpoint data if you upgrade to new code. How would this work if I change
the word count accumulation logic and still want to continue from the
last remembered state? An example would be that the word counters could be
weighted by logic that is applied to streams arriving from a later
point in time.

This is one example, but there are quite a few scenarios where one needs to
continue from the previous state as remembered by updateStateByKey and apply
new logic.

 code snippets below 

As in the RecoverableNetworkWordCount example, we should build
setupInputStreamAndProcessWordCounts in the context of the retrieved
checkpoint only. If setupInputStreamAndProcessWordCounts is called outside
createContext, you will get the error "[Receiver-0-1427260249292] is not
unique!".

  def createContext(checkPointDir: String, host: String, port: Int) = {
    // If you do not see this printed, that means the StreamingContext has been loaded
    // from the new checkpoint
    println("Creating new context")
    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")

    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(checkPointDir)

    setupInputStreamAndProcessWordCounts(ssc, host, port)

    ssc
  }

and invoke in main as

  def main(args: Array[String]) {
    val checkPointDir = "./saved_state"

    if (args.length < 2) {
      System.err.println("Usage: SavedNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    val ssc = StreamingContext.getOrCreate(checkPointDir, () => {
        createContext(checkPointDir, args(0), args(1).toInt)
      })

    //setupInputStreamAndProcessWordCounts(ssc, args(0), args(1).toInt)

    ssc.start()
    ssc.awaitTermination()
  }
 }

  def setupInputStreamAndProcessWordCounts(ssc: StreamingContext, hostname: String, port: Int) {
    // InputDStream has to be created inside createContext, else you get an error
    val lines = ssc.socketTextStream(hostname, port)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    // Update and print the cumulative count using updateStateByKey
    val countsDstream = wordDstream.updateStateByKey[Int](updateFunc)
    countsDstream.print() // print or save to external system
  }
  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Recovered-state-for-updateStateByKey-and-incremental-streams-processing-tp9.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



python : Out of memory: Kill process

2015-03-25 Thread Eduardo Cusa
Hi guys, I am running the following function with spark-submit and the OS is
killing my process:


  def getRdd(self, date, provider):
    path = 's3n://'+AWS_BUCKET+'/'+date+'/*.log.gz'
    log2 = self.sqlContext.jsonFile(path)
    log2.registerTempTable('log_test')
    log2.cache()
    out = self.sqlContext.sql("SELECT user, tax from log_test where provider = '"+provider+"' and country <> ''").map(lambda row: (row.user, row.tax))
    print out
    return map((lambda (x, y): (x, list(y))),
               sorted(out.groupByKey(2000).collect()))



The input dataset has 57 zip files (2 GB)

The same process with a smaller dataset completed successfully

Any ideas on how to debug this are welcome.

Regards
Eduardo


Re: OOM for HiveFromSpark example

2015-03-25 Thread ๏̯͡๏
I am facing same issue, posted a new thread. Please respond.

On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang zzh...@hortonworks.com wrote:

 Hi Folks,

 I am trying to run hive context in yarn-cluster mode, but met some error.
 Does anybody know what cause the issue.

 I use following cmd to build the distribution:

  ./make-distribution.sh -Phive -Phive-thriftserver  -Pyarn  -Phadoop-2.4

 15/01/13 17:59:42 INFO cluster.YarnClusterScheduler:
 YarnClusterScheduler.postStartHook done
 15/01/13 17:59:42 INFO storage.BlockManagerMasterActor: Registering block
 manager cn122-10.l42scl.hortonworks.com:56157 with 1589.8 MB RAM,
 BlockManagerId(2, cn122-10.l42scl.hortonworks.com, 56157)
 15/01/13 17:59:43 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
 NOT EXISTS src (key INT, value STRING)
 15/01/13 17:59:43 INFO parse.ParseDriver: Parse Completed
 15/01/13 17:59:44 INFO metastore.HiveMetaStore: 0: Opening raw store with
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/01/13 17:59:44 INFO metastore.ObjectStore: ObjectStore, initialize
 called
 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property
 datanucleus.cache.level2 unknown - will be ignored
 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property
 hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not
 present in CLASSPATH (or one of dependencies)
 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not
 present in CLASSPATH (or one of dependencies)
 15/01/13 17:59:52 INFO metastore.ObjectStore: Setting MetaStore object pin
 classes with
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/01/13 17:59:52 INFO metastore.MetaStoreDirectSql: MySQL check failed,
 assuming we are not on mysql: Lexical error at line 1, column 5.
 Encountered: @ (64), after : .
 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
 embedded-only so does not have its own datastore table.
 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as
 embedded-only so does not have its own datastore table.
 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
 embedded-only so does not have its own datastore table.
 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as
 embedded-only so does not have its own datastore table.
 15/01/13 18:00:00 INFO metastore.ObjectStore: Initialized ObjectStore
 15/01/13 18:00:00 WARN metastore.ObjectStore: Version information not
 found in metastore. hive.metastore.schema.verification is not enabled so
 recording the schema version 0.13.1aa
 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added admin role in
 metastore
 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added public role in
 metastore
 15/01/13 18:00:01 INFO metastore.HiveMetaStore: No user is added in admin
 role, since config is empty
 15/01/13 18:00:01 INFO session.SessionState: No Tez session required at
 this point. hive.execution.engine=mr.
 15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=Driver.run
 from=org.apache.hadoop.hive.ql.Driver
 15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=TimeToSubmit
 from=org.apache.hadoop.hive.ql.Driver
 15/01/13 18:00:02 INFO ql.Driver: Concurrency mode is disabled, not
 creating a lock manager
 15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=compile
 from=org.apache.hadoop.hive.ql.Driver
 15/01/13 18:00:03 INFO log.PerfLogger: PERFLOG method=parse
 from=org.apache.hadoop.hive.ql.Driver
 15/01/13 18:00:03 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
 NOT EXISTS src (key INT, value STRING)
 15/01/13 18:00:03 INFO parse.ParseDriver: Parse Completed
 15/01/13 18:00:03 INFO log.PerfLogger: /PERFLOG method=parse
 start=1421190003030 end=1421190003031 duration=1
 from=org.apache.hadoop.hive.ql.Driver
 15/01/13 18:00:03 INFO log.PerfLogger: PERFLOG method=semanticAnalyze
 from=org.apache.hadoop.hive.ql.Driver
 15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
 15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Creating table src
 position=27
 15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_table : db=default
 tbl=src
 15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang
 ip=unknown-ip-addr  cmd=get_table : db=default tbl=src
 15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_database: default
 15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang
 ip=unknown-ip-addr  cmd=get_database: default
 15/01/13 18:00:03 INFO ql.Driver: Semantic Analysis Completed
 15/01/13 18:00:03 INFO log.PerfLogger: /PERFLOG method=semanticAnalyze
 start=1421190003031 end=1421190003406 duration=375
 from=org.apache.hadoop.hive.ql.Driver
 15/01/13 18:00:03 INFO 

Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
I solved this by increasing the PermGen memory size in the driver.

-XX:MaxPermSize=512m
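
In case it helps, one common way to hand that flag to the driver is spark-defaults.conf
(spark.driver.extraJavaOptions is a standard property; the value just mirrors the flag
above):

spark.driver.extraJavaOptions  -XX:MaxPermSize=512m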

Thanks.

Zhan Zhang

On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

I am facing same issue, posted a new thread. Please respond.

On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang zzh...@hortonworks.com wrote:
Hi Folks,

I am trying to run hive context in yarn-cluster mode, but met some error. Does 
anybody know what cause the issue.

I use following cmd to build the distribution:

 ./make-distribution.sh -Phive -Phive-thriftserver  -Pyarn  -Phadoop-2.4

15/01/13 17:59:42 INFO cluster.YarnClusterScheduler: 
YarnClusterScheduler.postStartHook done
15/01/13 17:59:42 INFO storage.BlockManagerMasterActor: Registering block
manager cn122-10.l42scl.hortonworks.com:56157 with 1589.8 MB RAM,
BlockManagerId(2, cn122-10.l42scl.hortonworks.com, 56157)
15/01/13 17:59:43 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT 
EXISTS src (key INT, value STRING)
15/01/13 17:59:43 INFO parse.ParseDriver: Parse Completed
15/01/13 17:59:44 INFO metastore.HiveMetaStore: 0: Opening raw store with 
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/01/13 17:59:44 INFO metastore.ObjectStore: ObjectStore, initialize called
15/01/13 17:59:44 INFO DataNucleus.Persistence: Property 
datanucleus.cache.level2 unknown - will be ignored
15/01/13 17:59:44 INFO DataNucleus.Persistence: Property 
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not present 
in CLASSPATH (or one of dependencies)
15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not present 
in CLASSPATH (or one of dependencies)
15/01/13 17:59:52 INFO metastore.ObjectStore: Setting MetaStore object pin 
classes with 
hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
15/01/13 17:59:52 INFO metastore.MetaStoreDirectSql: MySQL check failed, 
assuming we are not on mysql: Lexical error at line 1, column 5.  Encountered: 
@ (64), after : .
15/01/13 17:59:53 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
15/01/13 17:59:53 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
15/01/13 17:59:59 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
15/01/13 17:59:59 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
15/01/13 18:00:00 INFO metastore.ObjectStore: Initialized ObjectStore
15/01/13 18:00:00 WARN metastore.ObjectStore: Version information not found in 
metastore. hive.metastore.schema.verification is not enabled so recording the 
schema version 0.13.1aa
15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added admin role in metastore
15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added public role in metastore
15/01/13 18:00:01 INFO metastore.HiveMetaStore: No user is added in admin role, 
since config is empty
15/01/13 18:00:01 INFO session.SessionState: No Tez session required at this 
point. hive.execution.engine=mr.
15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=Driver.run 
from=org.apache.hadoop.hive.ql.Driver
15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=TimeToSubmit 
from=org.apache.hadoop.hive.ql.Driver
15/01/13 18:00:02 INFO ql.Driver: Concurrency mode is disabled, not creating a 
lock manager
15/01/13 18:00:02 INFO log.PerfLogger: PERFLOG method=compile 
from=org.apache.hadoop.hive.ql.Driver
15/01/13 18:00:03 INFO log.PerfLogger: PERFLOG method=parse 
from=org.apache.hadoop.hive.ql.Driver
15/01/13 18:00:03 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT 
EXISTS src (key INT, value STRING)
15/01/13 18:00:03 INFO parse.ParseDriver: Parse Completed
15/01/13 18:00:03 INFO log.PerfLogger: /PERFLOG method=parse 
start=1421190003030 end=1421190003031 duration=1 
from=org.apache.hadoop.hive.ql.Driver
15/01/13 18:00:03 INFO log.PerfLogger: PERFLOG method=semanticAnalyze 
from=org.apache.hadoop.hive.ql.Driver
15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Creating table src position=27
15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_table : db=default 
tbl=src
15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang  ip=unknown-ip-addr  
cmd=get_table : db=default tbl=src
15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_database: default
15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang  ip=unknown-ip-addr  

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
For the Spark SQL parts, 1.3 breaks backwards compatibility, because before
1.3, Spark SQL was considered experimental where API changes were allowed.

So, H2O and ADA compatible with 1.2.X might not work with 1.3.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote:

 Even if H2o and ADA are dependent on 1.2.1 , it should be backword
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
 I tried sbt clean compile .. same errror
 Thanks
 _R

 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

 What version of Spark do the other dependencies rely on (Adam and H2O?) -
 that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I changed to -
 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

 But then I started getting compilation errors, along with:
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) ->
 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty = joinRDD.map{case(word,
 (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := "1.0"

 scalaVersion := "2.10.4"

 resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
 resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2"
 resolvers += "Maven Repo" at "https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/"

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
 libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
 libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"


 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

 object preDefKmerIntersection {
   def main(args: Array[String]) {
     val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
     val sc = new SparkContext(sparkConf)
     import sqlContext.createSchemaRDD
     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
     val bedFile = sc.textFile("s3n://a/b/c", 40)
     val hgfasta = sc.textFile("hdfs://a/b/c", 40)
     val hgPair = hgfasta.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
     val filtered = hgPair.filter(kv => kv._2 == 1)
     val bedPair = bedFile.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
     val joinRDD = bedPair.join(filtered)
     val ty = joinRDD.map{ case (word, (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts, "xyz") }
     ty.registerTempTable("KmerIntesect")

     ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
   }
 }






Unable to Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-25 Thread ๏̯͡๏
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables



I modified the Hive query but run into same error. (
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables)


val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src_spark (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)



Command

./bin/spark-submit -v --master yarn-cluster --jars
/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
--num-executors 3 --driver-memory 8g --executor-memory 2g --executor-cores
1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp
spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16
input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2


Input

-sh-4.1$ ls -l examples/src/main/resources/kv1.txt
-rw-r--r-- 1 dvasthimal gid-dvasthimal 5812 Mar  5 17:31
examples/src/main/resources/kv1.txt
-sh-4.1$ head examples/src/main/resources/kv1.txt
238val_238
86val_86
311val_311
27val_27
165val_165
409val_409
255val_255
278val_278
98val_98
484val_484
-sh-4.1$

Log

/apache/hadoop/bin/yarn logs -applicationId application_1426715280024_82757

…
…
…


15/03/25 07:52:44 INFO metastore.HiveMetaStore: No user is added in admin
role, since config is empty
15/03/25 07:52:44 INFO session.SessionState: No Tez session required at
this point. hive.execution.engine=mr.
15/03/25 07:52:47 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
NOT EXISTS src_spark (key INT, value STRING)
15/03/25 07:52:47 INFO parse.ParseDriver: Parse Completed
15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=Driver.run
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=TimeToSubmit
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO ql.Driver: Concurrency mode is disabled, not
creating a lock manager
15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=compile
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=parse
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
NOT EXISTS src_spark (key INT, value STRING)
15/03/25 07:52:48 INFO parse.ParseDriver: Parse Completed
15/03/25 07:52:48 INFO log.PerfLogger: /PERFLOG method=parse
start=1427295168392 end=1427295168393 duration=1
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=semanticAnalyze
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
15/03/25 07:52:48 INFO parse.SemanticAnalyzer: Creating table src_spark
position=27
15/03/25 07:52:48 INFO metastore.HiveMetaStore: 0: get_table : db=default
tbl=src_spark
15/03/25 07:52:48 INFO HiveMetaStore.audit: ugi=dvasthimal
ip=unknown-ip-addr cmd=get_table : db=default tbl=src_spark
15/03/25 07:52:48 INFO metastore.HiveMetaStore: 0: get_database: default
15/03/25 07:52:48 INFO HiveMetaStore.audit: ugi=dvasthimal
ip=unknown-ip-addr cmd=get_database: default
15/03/25 07:52:48 INFO ql.Driver: Semantic Analysis Completed
15/03/25 07:52:48 INFO log.PerfLogger: /PERFLOG method=semanticAnalyze
start=1427295168393 end=1427295168595 duration=202
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO ql.Driver: Returning Hive schema:
Schema(fieldSchemas:null, properties:null)
15/03/25 07:52:48 INFO log.PerfLogger: /PERFLOG method=compile
start=1427295168352 end=1427295168607 duration=255
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=Driver.execute
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO ql.Driver: Starting command: CREATE TABLE IF NOT
EXISTS src_spark (key INT, value STRING)
15/03/25 07:52:48 INFO log.PerfLogger: /PERFLOG method=TimeToSubmit
start=1427295168349 end=1427295168625 duration=276
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=runTasks
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:48 INFO log.PerfLogger: PERFLOG method=task.DDL.Stage-0
from=org.apache.hadoop.hive.ql.Driver
15/03/25 07:52:51 INFO exec.DDLTask: Default to LazySimpleSerDe for table
src_spark
15/03/25 07:52:52 INFO metastore.HiveMetaStore: 0: create_table:
Table(tableName:src_spark, dbName:default, owner:dvasthimal,
createTime:1427295171, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:key, type:int, comment:null),
FieldSchema(name:value, 

Re: OutOfMemory : Java heap space error

2015-03-25 Thread ๏̯͡๏
I am facing same issue, posted a new thread. Please respond.

On Wed, Jul 9, 2014 at 1:56 AM, Rahul Bhojwani rahulbhojwani2...@gmail.com
wrote:

 Hi,

 My code was running properly but then it suddenly gave this error. Can you
 just put some light on it.

 ###
 0 KB, free: 38.7 MB)
 14/07/09 01:46:12 INFO BlockManagerMaster: Updated info of block rdd_2212_4
 14/07/09 01:46:13 INFO PythonRDD: Times: total = 1486, boot = 698, init =
 626, finish = 162
 Exception in thread stdin writer for python 14/07/09 01:46:14 INFO
 MemoryStore: ensureFreeSpace(61480) called with cur
 Mem=270794224, maxMem=311387750
 java.lang.OutOfMemoryError: Java heap space
 at java.io.BufferedOutputStream.init(Unknown Source)
 at
 org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:62)
 14/07/09 01:46:15 INFO MemoryStore: Block rdd_2212_0 stored as values to
 memory (estimated size 60.0 KB, free 38.7 MB)
 Exception in thread stdin writer for python java.lang.OutOfMemoryError:
 Java heap space
 at java.io.BufferedOutputStream.init(Unknown Source)
 at
 org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:62)
 14/07/09 01:46:18 INFO BlockManagerMasterActor$BlockManagerInfo: Added
 rdd_2212_0 in memory on shawn-PC:51451 (size: 60.
 0 KB, free: 38.7 MB)
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File F:\spark-0.9.1\spark-0.9.1\bin\..\/python/pyspark/worker.py, line
 50, in main
 split_index = read_int(infile)
   File F:\spark-0.9.1\spark-0.9.1\python\pyspark\serializers.py, line
 328, in read_int
 raise EOFError
 EOFError

 14/07/09 01:46:25 INFO BlockManagerMaster: Updated info of block rdd_2212_0
 Exception in thread stdin writer for python java.lang.OutOfMemoryError:
 Java heap space
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File F:\spark-0.9.1\spark-0.9.1\bin\..\/python/pyspark/worker.py, line
 50, in main
 split_index = read_int(infile)
   File F:\spark-0.9.1\spark-0.9.1\python\pyspark\serializers.py, line
 328, in read_int
 raise EOFError
 EOFError

 Exception in thread Executor task launch worker-3
 java.lang.OutOfMemoryError: Java heap space

 Exception: java.lang.OutOfMemoryError thrown from the
 UncaughtExceptionHandler in thread spark-akka.actor.default-dispa
 tcher-15
 Exception in thread Executor task launch worker-1 Exception in thread
 Executor task launch worker-2 java.lang.OutOfM
 emoryError: Java heap space
 java.lang.OutOfMemoryError: Java heap space
 Exception in thread Executor task launch worker-0 Exception in thread
 Executor task launch worker-5 java.lang.OutOfM
 emoryError: Java heap space
 java.lang.OutOfMemoryError: Java heap space
 14/07/09 01:46:52 WARN BlockManagerMaster: Error sending message to
 BlockManagerMaster in 1 attempts
 akka.pattern.AskTimeoutException:
 Recipient[Actor[akka://spark/user/BlockManagerMaster#920823400]] had
 already been term
 inated.
 at
 akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
 at
 org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:161)
 at
 org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52)
 at org.apache.spark.storage.BlockManager.org
 $apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97)

 at
 org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135)
 at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
 at
 akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
 Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
 Source)
 at java.lang.Thread.run(Unknown Source)
 14/07/09 01:46:56 WARN BlockManagerMaster: Error sending message to
 BlockManagerMaster in 2 attempts
 akka.pattern.AskTimeoutException:
 Recipient[Actor[akka://spark/user/BlockManagerMaster#920823400]] had
 already been term
 inated.
 at
 akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
 at
 org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:161)
 at
 org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52)
 at org.apache.spark.storage.BlockManager.org
 $apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97)

 at
 org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135)
 at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
 at
 akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
 Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
 Source)
 at java.lang.Thread.run(Unknown Source)
 14/07/09 

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
What version of Spark do the other dependencies rely on (Adam and H2O?) - that 
could be it




Or try sbt clean compile 



—
Sent from Mailbox

On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni
 Question - Do I have to create a new cluster using spark 1.3?
 Here is what I did -
 In my SBT file I  changed to -
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings
  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty = joinRDD.map{case(word,
 (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 [error]  ^
 Here is my SBT and code --
 SBT -
 version := 1.0
 scalaVersion := 2.10.4
 resolvers += Sonatype OSS Snapshots at 
 https://oss.sonatype.org/content/repositories/snapshots;;
 resolvers += Maven Repo1 at https://repo1.maven.org/maven2;;
 resolvers += Maven Repo at 
 https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;;
 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0
 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10
 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)
 object preDefKmerIntersection {
   def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName(preDefKmer-intersect)
  val sc = new SparkContext(sparkConf)
 import sqlContext.createSchemaRDD
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val bedFile = sc.textFile(s3n://a/b/c,40)
  val hgfasta = sc.textFile(hdfs://a/b/c,40)
  val hgPair = hgfasta.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val filtered = hgPair.filter(kv = kv._2 == 1)
  val bedPair = bedFile.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val joinRDD = bedPair.join(filtered)
 val ty = joinRDD.map{case(word, (file1Counts, file2Counts))
 = KmerIntesect(word, file1Counts,xyz)}
 ty.registerTempTable(KmerIntesect)
 ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet)
   }
 }

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Even if H2o and ADA are dependent on 1.2.1, it should be backward
compatible, right?
So using 1.3 should not break them.
And the code is not using the classes from those libs.
I tried sbt clean compile .. same errror
Thanks
_R

On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 What version of Spark do the other dependencies rely on (Adam and H2O?) -
 that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0

 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty = joinRDD.map{case(word,
 (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := 1.0

 scalaVersion := 2.10.4

 resolvers += Sonatype OSS Snapshots at 
 https://oss.sonatype.org/content/repositories/snapshots;;
 resolvers += Maven Repo1 at https://repo1.maven.org/maven2;;
 resolvers += Maven Repo at 
 https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;;

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0
 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10


 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

 object preDefKmerIntersection {
   def main(args: Array[String]) {

  val sparkConf = new SparkConf().setAppName(preDefKmer-intersect)
  val sc = new SparkContext(sparkConf)
 import sqlContext.createSchemaRDD
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val bedFile = sc.textFile(s3n://a/b/c,40)
  val hgfasta = sc.textFile(hdfs://a/b/c,40)
  val hgPair = hgfasta.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val filtered = hgPair.filter(kv = kv._2 == 1)
  val bedPair = bedFile.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val joinRDD = bedPair.join(filtered)
 val ty = joinRDD.map{case(word, (file1Counts,
 file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 ty.registerTempTable(KmerIntesect)
 ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet)
   }
 }





RE: Date and decimal datatype not working

2015-03-25 Thread BASAK, ANANDA
Thanks. This library is only available with Spark 1.3. I am using version 
1.2.1. Before I upgrade to 1.3, I want to try what can be done in 1.2.1.

So I am using the following:
val MyDataset = sqlContext.sql("my select query")

MyDataset.map(t => 
t(0)+"|"+t(1)+"|"+t(2)+"|"+t(3)+"|"+t(4)+"|"+t(5)).saveAsTextFile("/my_destination_path")

But it is giving following error:
15/03/24 17:05:51 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID 106)
java.lang.NumberFormatException: For input string: ""
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:453)
at java.lang.Long.parseLong(Long.java:483)
at 
scala.collection.immutable.StringLike$class.toLong(StringLike.scala:230)

is there something wrong with the TSTAMP field which is Long datatype?
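
The stack trace points at an empty string reaching Long.parseLong, which usually means a blank line or an empty TSTAMP field in the raw file rather than a problem with the Long type itself. A minimal defensive-parsing sketch (the safeLong helper is hypothetical and not part of the original code; adjust the field checks to your data):

import scala.util.Try

// Hypothetical helper: None for null, empty or non-numeric strings.
def safeLong(s: String): Option[Long] =
  if (s == null || s.trim.isEmpty) None else Try(s.trim.toLong).toOption

// Skip malformed rows before the type conversions run. Note split("\\|")
// escapes the pipe, since String.split takes a regular expression.
val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt")
  .map(_.split("\\|"))
  .filter(p => p.length >= 7 && safeLong(p(0)).isDefined)
  .map(p => ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3),
                  BigDecimal(p(4)), BigDecimal(p(5)), BigDecimal(p(6))))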

Thanks  Regards
---
Ananda Basak
Ph: 425-213-7092

From: Yin Huai [mailto:yh...@databricks.com]
Sent: Monday, March 23, 2015 8:55 PM
To: BASAK, ANANDA
Cc: user@spark.apache.org
Subject: Re: Date and decimal datatype not working

To store to csv file, you can use the Spark-CSV 
(https://github.com/databricks/spark-csv) library.

On Mon, Mar 23, 2015 at 5:35 PM, BASAK, ANANDA 
ab9...@att.com wrote:
Thanks. This worked well as per your suggestions. I had to run following:
val TABLE_A = 
sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p => 
ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), BigDecimal(p(4)), 
BigDecimal(p(5)), BigDecimal(p(6))))

Now I am stuck at another step. I have run a SQL query, where I am Selecting 
from all the fields with some where clause , TSTAMP filtered with date range 
and order by TSTAMP clause. That is running fine.

Then I am trying to store the output in a CSV file. I am using 
saveAsTextFile(“filename”) function. But it is giving error. Can you please 
help me to write a proper syntax to store output in a CSV file?


Thanks  Regards
---
Ananda Basak
Ph: 425-213-7092

From: BASAK, ANANDA
Sent: Tuesday, March 17, 2015 3:08 PM
To: Yin Huai
Cc: user@spark.apache.orgmailto:user@spark.apache.org
Subject: RE: Date and decimal datatype not working

Ok, thanks for the suggestions. Let me try and will confirm all.

Regards
Ananda

From: Yin Huai [mailto:yh...@databricks.com]
Sent: Tuesday, March 17, 2015 3:04 PM
To: BASAK, ANANDA
Cc: user@spark.apache.org
Subject: Re: Date and decimal datatype not working

p(0) is a String. So, you need to explicitly convert it to a Long. e.g. 
p(0).trim.toLong. You also need to do it for p(2). For those BigDecimals value, 
you need to create BigDecimal objects from your String values.

On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA 
ab9...@att.com wrote:
Hi All,
I am very new in Spark world. Just started some test coding from last week. I 
am using spark-1.2.1-bin-hadoop2.4 and scala coding.
I am having issues while using Date and decimal data types. Following is my 
code that I am simply running on scala prompt. I am trying to define a table 
and point that to my flat file containing raw data (pipe delimited format). 
Once that is done, I will run some SQL queries and put the output data in to 
another flat file with pipe delimited format.

***
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD


// Define row and table
case class ROW_A(
  TSTAMP:   Long,
  USIDAN: String,
  SECNT:Int,
  SECT:   String,
  BLOCK_NUM:BigDecimal,
  BLOCK_DEN:BigDecimal,
  BLOCK_PCT:BigDecimal)

val TABLE_A = 
sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p => 
ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

TABLE_A.registerTempTable("TABLE_A")

***

The second last command is giving error, like following:
console:17: error: type mismatch;
found   : String
required: Long

Looks like the content from my flat file are considered as String always and 
not as Date or decimal. How can I make Spark to take them as Date or decimal 
types?

Regards
Ananda




Re: Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Denny Lee
As you noted, you can change the spark.driver.maxResultSize value in your
Spark Configurations (https://spark.apache.org/docs/1.2.0/configuration.html).
Please reference the Spark Properties section noting that you can modify
these properties via the spark-defaults.conf or via SparkConf().

HTH!
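
For example, a minimal sketch of setting it programmatically (the 2g value is only illustrative; setting it to 0 removes the limit entirely):

import org.apache.spark.{SparkConf, SparkContext}

// Raise the cap on the total size of serialized results collected back to the driver.
val conf = new SparkConf().set("spark.driver.maxResultSize", "2g")
val sc = new SparkContext(conf)

The equivalent spark-defaults.conf entry is a single line: spark.driver.maxResultSize 2g.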



On Wed, Mar 25, 2015 at 8:01 AM Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  Hi



 I ran a spark job and got the following error. Can anybody tell me how to
 work around this problem? For example how can I increase
 spark.driver.maxResultSize? Thanks.

  org.apache.spark.SparkException: Job aborted due to stage failure: Total
 size of serialized results

 of 128 tasks (1029.1 MB) is bigger than spark.driver.maxResultSize (1024.0
 MB)

 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobA

 ndIndependentStages(DAGScheduler.scala:1214)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:12

 03)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:12

 02)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler

 .scala:696)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler

 .scala:696)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)

 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(D

 AGScheduler.scala:1420)

 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala

 :1375)

 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

at akka.actor.ActorCell.invoke(ActorCell.scala:487)

 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)

 at akka.dispatch.Mailbox.run(Mailbox.scala:220)

 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala

 :393)

 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 15/03/25 10:48:38 WARN TaskSetManager: Lost task 128.0 in stage 199.0 (TID
 6324, INT1-CAS01.pcc.lexi

 snexis.com): TaskKilled (killed intentionally)



 Ningjun





Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-25 Thread ๏̯͡๏
I have a YARN cluster where the max memory allowed is 16GB. I set 12G for
my driver, however i see OutOFMemory error even for this program
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables .
What do you suggest ?

On Wed, Mar 25, 2015 at 8:23 AM, Thomas Gerber thomas.ger...@radius.com
wrote:

 So,

 1. I reduced my  -XX:ThreadStackSize to 5m (instead of 10m - default is
 1m), which is still OK for my need.
 2. I reduced the executor memory to 44GB for a 60GB machine (instead of
 49GB).

 This seems to have helped. Thanks to Matthew and Sean.

 Thomas

 On Tue, Mar 24, 2015 at 3:49 PM, Matt Silvey matt.sil...@videoamp.com
 wrote:

 My memory is hazy on this but aren't there hidden limitations to
 Linux-based threads?  I ran into some issues a couple of years ago where,
 and here is the fuzzy part, the kernel wants to reserve virtual memory per
 thread equal to the stack size.  When the total amount of reserved memory
 (not necessarily resident memory) exceeds the memory of the system it
 throws an OOM.  I'm looking for material to back this up.  Sorry for the
 initial vague response.
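
As a rough, illustrative check using the numbers reported in this thread: the ~400 threads seen on an executor with -XX:ThreadStackSize=10m reserve roughly 400 × 10 MB ≈ 4 GB of virtual address space for stacks alone (and the ~780 threads on the master about 7.8 GB), on top of a 49 GB heap on a 60 GB machine. That is consistent with hitting per-process virtual memory or thread limits (ulimit -v, ulimit -u, /proc/sys/kernel/threads-max) rather than running out of heap.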

 Matthew

 On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber thomas.ger...@radius.com
  wrote:

 Additional notes:
 I did not find anything wrong with the number of threads (ps -u USER -L
 | wc -l): around 780 on the master and 400 on executors. I am running on
 100 r3.2xlarge.

 On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber 
 thomas.ger...@radius.com wrote:

 Hello,

 I am seeing various crashes in spark on large jobs which all share a
 similar exception:

 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:714)

 I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help.

 Does anyone know how to avoid those kinds of errors?

 Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor
 extra java options, which might have amplified the problem.

 Thanks for you help,
 Thomas







-- 
Deepak


Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
Unfortunately you are now hitting a bug (that is fixed in master and will
be released in 1.3.1 hopefully next week).  However, even with that your
query is still ambiguous and you will need to use aliases:

val df_1 = df.filter(df("event") === 0)
  .select("country", "cnt").as("a")
val df_2 = df.filter(df("event") === 3)
  .select("country", "cnt").as("b")
val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")
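
An alternative sketch that sidesteps the ambiguity by renaming one side's join column first (assuming withColumnRenamed is available in your Spark version):

// Rename df_1's "country" so the joined result has unambiguous column names.
val df_1r = df_1.withColumnRenamed("country", "country_1")
val both  = df_2.join(df_1r, df_2("country") === df_1r("country_1"), "left_outer")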



On Tue, Mar 24, 2015 at 11:57 PM, S Krishna skrishna...@gmail.com wrote:

 Hi,

 Thanks for your response. I modified my code as per your suggestion, but
 now I am getting a runtime error. Here's my code:

 val df_1 = df.filter( df(event) === 0)
   . select(country, cnt)

 val df_2 = df.filter( df(event) === 3)
   . select(country, cnt)

 df_1.show()
 //produces the following output :
 // countrycnt
 //   tw   3000
 //   uk   2000
 //   us   1000

 df_2.show()
 //produces the following output :
 // countrycnt
 //   tw   25
 //   uk   200
 //   us   95

 val both = df_2.join(df_1, df_2(country)===df_1(country), left_outer)

 I am getting the following error when executing the join statement:

 java.util.NoSuchElementException: next on empty iterator.

 This error seems to be originating at DataFrame.join (line 133 in
 DataFrame.scala).

 The show() results show that both dataframes do have columns named
 country and that they are non-empty. I also tried the simpler join ( i.e.
 df_2.join(df_1) ) and got the same error stated above.

 I would like to know what is wrong with the join statement above.

 thanks
























 On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust mich...@databricks.com
 wrote:

 You need to use `===`, so that you are constructing a column expression
 instead of evaluating the standard scala equality method.  Calling methods
 to access columns (i.e. df.county is only supported in python).

 val join_df =  df1.join( df2, df1(country) === df2(country),
 left_outer)

 On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote:

 Hi,

 I am trying to port some code that was working in Spark 1.2.0 on the
 latest
 version, Spark 1.3.0. This code involves a left outer join between two
 SchemaRDDs which I am now trying to change to a left outer join between 2
 DataFrames. I followed the example  for left outer join of DataFrame at

 https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

 Here's my code, where df1 and df2 are the 2 dataframes I am joining on
 the
 country field:

  val join_df =  df1.join( df2,  df1.country == df2.country, left_outer)

 But I got a compilation error that value  country is not a member of
 sql.DataFrame

 I  also tried the following:
  val join_df =  df1.join( df2, df1(country) == df2(country),
 left_outer)

 I got a compilation error that it is a Boolean whereas a Column is
 required.

 So what is the correct Column expression I need to provide for joining
 the 2
 dataframes on a specific field ?

 thanks








 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
Ah I see now you are trying to use a spark 1.2 cluster - you will need to be 
running spark 1.3 on your EC2 cluster in order to run programs built against 
spark 1.3.




You will need to terminate and restart your cluster with spark 1.3 



—
Sent from Mailbox

On Wed, Mar 25, 2015 at 6:39 PM, roni roni.epi...@gmail.com wrote:

 Even if H2o and ADA are dependent on 1.2.1 , it should be backword
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
 I tried sbt clean compile .. same errror
 Thanks
 _R
 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com
 wrote:
 What version of Spark do the other dependencies rely on (Adam and H2O?) -
 that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0

 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty = joinRDD.map{case(word,
 (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := 1.0

 scalaVersion := 2.10.4

 resolvers += Sonatype OSS Snapshots at 
 https://oss.sonatype.org/content/repositories/snapshots;;
 resolvers += Maven Repo1 at https://repo1.maven.org/maven2;;
 resolvers += Maven Repo at 
 https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;;

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0
 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10


 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

 object preDefKmerIntersection {
   def main(args: Array[String]) {

  val sparkConf = new SparkConf().setAppName(preDefKmer-intersect)
  val sc = new SparkContext(sparkConf)
 import sqlContext.createSchemaRDD
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val bedFile = sc.textFile(s3n://a/b/c,40)
  val hgfasta = sc.textFile(hdfs://a/b/c,40)
  val hgPair = hgfasta.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val filtered = hgPair.filter(kv = kv._2 == 1)
  val bedPair = bedFile.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val joinRDD = bedPair.join(filtered)
 val ty = joinRDD.map{case(word, (file1Counts,
 file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 ty.registerTempTable(KmerIntesect)
 ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet)
   }
 }




[no subject]

2015-03-25 Thread Himanish Kushary
Hi,

I have a RDD of pairs of strings like below :

(A,B)
(B,C)
(C,D)
(A,D)
(E,F)
(B,F)

I need to transform/filter this into a RDD of pairs that does not repeat a
string once it has been used once. So something like ,

(A,B)
(C,D)
(E,F)

(B,C) is out because B has already ben used in (A,B), (A,D) is out because
A (and D) has been used etc.

I was thinking of a option of using a shared variable to keep track of what
has already been used but that may only work for a single partition and
would not scale for larger dataset.

Is there any other efficient way to accomplish this ?

-- 
Thanks  Regards
Himanish
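
Since the "first use wins" rule is order-dependent and inherently sequential, one simple option is a greedy pass over the pairs. A minimal driver-side sketch (it assumes the pair list is small enough to collect, so it will not scale to a very large RDD):

val pairs = sc.parallelize(Seq(("A","B"), ("B","C"), ("C","D"), ("A","D"), ("E","F"), ("B","F")))

val kept = pairs.collect().foldLeft((Set.empty[String], Vector.empty[(String, String)])) {
  case ((used, acc), (l, r)) =>
    if (used.contains(l) || used.contains(r)) (used, acc)   // one side already used: drop the pair
    else (used + l + r, acc :+ ((l, r)))                    // keep the pair, mark both strings used
}._2

// kept: Vector((A,B), (C,D), (E,F))

For genuinely large inputs this is essentially a graph matching problem and would need a different, distributed formulation.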


Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Michael Armbrust
You should also try increasing the perm gen size: -XX:MaxPermSize=512m
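
For example, a spark-shell invocation might look roughly like this (illustrative values; the PermGen flag applies to Java 7 and earlier):

./bin/spark-shell --driver-memory 4g --driver-java-options "-XX:MaxPermSize=512m"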

On Wed, Mar 25, 2015 at 2:37 AM, Ted Yu yuzhih...@gmail.com wrote:

 Can you try giving Spark driver more heap ?

 Cheers



 On Mar 25, 2015, at 2:14 AM, Todd Leo sliznmail...@gmail.com wrote:

 Hi,

 I am using *Spark SQL* to query on my *Hive cluster*, following Spark SQL
 and DataFrame Guide
 https://spark.apache.org/docs/latest/sql-programming-guide.html step by
 step. However, my HiveQL via sqlContext.sql() fails and
 java.lang.OutOfMemoryError was raised. The expected result of such query is
 considered to be small (by adding limit 1000 clause). My code is shown
 below:

 scala> import sqlContext.implicits._
 scala> val df = sqlContext.sql("select * from some_table where 
 logdate=2015-03-24 limit 1000")

 and the error msg:

 [ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27] 
 [ActorSystem(sparkDriver)] Uncaught fatal error from thread 
 [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
 java.lang.OutOfMemoryError: GC overhead limit exceeded

 the master heap memory is set by -Xms512m -Xmx512m, while the workers are set by 
 -Xms4096M -Xmx4096M, which I presume is sufficient for this trivial query.

 Additionally, after restarted the spark-shell and re-run the limit 5 query
 , the df object is returned and can be printed by df.show(), but other
 APIs fails on OutOfMemoryError, namely, df.count(),
 df.select(some_field).show() and so forth.

 I understand that the RDD can be collected to the master so that further
 transformations can be applied. Since DataFrame has “richer optimizations under
 the hood” and is the natural convention for an R/Julia user, I really hope this
 error can be tackled, and that DataFrame is robust enough to depend on.

 Thanks in advance!

 REGARDS,
 Todd
 ​

 --
 View this message in context: OutOfMemoryError when using DataFrame
 created by Spark SQL
 http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-when-using-DataFrame-created-by-Spark-SQL-tp22219.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.




Re:

2015-03-25 Thread Nathan Kronenfeld
What would it do with the following dataset?

(A, B)
(A, C)
(B, D)

On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com
wrote:

 Hi,

 I have a RDD of pairs of strings like below :

 (A,B)
 (B,C)
 (C,D)
 (A,D)
 (E,F)
 (B,F)

 I need to transform/filter this into a RDD of pairs that does not repeat a
 string once it has been used once. So something like ,

 (A,B)
 (C,D)
 (E,F)

 (B,C) is out because B has already ben used in (A,B), (A,D) is out because
 A (and D) has been used etc.

 I was thinking of a option of using a shared variable to keep track of
 what has already been used but that may only work for a single partition
 and would not scale for larger dataset.

 Is there any other efficient way to accomplish this ?

 --
 Thanks  Regards
 Himanish



Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
The only way to do in using python currently is to use the string based
filter API (where you pass us an expression as a string, and we parse it
using our SQL parser).

from pyspark.sql import Row
from pyspark.sql.functions import *

df = sc.parallelize([Row(name="test")]).toDF()
df.filter("name in ('a', 'b')").collect()
Out[1]: []

df.filter("name in ('test')").collect()
Out[2]: [Row(name=u'test')]

In general you want to avoid lambda functions whenever you can do the same
thing a dataframe expression.  This is because your lambda function is a
black box that we cannot optimize (though you should certainly use them for
the advanced stuff that expressions can't handle).

I opened SPARK-6536 https://issues.apache.org/jira/browse/SPARK-6536 to
provide a nicer interface for this.


On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton stuart.lay...@gmail.com
wrote:

 I have a SparkSQL dataframe with a a few billion rows that I need to
 quickly filter down to a few hundred thousand rows, using an operation like
 (syntax may not be correct)

 df = df[ df.filter(lambda x: x.key_col in approved_keys)]

 I was thinking about serializing the data using parquet and saving it to
 S3, however as I want to optimize for filtering speed I'm not sure this is
 the best option.

 --
 Stuart Layton



Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Thanks Dean and Nick.
So, I removed the ADAM and H2o from my SBT as I was not using them.
I got the code to compile - only to fail while running with -
SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21
Exception in thread main java.lang.NoClassDefFoundError:
org/apache/spark/rdd/RDD$
at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

This line is where I do a JOIN operation.
val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt))
 val filtered = hgPair.filter(kv = kv._2 == 1)
 val bedPair = bedFile.map(_.split (,)).map(a= (a(0),
a(1).trim().toInt))
* val joinRDD = bedPair.join(filtered)   *
Any idea whats going on?
I have data on the EC2 so I am avoiding creating the new cluster , but just
upgrading and changing the code to use 1.3 and Spark SQL
Thanks
Roni



On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com wrote:

 For the Spark SQL parts, 1.3 breaks backwards compatibility, because
 before 1.3, Spark SQL was considered experimental where API changes were
 allowed.

 So, H2O and ADA compatible with 1.2.X might not work with 1.3.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote:

 Even if H2o and ADA are dependent on 1.2.1 , it should be backword
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
 I tried sbt clean compile .. same errror
 Thanks
 _R

 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com
  wrote:

 What version of Spark do the other dependencies rely on (Adam and H2O?)
 - that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0

 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -
 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty = joinRDD.map{case(word,
 (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := 1.0

 scalaVersion := 2.10.4

 resolvers += Sonatype OSS Snapshots at 
 https://oss.sonatype.org/content/repositories/snapshots;;
 resolvers += Maven Repo1 at https://repo1.maven.org/maven2;;
 resolvers += Maven Repo at 
 https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;;

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0
 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10


 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

 object preDefKmerIntersection {
   def main(args: Array[String]) {

  val sparkConf = new SparkConf().setAppName(preDefKmer-intersect)
  val sc = new SparkContext(sparkConf)
 import sqlContext.createSchemaRDD
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val bedFile = sc.textFile(s3n://a/b/c,40)
  val hgfasta = sc.textFile(hdfs://a/b/c,40)
  val hgPair = hgfasta.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
  val filtered = hgPair.filter(kv = kv._2 == 1)
  val bedPair = bedFile.map(_.split (,)).map(a=
 (a(0), a(1).trim().toInt))
  val joinRDD = bedPair.join(filtered)
 val ty = joinRDD.map{case(word, (file1Counts,
 file2Counts)) = KmerIntesect(word, file1Counts,xyz)}
 ty.registerTempTable(KmerIntesect)

 ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet)
   }
 }







Re:

2015-03-25 Thread Himanish Kushary
It will only give (A,B). I am generating the pair from combinations of the
the strings A,B,C and D, so the pairs (ignoring order) would be

(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)

On successful filtering using the original condition it will transform to
(A,B) and (C,D)

On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld 
nkronenfeld@uncharted.software wrote:

 What would it do with the following dataset?

 (A, B)
 (A, C)
 (B, D)


 On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com
 wrote:

 Hi,

 I have a RDD of pairs of strings like below :

 (A,B)
 (B,C)
 (C,D)
 (A,D)
 (E,F)
 (B,F)

 I need to transform/filter this into a RDD of pairs that does not repeat
 a string once it has been used once. So something like ,

 (A,B)
 (C,D)
 (E,F)

 (B,C) is out because B has already ben used in (A,B), (A,D) is out
 because A (and D) has been used etc.

 I was thinking of a option of using a shared variable to keep track of
 what has already been used but that may only work for a single partition
 and would not scale for larger dataset.

 Is there any other efficient way to accomplish this ?

 --
 Thanks  Regards
 Himanish





-- 
Thanks  Regards
Himanish


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Weird. Are you running using SBT console? It should have the spark-core jar
on the classpath. Similarly, spark-shell or spark-submit should work, but
be sure you're using the same version of Spark when running as when
compiling. Also, you might need to add spark-sql to your SBT dependencies,
but that shouldn't be this issue.
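
A sketch of what the relevant SBT lines might look like, with the Spark artifacts marked "provided" so the cluster's own 1.3.0 jars are used at runtime instead of copies bundled into the application jar:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.3.0" % "provided"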

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote:

 Thanks Dean and Nick.
 So, I removed the ADAM and H2o from my SBT as I was not using them.
 I got the code to compile  - only for fail while running with -
 SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/rdd/RDD$
 at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

 This line is where I do a JOIN operation.
 val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt))
  val filtered = hgPair.filter(kv = kv._2 == 1)
  val bedPair = bedFile.map(_.split (,)).map(a= (a(0),
 a(1).trim().toInt))
 * val joinRDD = bedPair.join(filtered)   *
 Any idea whats going on?
 I have data on the EC2 so I am avoiding creating the new cluster , but
 just upgrading and changing the code to use 1.3 and Spark SQL
 Thanks
 Roni



 On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com
 wrote:

 For the Spark SQL parts, 1.3 breaks backwards compatibility, because
 before 1.3, Spark SQL was considered experimental where API changes were
 allowed.

 So, H2O and ADA compatible with 1.2.X might not work with 1.3.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote:

 Even if H2o and ADA are dependent on 1.2.1 , it should be backword
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
 I tried sbt clean compile .. same errror
 Thanks
 _R

 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 What version of Spark do the other dependencies rely on (Adam and H2O?)
 - that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0

 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -
 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty =
 joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word,
 file1Counts,xyz)}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := 1.0

 scalaVersion := 2.10.4

 resolvers += Sonatype OSS Snapshots at 
 https://oss.sonatype.org/content/repositories/snapshots;;
 resolvers += Maven Repo1 at https://repo1.maven.org/maven2;;
 resolvers += Maven Repo at 
 https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;;

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0
 libraryDependencies += org.apache.spark %% spark-core % 1.3.0
 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0
 libraryDependencies += ai.h2o % sparkling-water-core_2.10 %
 0.2.10


 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

 object preDefKmerIntersection {
   def main(args: Array[String]) {

  val sparkConf = new SparkConf().setAppName(preDefKmer-intersect)
  val sc = new SparkContext(sparkConf)
 import sqlContext.createSchemaRDD
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val bedFile = sc.textFile(s3n://a/b/c,40)
  val hgfasta = sc.textFile(hdfs://a/b/c,40)
  val hgPair = 

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread EcoMotto Inc.
Hello Jeremy,

Sorry for the delayed reply!

The first issue was resolved; I believe it was just a production and consumption
rate problem.

Regarding the second question, I am streaming the data from a file with about
38k records. I am sending the streams in the same sequence as I read them from
the file, but I am getting different weights each time, maybe because of how
the DStreams are being processed.

Can you suggest a solution for this case? My requirement is that the program
must generate the same weights for both static and streaming data.
Thank you for your help!

Best Regards,
Arunkumar


On Thu, Mar 19, 2015 at 9:25 PM, Jeremy Freeman freeman.jer...@gmail.com
wrote:

 Regarding the first question, can you say more about how you are loading
 your data? And what is the size of the data set? And is that the only error
 you see, and do you only see it in the streaming version?

 For the second question, there are a couple reasons the weights might
 slightly differ, it depends on exactly how you set up the comparison. When
 you split it into 5, were those the same 5 chunks of data you used for the
 streaming case? And were they presented to the optimizer in the same order?
 Difference in either could produce small differences in the resulting
 weights, but that doesn’t mean it’s doing anything wrong.

 -
 jeremyfreeman.net
 @thefreemanlab

 On Mar 17, 2015, at 6:19 PM, EcoMotto Inc. ecomot...@gmail.com wrote:

 Hello Jeremy,

 Thank you for your reply.

 When I am running this code on the local machine on a streaming data, it
 keeps giving me this error:
 *WARN TaskSetManager: Lost task 2.0 in stage 211.0 (TID 4138, localhost):
 java.io.FileNotFoundException:
 /tmp/spark-local-20150316165742-9ac0/27/shuffle_102_2_0.data (No such file
 or directory) *

 And when I execute the same code on a static data after randomly splitting
 it into 5 sets, it gives me a little bit different weights (difference is
 in decimals). I am still trying to analyse why would this be happening.
 Any inputs, on why would this be happening?

 Best Regards,
 Arunkumar


 On Tue, Mar 17, 2015 at 11:32 AM, Jeremy Freeman freeman.jer...@gmail.com
  wrote:

 Hi Arunkumar,

 That looks like it should work. Logically, it’s similar to the
 implementation used by StreamingLinearRegression and
 StreamingLogisticRegression, see this class:


 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala

 which exposes the kind of operation your describing (for any linear
 method).

 The nice thing about the gradient-based methods is that they can use
 existing MLLib optimization routines in this fairly direct way. Other
 methods (such as KMeans) require a bit more reengineering.

 — Jeremy

 -
 jeremyfreeman.net
 @thefreemanlab

 On Mar 16, 2015, at 6:19 PM, EcoMotto Inc. ecomot...@gmail.com wrote:

 Hello,

 I am new to spark streaming API.

 I wanted to ask if I can apply LBFGS (with LeastSquaresGradient) on
 streaming data? Currently I am using forecahRDD for parsing through DStream
 and I am generating a model based on each RDD. Am I doing anything
 logically wrong here?
 Thank you.

 Sample Code:

 val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
 var initialWeights = 
 Vectors.dense(Array.fill(numFeatures)(scala.util.Random.nextDouble()))
 var isFirst = true
 var model = new LinearRegressionModel(null, 1.0)

 parsedData.foreachRDD { rdd =>
   if (isFirst) {
     val weights = algorithm.optimize(rdd, initialWeights)
     val w = weights.toArray
     val intercept = w.head
     model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
     isFirst = false
   } else {
     var ab = ArrayBuffer[Double]()
     ab.insert(0, model.intercept)
     ab.appendAll(model.weights.toArray)
     print("Intercept = " + model.intercept + " :: modelWeights = " + model.weights)
     initialWeights = Vectors.dense(ab.toArray)
     print("Initial Weights: " + initialWeights)
     val weights = algorithm.optimize(rdd, initialWeights)
     val w = weights.toArray
     val intercept = w.head
     model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
   }
 }



 Best Regards,
 Arunkumar







Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
Until then you can try

sql("SET spark.sql.parquet.useDataSourceApi=false")
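
Spelled out (a sketch only; the bucket path is illustrative, and the same two
calls exist in both the Scala and Python APIs in 1.3):

sqlContext.sql("SET spark.sql.parquet.useDataSourceApi=false")
df.saveAsParquetFile("s3n://my-bucket/spark-testing/")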

On Wed, Mar 25, 2015 at 12:15 PM, Michael Armbrust mich...@databricks.com
wrote:

 This will be fixed in Spark 1.3.1:
 https://issues.apache.org/jira/browse/SPARK-6351

 and is fixed in master/branch-1.3 if you want to compile from source

 On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton stuart.lay...@gmail.com
 wrote:

 I'm trying to save a dataframe to s3 as a parquet file but I'm getting
 Wrong FS errors

  df.saveAsParquetFile(parquetFile)
 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called
 with curMem=82744, maxMem=278302556
 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 stored as
 values in memory (estimated size 45.6 KB, free 265.3 MB)
 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(7078) called
 with curMem=129389, maxMem=278302556
 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5_piece0
 stored as bytes in memory (estimated size 6.9 KB, free 265.3 MB)
 15/03/25 18:56:10 INFO storage.BlockManagerInfo: Added broadcast_5_piece0
 in memory on ip-172-31-1-219.ec2.internal:58280 (size: 6.9 KB, free: 265.4
 MB)
 15/03/25 18:56:10 INFO storage.BlockManagerMaster: Updated info of block
 broadcast_5_piece0
 15/03/25 18:56:10 INFO spark.SparkContext: Created broadcast 5 from
 textFile at JSONRelation.scala:98
 Traceback (most recent call last):
   File stdin, line 1, in module
   File /root/spark/python/pyspark/sql/dataframe.py, line 121, in
 saveAsParquetFile
 self._jdf.saveAsParquetFile(path)
   File
 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line
 538, in __call__
   File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py,
 line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling
 o22.saveAsParquetFile.
 : java.lang.IllegalArgumentException: Wrong FS:
 s3n://com.my.bucket/spark-testing/, expected: hdfs://
 ec2-52-0-159-113.compute-1.amazonaws.com:9000


 Is it possible to save a dataframe to s3 directly using parquet?

 --
 Stuart Layton





Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Marcelo Vanzin
That probably means there are not enough free resources in your cluster
to run the AM for the Spark job. Check your RM's web UI to see the
resources you have available.

On Wed, Mar 25, 2015 at 12:08 PM, Khandeshi, Ami
ami.khande...@fmr.com.invalid wrote:
 I am seeing the same behavior.  I have enough resources…..  How do I resolve
 it?



 Thanks,



 Ami



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
My example is a totally reasonable way to do it; it just requires
constructing strings.

In many cases you can also do it with column objects

df[df.name == "test"].collect()

Out[15]: [Row(name=u'test')]


You should check out:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
and
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

On Wed, Mar 25, 2015 at 11:39 AM, Stuart Layton stuart.lay...@gmail.com
wrote:

 Thanks for the response, I was using IN as an example of the type of
 operation I need to do. Is there another way to do this that lines up more
 naturally with the way things are supposed to be done in SparkSQL?

 On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust mich...@databricks.com
 wrote:

 The only way to do it using Python currently is to use the string-based
 filter API (where you pass us an expression as a string, and we parse it
 using our SQL parser).

 from pyspark.sql import Row
 from pyspark.sql.functions import *

 df = sc.parallelize([Row(name="test")]).toDF()
 df.filter("name in ('a', 'b')").collect()
 Out[1]: []

 df.filter("name in ('test')").collect()
 Out[2]: [Row(name=u'test')]

 In general you want to avoid lambda functions whenever you can do the
 same thing with a dataframe expression.  This is because your lambda function is
 a black box that we cannot optimize (though you should certainly use them
 for the advanced stuff that expressions can't handle).

 I opened SPARK-6536 https://issues.apache.org/jira/browse/SPARK-6536 to
 provide a nicer interface for this.


 On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton stuart.lay...@gmail.com
 wrote:

 I have a SparkSQL dataframe with a a few billion rows that I need to
 quickly filter down to a few hundred thousand rows, using an operation like
 (syntax may not be correct)

 df = df[ df.filter(lambda x: x.key_col in approved_keys)]

 I was thinking about serializing the data using parquet and saving it to
 S3, however as I want to optimize for filtering speed I'm not sure this is
 the best option.

 --
 Stuart Layton





 --
 Stuart Layton



Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
Hi,

Thanks for your response.  I am not clear about why the query is ambiguous.

val both = df_2.join(df_1, df_2("country")===df_1("country"), "left_outer")

I thought df_2("country")===df_1("country") indicates that the "country"
field in the 2 dataframes should match, and that df_2("country") is the
equivalent of df_2.country in SQL, while df_1("country") is the equivalent
of df_1.country in SQL. So I am not sure why it is ambiguous. In Spark
1.2.0 I have used the same logic using SparkSQL and tables (e.g. WHERE
tab1.country = tab2.country) and had no problems getting the correct
result.

thanks





On Wed, Mar 25, 2015 at 11:05 AM, Michael Armbrust mich...@databricks.com
wrote:

 Unfortunately you are now hitting a bug (that is fixed in master and will
 be released in 1.3.1 hopefully next week).  However, even with that your
 query is still ambiguous and you will need to use aliases:

 val df_1 = df.filter( df("event") === 0)
   .select("country", "cnt").as("a")
 val df_2 = df.filter( df("event") === 3)
   .select("country", "cnt").as("b")
 val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")



 On Tue, Mar 24, 2015 at 11:57 PM, S Krishna skrishna...@gmail.com wrote:

 Hi,

 Thanks for your response. I modified my code as per your suggestion, but
 now I am getting a runtime error. Here's my code:

 val df_1 = df.filter( df(event) === 0)
   . select(country, cnt)

 val df_2 = df.filter( df(event) === 3)
   . select(country, cnt)

 df_1.show()
 //produces the following output :
 // countrycnt
 //   tw   3000
 //   uk   2000
 //   us   1000

 df_2.show()
 //produces the following output :
 // countrycnt
 //   tw   25
 //   uk   200
 //   us   95

 val both = df_2.join(df_1, df_2(country)===df_1(country),
 left_outer)

 I am getting the following error when executing the join statement:

 java.util.NoSuchElementException: next on empty iterator.

 This error seems to be originating at DataFrame.join (line 133 in
 DataFrame.scala).

 The show() results show that both dataframes do have columns named
 country and that they are non-empty. I also tried the simpler join ( i.e.
 df_2.join(df_1) ) and got the same error stated above.

 I would like to know what is wrong with the join statement above.

 thanks

 On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust mich...@databricks.com
  wrote:

 You need to use `===`, so that you are constructing a column expression
 instead of evaluating the standard scala equality method.  Calling methods
 to access columns (i.e. df.county is only supported in python).

 val join_df =  df1.join( df2, df1(country) === df2(country),
 left_outer)

 On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote:

 Hi,

 I am trying to port some code that was working in Spark 1.2.0 on the
 latest
 version, Spark 1.3.0. This code involves a left outer join between two
 SchemaRDDs which I am now trying to change to a left outer join between
 2
 DataFrames. I followed the example  for left outer join of DataFrame at

 https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

 Here's my code, where df1 and df2 are the 2 dataframes I am joining on
 the
 country field:

  val join_df =  df1.join( df2,  df1.country == df2.country,
 left_outer)

 But I got a compilation error that value  country is not a member of
 sql.DataFrame

 I  also tried the following:
  val join_df =  df1.join( df2, df1(country) == df2(country),
 left_outer)

 I got a compilation error that it is a Boolean whereas a Column is
 required.

 So what is the correct Column expression I need to provide for joining
 the 2
 dataframes on a specific field ?

 thanks








 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: python : Out of memory: Kill process

2015-03-25 Thread Eduardo Cusa
Hi Davies, I am running 1.1.0.

Now I'm following this thread that recommends using the batchsize parameter = 1


http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html

If this does not work I will install 1.2.1 or 1.3

Regards






On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu dav...@databricks.com wrote:

 What's the version of Spark you are running?

 There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3,

 [1] https://issues.apache.org/jira/browse/SPARK-6055

 On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
 eduardo.c...@usmediaconsulting.com wrote:
  Hi Guys, I am running the following function with spark-submit and the OS is
  killing my process :
 
 
def getRdd(self,date,provider):
  path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz'
  log2= self.sqlContext.jsonFile(path)
  log2.registerTempTable('log_test')
  log2.cache()

 You only visit the table once, cache does not help here.

  out=self.sqlContext.sql("SELECT user, tax from log_test where provider =
  '" + provider + "' and country <> ''").map(lambda row: (row.user, row.tax))
  print out1
  return  map((lambda (x,y): (x, list(y))),
  sorted(out.groupByKey(2000).collect()))

 100 partitions (or less) will be enough for 2G dataset.

 
 
  The input dataset has 57 zip files (2 GB)
 
  The same process with a smaller dataset completed successfully
 
  Any ideas to debug is welcome.
 
  Regards
  Eduardo
 
 



Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Khandeshi, Ami
I am seeing the same behavior.  I have enough resources.  How do I resolve 
it?

Thanks,

Ami


Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
This will be fixed in Spark 1.3.1:
https://issues.apache.org/jira/browse/SPARK-6351

and is fixed in master/branch-1.3 if you want to compile from source

On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton stuart.lay...@gmail.com
wrote:

 I'm trying to save a dataframe to s3 as a parquet file but I'm getting
 Wrong FS errors

  df.saveAsParquetFile(parquetFile)
 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called
 with curMem=82744, maxMem=278302556
 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 stored as
 values in memory (estimated size 45.6 KB, free 265.3 MB)
 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(7078) called
 with curMem=129389, maxMem=278302556
 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5_piece0
 stored as bytes in memory (estimated size 6.9 KB, free 265.3 MB)
 15/03/25 18:56:10 INFO storage.BlockManagerInfo: Added broadcast_5_piece0
 in memory on ip-172-31-1-219.ec2.internal:58280 (size: 6.9 KB, free: 265.4
 MB)
 15/03/25 18:56:10 INFO storage.BlockManagerMaster: Updated info of block
 broadcast_5_piece0
 15/03/25 18:56:10 INFO spark.SparkContext: Created broadcast 5 from
 textFile at JSONRelation.scala:98
 Traceback (most recent call last):
   File stdin, line 1, in module
   File /root/spark/python/pyspark/sql/dataframe.py, line 121, in
 saveAsParquetFile
 self._jdf.saveAsParquetFile(path)
   File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
 line 538, in __call__
   File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py,
 line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling
 o22.saveAsParquetFile.
 : java.lang.IllegalArgumentException: Wrong FS:
 s3n://com.my.bucket/spark-testing/, expected: hdfs://
 ec2-52-0-159-113.compute-1.amazonaws.com:9000


 Is it possible to save a dataframe to s3 directly using parquet?

 --
 Stuart Layton



writing DStream RDDs to the same file

2015-03-25 Thread Adrian Mocanu
Hi
Is there a way to write all RDDs in a DStream to the same file?
I tried this and got an empty file. I think it's because the file is not closed, i.e.
ESMinibatchFunctions.writer.close() executes before the stream is created.

Here's my code
  myStream.foreachRDD(rdd => {
    rdd.foreach(x => {
      ESMinibatchFunctions.writer.append(rdd.collect()(0).toString() + " the data ")})
    //localRdd = localRdd.union(rdd)
    //localArray = localArray ++ rdd.collect()
  } )

ESMinibatchFunctions.writer.close()

object ESMinibatchFunctions {
  val writer = new PrintWriter("c:/delme/exxx.txt")
}
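
For comparison, a minimal sketch of one alternative (assumptions: the batches are
small enough to collect, ssc is the StreamingContext, and all appends happen on the
driver so they go through the one PrintWriter instance created there):

myStream.foreachRDD { rdd =>
  // collect() brings the batch to the driver; calls made inside rdd.foreach run
  // in the executor JVMs when on a cluster, where the writer is a separate instance
  rdd.collect().foreach(x => ESMinibatchFunctions.writer.println(x.toString))
  ESMinibatchFunctions.writer.flush()   // keep the file current without closing it
}
ssc.start()
ssc.awaitTermination()
ESMinibatchFunctions.writer.close()     // close only after streaming has stopped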


Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
What's the version of Spark you are running?

There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3,

[1] https://issues.apache.org/jira/browse/SPARK-6055

On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
eduardo.c...@usmediaconsulting.com wrote:
 Hi Guys, I am running the following function with spark-submit and the OS is
 killing my process :


   def getRdd(self,date,provider):
 path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz'
 log2= self.sqlContext.jsonFile(path)
 log2.registerTempTable('log_test')
 log2.cache()

You only visit the table once, cache does not help here.

 out=self.sqlContext.sql("SELECT user, tax from log_test where provider =
 '" + provider + "' and country <> ''").map(lambda row: (row.user, row.tax))
 print out1
 return  map((lambda (x,y): (x, list(y))),
 sorted(out.groupByKey(2000).collect()))

100 partitions (or less) will be enough for 2G dataset.



 The input dataset has 57 zip files (2 GB)

 The same process with a smaller dataset completed successfully

 Any ideas to debug is welcome.

 Regards
 Eduardo



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
Thanks for the response, I was using IN as an example of the type of
operation I need to do. Is there another way to do this that lines up more
naturally with the way things are supposed to be done in SparkSQL?

On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust mich...@databricks.com
wrote:

 The only way to do it using Python currently is to use the string-based
 filter API (where you pass us an expression as a string, and we parse it
 using our SQL parser).

 from pyspark.sql import Row
 from pyspark.sql.functions import *

 df = sc.parallelize([Row(name="test")]).toDF()
 df.filter("name in ('a', 'b')").collect()
 Out[1]: []

 df.filter("name in ('test')").collect()
 Out[2]: [Row(name=u'test')]

 In general you want to avoid lambda functions whenever you can do the same
 thing with a dataframe expression.  This is because your lambda function is a
 black box that we cannot optimize (though you should certainly use them for
 the advanced stuff that expressions can't handle).

 I opened SPARK-6536 https://issues.apache.org/jira/browse/SPARK-6536 to
 provide a nicer interface for this.


 On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton stuart.lay...@gmail.com
 wrote:

 I have a SparkSQL dataframe with a a few billion rows that I need to
 quickly filter down to a few hundred thousand rows, using an operation like
 (syntax may not be correct)

 df = df[ df.filter(lambda x: x.key_col in approved_keys)]

 I was thinking about serializing the data using parquet and saving it to
 S3, however as I want to optimize for filtering speed I'm not sure this is
 the best option.

 --
 Stuart Layton





-- 
Stuart Layton


Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-25 Thread Matt Silvey
This is a different kind of error.  Thomas' OOM error was specific to the
kernel refusing to create another thread/process for his application.

Matthew

On Wed, Mar 25, 2015 at 10:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I have a YARN cluster where the max memory allowed is 16GB. I set 12G for
 my driver, however I see an OutOfMemory error even for this program
 http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables
 . What do you suggest ?

 On Wed, Mar 25, 2015 at 8:23 AM, Thomas Gerber thomas.ger...@radius.com
 wrote:

 So,

 1. I reduced my  -XX:ThreadStackSize to 5m (instead of 10m - default is
 1m), which is still OK for my need.
 2. I reduced the executor memory to 44GB for a 60GB machine (instead of
 49GB).

 This seems to have helped. Thanks to Matthew and Sean.

 Thomas

 On Tue, Mar 24, 2015 at 3:49 PM, Matt Silvey matt.sil...@videoamp.com
 wrote:

 My memory is hazy on this but aren't there hidden limitations to
 Linux-based threads?  I ran into some issues a couple of years ago where,
 and here is the fuzzy part, the kernel wants to reserve virtual memory per
 thread equal to the stack size.  When the total amount of reserved memory
 (not necessarily resident memory) exceeds the memory of the system it
 throws an OOM.  I'm looking for material to back this up.  Sorry for the
 initial vague response.

 Matthew

 On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber 
 thomas.ger...@radius.com wrote:

 Additional notes:
 I did not find anything wrong with the number of threads (ps -u USER -L
 | wc -l): around 780 on the master and 400 on executors. I am running on
 100 r3.2xlarge.

 On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber 
 thomas.ger...@radius.com wrote:

 Hello,

 I am seeing various crashes in spark on large jobs which all share a
 similar exception:

 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:714)

 I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help.

 Does anyone know how to avoid those kinds of errors?

 Noteworthy: I added -XX:ThreadStackSize=10m on both driver and
 executor extra java options, which might have amplified the problem.

 Thanks for you help,
 Thomas







 --
 Deepak




Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Stuart Layton
I'm trying to save a dataframe to s3 as a parquet file but I'm getting
Wrong FS errors

 df.saveAsParquetFile(parquetFile)
15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called
with curMem=82744, maxMem=278302556
15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 stored as
values in memory (estimated size 45.6 KB, free 265.3 MB)
15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(7078) called
with curMem=129389, maxMem=278302556
15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5_piece0 stored
as bytes in memory (estimated size 6.9 KB, free 265.3 MB)
15/03/25 18:56:10 INFO storage.BlockManagerInfo: Added broadcast_5_piece0
in memory on ip-172-31-1-219.ec2.internal:58280 (size: 6.9 KB, free: 265.4
MB)
15/03/25 18:56:10 INFO storage.BlockManagerMaster: Updated info of block
broadcast_5_piece0
15/03/25 18:56:10 INFO spark.SparkContext: Created broadcast 5 from
textFile at JSONRelation.scala:98
Traceback (most recent call last):
  File stdin, line 1, in module
  File /root/spark/python/pyspark/sql/dataframe.py, line 121, in
saveAsParquetFile
self._jdf.saveAsParquetFile(path)
  File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
line 538, in __call__
  File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
o22.saveAsParquetFile.
: java.lang.IllegalArgumentException: Wrong FS:
s3n://com.my.bucket/spark-testing/, expected: hdfs://
ec2-52-0-159-113.compute-1.amazonaws.com:9000


Is it possible to save a dataframe to s3 directly using parquet?

-- 
Stuart Layton


Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
With batchSize = 1, I think it will become even worse.

I'd suggest going with 1.3 and having a taste of the new DataFrame API.

On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa
eduardo.c...@usmediaconsulting.com wrote:
 Hi Davies, I am running 1.1.0.

 Now I'm following this thread that recommends using the batchsize parameter = 1


 http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html

 If this does not work I will install 1.2.1 or 1.3

 Regards






 On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu dav...@databricks.com wrote:

 What's the version of Spark you are running?

 There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3,

 [1] https://issues.apache.org/jira/browse/SPARK-6055

 On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
 eduardo.c...@usmediaconsulting.com wrote:
  Hi Guys, I am running the following function with spark-submit and the OS
  is
  killing my process :
 
 
def getRdd(self,date,provider):
  path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz'
  log2= self.sqlContext.jsonFile(path)
  log2.registerTempTable('log_test')
  log2.cache()

 You only visit the table once, cache does not help here.

  out=self.sqlContext.sql(SELECT user, tax from log_test where
  provider =
  '+provider+'and country  '').map(lambda row: (row.user, row.tax))
  print out1
  return  map((lambda (x,y): (x, list(y))),
  sorted(out.groupByKey(2000).collect()))

 100 partitions (or less) will be enough for 2G dataset.

 
 
  The input dataset has 57 zip files (2 GB)
 
  The same process with a smaller dataset completed successfully
 
  Any ideas to debug is welcome.
 
  Regards
  Eduardo
 
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
That's a good question.  In this particular example, it is really only
internal implementation details that make it ambiguous.  However, fixing
this was a very large change so we have deferred it to Spark 1.4 and instead
print a warning now when you construct trivially equal expressions.  I can
try to explain a little bit about why solving this in general is (mostly)
impossible.  Consider the following:

val df = sqlContext.load(...)

val df1 = df
val df2 = df

df1.join(df2, df1("a") === df2("a"))

Compared with

SELECT * FROM df df1 JOIN df df2 WHERE df1.a = df2.a

In the first example, the assignment of df to df1 and df2 is completely
transparent to the Catalyst optimizer as it is happening in Scala code.
This means that df1("a") and df2("a") are completely indistinguishable to
us (at least without crazy macro magic).  In contrast, the aliasing is
visible to the optimizer when we are doing it in SQL instead of Scala, and thus
we can differentiate.

In your case you are doing transformations, and we could assign new unique
ids each time a transformation is done.  However, we don't do this today,
and it's a pretty big change.  There is a JIRA for this: SPARK-6231
https://issues.apache.org/jira/browse/SPARK-6231
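
For illustration only (my sketch, not from this thread), one more way to avoid the
ambiguity today is to rename the join column on one side first (this assumes
DataFrame.withColumnRenamed in 1.3), so the two references are trivially distinguishable:

val df_1r = df_1.withColumnRenamed("country", "country_1")
val both = df_2.join(df_1r, df_2("country") === df_1r("country_1"), "left_outer")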

On Wed, Mar 25, 2015 at 11:47 AM, S Krishna skrishna...@gmail.com wrote:

 Hi,

 Thanks for your response.  I am not clear about why the query is ambiguous.

 val both = df_2.join(df_1, df_2("country")===df_1("country"),
 "left_outer")

 I thought df_2("country")===df_1("country") indicates that the "country"
 field in the 2 dataframes should match, and that df_2("country") is the
 equivalent of df_2.country in SQL, while df_1("country") is the
 equivalent of df_1.country in SQL. So I am not sure why it is ambiguous. In
 Spark 1.2.0 I have used the same logic using SparkSQL and tables (e.g.
  WHERE tab1.country = tab2.country) and had no problems getting the
 correct result.

 thanks





 On Wed, Mar 25, 2015 at 11:05 AM, Michael Armbrust mich...@databricks.com
  wrote:

 Unfortunately you are now hitting a bug (that is fixed in master and will
 be released in 1.3.1 hopefully next week).  However, even with that your
 query is still ambiguous and you will need to use aliases:

 val df_1 = df.filter( df("event") === 0)
   .select("country", "cnt").as("a")
 val df_2 = df.filter( df("event") === 3)
   .select("country", "cnt").as("b")
 val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")



 On Tue, Mar 24, 2015 at 11:57 PM, S Krishna skrishna...@gmail.com
 wrote:

 Hi,

 Thanks for your response. I modified my code as per your suggestion, but
 now I am getting a runtime error. Here's my code:

 val df_1 = df.filter( df(event) === 0)
   . select(country, cnt)

 val df_2 = df.filter( df(event) === 3)
   . select(country, cnt)

 df_1.show()
 //produces the following output :
 // countrycnt
 //   tw   3000
 //   uk   2000
 //   us   1000

 df_2.show()
 //produces the following output :
 // countrycnt
 //   tw   25
 //   uk   200
 //   us   95

 val both = df_2.join(df_1, df_2(country)===df_1(country),
 left_outer)

 I am getting the following error when executing the join statement:

 java.util.NoSuchElementException: next on empty iterator.

 This error seems to be originating at DataFrame.join (line 133 in
 DataFrame.scala).

 The show() results show that both dataframes do have columns named
 country and that they are non-empty. I also tried the simpler join ( i.e.
 df_2.join(df_1) ) and got the same error stated above.

 I would like to know what is wrong with the join statement above.

 thanks

 On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust 
 mich...@databricks.com wrote:

 You need to use `===`, so that you are constructing a column expression
 instead of evaluating the standard scala equality method.  Calling methods
 to access columns (i.e. df.county is only supported in python).

 val join_df =  df1.join( df2, df1(country) === df2(country),
 left_outer)

 On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote:

 Hi,

 I am trying to port some code that was working in Spark 1.2.0 on the
 latest
 version, Spark 1.3.0. This code involves a left outer join between two
 SchemaRDDs which I am now trying to change to a left outer join
 between 2
 DataFrames. I followed the example  for left outer join of DataFrame at

 https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

 Here's my code, where df1 and df2 are the 2 dataframes I am joining on
 the
 country field:

  val join_df =  df1.join( df2,  df1.country == df2.country,
 left_outer)

 But I got a compilation error that value  country is not a member of
 sql.DataFrame

 I  also tried the following:
  val join_df =  df1.join( df2, df1(country) == df2(country),
 left_outer)

 I got a compilation error that it is a Boolean whereas a Column is
 required.

 So what is the correct 

Re: newbie quesiton - spark with mesos

2015-03-25 Thread Dean Wampler
I think the problem is the use the loopback address:

export SPARK_LOCAL_IP=127.0.0.1

In the stack trace from the slave, you see this:

...  Reason: Connection refused: localhost/127.0.0.1:51849
akka.actor.ActorNotFound: Actor not found for:
ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/),
Path(/user/MapOutputTracker)]

It's trying to connect to an Akka actor on itself, using the loopback
address.

Try changing SPARK_LOCAL_IP to the publicly routable IP address.

dean


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Mon, Mar 23, 2015 at 7:37 PM, Anirudha Jadhav anirudh...@gmail.com
wrote:

 My bad there, I was using the correct link for docs. The spark shell runs
 correctly, the framework is registered fine on mesos.

 is there some setting i am missing:
 this is my spark-env.sh

 export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
 export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz
 export SPARK_LOCAL_IP=127.0.0.1



 here is what i see on the slave node.
 
 less
 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr
 

 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI '
 http://100.125.5.93/sparkx.tgz'
 I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading '
 http://100.125.5.93/sparkx.tgz' to
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
 I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
 into
 '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56'
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers
 for [TERM, HUP, INT]
 I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1
 I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave
 20150226-160708-78932-5050-8971-S0
 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as
 executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus
 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu
 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu
 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ubuntu); users
 with modify permissions: Set(ubuntu)
 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started
 15/03/24 02:30:37 INFO Remoting: Starting remoting
 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkexecu...@mesos-si2.dny1.bcpc.bloomberg.com:50542]
 15/03/24 02:30:38 INFO Utils: Successfully started service 'sparkExecutor'
 on port 50542.
 15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker:
 akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker
 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now
 gated for 5000 ms, all messages to this address will be delivered to dead
 letters. Reason: Connection refused: localhost/127.0.0.1:51849
 akka.actor.ActorNotFound: Actor not found for:
 ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/),
 Path(/user/MapOutputTracker)]
 at
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
 at
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
 scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 at
 akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
 at
 

Re: Date and decimal datatype not working

2015-03-25 Thread Dean Wampler
Recall that the input isn't actually read until you do something that forces
evaluation, like calling saveAsTextFile. You didn't show the whole stack trace
here, but it probably occurred while parsing an input line where one of
your long fields is actually an empty string.

Because this is such a common problem, I usually define a parse method
that converts the input text to the desired schema. It catches parse exceptions
like this and at least reports the bad line. If you can return a default
long in this case, say 0, that makes it easier to return something usable.
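
For example, a parse helper along these lines (a sketch only, reusing the ROW_A case
class and pipe-delimited layout from further down this thread) reports bad lines and
drops them:

def parse(line: String): Option[ROW_A] = {
  try {
    val p = line.split("\\|")   // "|" must be escaped, since split takes a regex
    Some(ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3),
      BigDecimal(p(4)), BigDecimal(p(5)), BigDecimal(p(6))))
  } catch {
    case e: Exception =>
      System.err.println(s"Skipping bad line: $line")   // or substitute defaults such as 0L
      None
  }
}

// flatMap drops the None entries, i.e. the unparseable lines
val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").flatMap(line => parse(line))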

dean



Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 11:48 AM, BASAK, ANANDA ab9...@att.com wrote:

  Thanks. This library is only available with Spark 1.3. I am using
 version 1.2.1. Before I upgrade to 1.3, I want to try what can be done in
 1.2.1.



 So I am using following:

 val MyDataset = sqlContext.sql("my select query")



 MyDataset.map(t =>
 t(0)+"|"+t(1)+"|"+t(2)+"|"+t(3)+"|"+t(4)+"|"+t(5)).saveAsTextFile("/my_destination_path")



 But it is giving following error:

 15/03/24 17:05:51 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID
 106)

 java.lang.NumberFormatException: For input string: ""

 at
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)

 at java.lang.Long.parseLong(Long.java:453)

 at java.lang.Long.parseLong(Long.java:483)

 at
 scala.collection.immutable.StringLike$class.toLong(StringLike.scala:230)



 Is there something wrong with the TSTAMP field, which is a Long datatype?



 Thanks  Regards

 ---

 Ananda Basak

 Ph: 425-213-7092



 *From:* Yin Huai [mailto:yh...@databricks.com]
 *Sent:* Monday, March 23, 2015 8:55 PM

 *To:* BASAK, ANANDA
 *Cc:* user@spark.apache.org
 *Subject:* Re: Date and decimal datatype not working



 To store to csv file, you can use Spark-CSV
 https://github.com/databricks/spark-csv library.



 On Mon, Mar 23, 2015 at 5:35 PM, BASAK, ANANDA ab9...@att.com wrote:

  Thanks. This worked well as per your suggestions. I had to run following:

 val TABLE_A =
 sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p
 => ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), BigDecimal(p(4)),
 BigDecimal(p(5)), BigDecimal(p(6))))



 Now I am stuck at another step. I have run a SQL query, where I am
 Selecting from all the fields with some where clause , TSTAMP filtered with
 date range and order by TSTAMP clause. That is running fine.



 Then I am trying to store the output in a CSV file. I am using
 saveAsTextFile(“filename”) function. But it is giving error. Can you please
 help me to write a proper syntax to store output in a CSV file?





 Thanks  Regards

 ---

 Ananda Basak

 Ph: 425-213-7092



 *From:* BASAK, ANANDA
 *Sent:* Tuesday, March 17, 2015 3:08 PM
 *To:* Yin Huai
 *Cc:* user@spark.apache.org
 *Subject:* RE: Date and decimal datatype not working



 Ok, thanks for the suggestions. Let me try and will confirm all.



 Regards

 Ananda



 *From:* Yin Huai [mailto:yh...@databricks.com]
 *Sent:* Tuesday, March 17, 2015 3:04 PM
 *To:* BASAK, ANANDA
 *Cc:* user@spark.apache.org
 *Subject:* Re: Date and decimal datatype not working



 p(0) is a String. So, you need to explicitly convert it to a Long, e.g.
 p(0).trim.toLong. You also need to do it for p(2). For the BigDecimal
 values, you need to create BigDecimal objects from your String values.



 On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA ab9...@att.com wrote:

   Hi All,

 I am very new in Spark world. Just started some test coding from last
 week. I am using spark-1.2.1-bin-hadoop2.4 and scala coding.

 I am having issues while using Date and decimal data types. Following is
 my code that I am simply running on scala prompt. I am trying to define a
 table and point that to my flat file containing raw data (pipe delimited
 format). Once that is done, I will run some SQL queries and put the output
 data in to another flat file with pipe delimited format.



 ***

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 import sqlContext.createSchemaRDD





 // Define row and table

 case class ROW_A(

   TSTAMP:   Long,

   USIDAN: String,

   SECNT:Int,

   SECT:   String,

   BLOCK_NUM:BigDecimal,

   BLOCK_DEN:BigDecimal,

   BLOCK_PCT:BigDecimal)



 val TABLE_A =
 sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p
 => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))



 TABLE_A.registerTempTable("TABLE_A")



 ***



 The second-to-last command is giving an error, like the following:

 <console>:17: error: type mismatch;

  found   : String

 

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Yes, that's the problem. The RDD class exists in both binary jar files, but
the signatures probably don't match. The bottom line, as always for tools
like this, is that you can't mix versions.
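
For example, a small SBT sketch of that rule (an assumption on my part, not Roni's
actual build file): depend on the version the cluster actually runs and mark Spark
as provided, so the cluster's own jars are used at runtime:

// match the cluster's Spark version until the cluster itself is upgraded
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.2.1" % "provided"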

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 3:13 PM, roni roni.epi...@gmail.com wrote:

 My cluster is still on spark 1.2 and in SBT I am using 1.3.
 So probably it is compiling with 1.3 but running with 1.2 ?

 On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 Weird. Are you running using SBT console? It should have the spark-core
 jar on the classpath. Similarly, spark-shell or spark-submit should work,
 but be sure you're using the same version of Spark when running as when
 compiling. Also, you might need to add spark-sql to your SBT dependencies,
 but that shouldn't be this issue.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote:

 Thanks Dean and Nick.
 So, I removed the ADAM and H2o from my SBT as I was not using them.
 I got the code to compile - only for it to fail while running with -
 SparkContext: Created broadcast 1 from textFile at
 kmerIntersetion.scala:21
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/rdd/RDD$
 at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

 This line is where I do a JOIN operation.
 val hgPair = hgfasta.map(_.split(",")).map(a => (a(0),
 a(1).trim().toInt))
  val filtered = hgPair.filter(kv => kv._2 == 1)
  val bedPair = bedFile.map(_.split(",")).map(a => (a(0),
 a(1).trim().toInt))
 * val joinRDD = bedPair.join(filtered)   *
 Any idea what's going on?
 I have data on the EC2 so I am avoiding creating the new cluster , but
 just upgrading and changing the code to use 1.3 and Spark SQL
 Thanks
 Roni



 On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com
 wrote:

 For the Spark SQL parts, 1.3 breaks backwards compatibility, because
 before 1.3, Spark SQL was considered experimental where API changes were
 allowed.

 So, H2O and ADA compatible with 1.2.X might not work with 1.3.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote:

 Even if H2o and ADA are dependent on 1.2.1, it should be backward
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
 I tried sbt clean compile .. same errror
 Thanks
 _R

 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 What version of Spark do the other dependencies rely on (Adam and
 H2O?) - that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

 But then I started getting compilation errors, along with:
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) ->
 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty =
 joinRDD.map{case(word, (file1Counts, file2Counts)) => KmerIntesect(word,
 file1Counts,"xyz")}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := "1.0"

 scalaVersion := "2.10.4"

 resolvers += "Sonatype OSS Snapshots" at
 "https://oss.sonatype.org/content/repositories/snapshots"
 resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2"
 resolvers += "Maven Repo" at
 "https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/"

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += "org.apache.hadoop" % "hadoop-client" %
 "2.6.0"
 libraryDependencies += "org.apache.spark" %% "spark-core" % 

Exception Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration

2015-03-25 Thread varvind
Hi, I am running spark in mesos and getting this error. Can anyone help me
resolve this? Thanks

15/03/25 21:05:00 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an
exception
java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at
org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203)
at
org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203)
at scala.Option.foreach(Option.scala:236)   at
org.apache.spark.util.FileLogger.flush(FileLogger.scala:203)at
org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:90)
at
org.apache.spark.scheduler.EventLoggingListener.onUnpersistRDD(EventLoggingListener.scala:121)
at
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66)
at
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66)
at
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:83)
at
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)   at
org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:81)
at
org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:66)
at
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
at scala.Option.foreach(Option.scala:236)   at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460) at
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)Caused
by: java.io.IOException: Failed to add a datanode.  User may turn off this
feature by setting dfs.client.block.write.replace-datanode-on-failure.policy
in configuration, where the current policy is DEFAULT.  (Nodes:
current=[10.250.100.81:50010, 10.250.100.82:50010],
original=[10.250.100.81:50010, 10.250.100.82:50010])at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:792)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:852)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:958)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:755)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:424)15/03/25
21:05:00 INFO scheduler.ReceivedBlockTracker: Deleting batches
ArrayBuffer()15/03/25 21:05:00 INFO scheduler.ReceivedBlockTracker: Deleting
batches ArrayBuffer(142731690 ms)15/03/25 21:05:00 INFO
scheduler.JobGenerator: Checkpointing graph for time 142731750
ms15/03/25 21:05:00 INFO streaming.DStreamGraph: Updating checkpoint data
for time 142731750 ms15/03/25 21:05:00 INFO streaming.DStreamGraph:
Updated checkpoint data for time 142731750 ms



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Exception-Failed-to-add-a-datanode-User-may-turn-off-this-feature-by-setting-dfs-client-block-write-n-tp22231.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

trouble with jdbc df in python

2015-03-25 Thread elliott cordo
if i run the following:

db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx",
dbtables="mstr.d_customer")

i get the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o28.load.

: java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does
not exist

It seems to think I'm trying to load a file called "jdbc"?  I'm setting the
postgres driver in the SPARK_CLASSPATH as well (that doesn't seem to be the
problem).


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
My cluster is still on spark 1.2 and in SBT I am using 1.3.
So probably it is compiling with 1.3 but running with 1.2 ?

On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com
wrote:

 Weird. Are you running using SBT console? It should have the spark-core
 jar on the classpath. Similarly, spark-shell or spark-submit should work,
 but be sure you're using the same version of Spark when running as when
 compiling. Also, you might need to add spark-sql to your SBT dependencies,
 but that shouldn't be this issue.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote:

 Thanks Dean and Nick.
 So, I removed the ADAM and H2o from my SBT as I was not using them.
 I got the code to compile - only for it to fail while running with -
 SparkContext: Created broadcast 1 from textFile at
 kmerIntersetion.scala:21
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/rdd/RDD$
 at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

 This line is where I do a JOIN operation.
 val hgPair = hgfasta.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
  val filtered = hgPair.filter(kv => kv._2 == 1)
  val bedPair = bedFile.map(_.split(",")).map(a => (a(0),
 a(1).trim().toInt))
 * val joinRDD = bedPair.join(filtered)   *
 Any idea what's going on?
 I have data on the EC2 so I am avoiding creating the new cluster , but
 just upgrading and changing the code to use 1.3 and Spark SQL
 Thanks
 Roni



 On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com
 wrote:

 For the Spark SQL parts, 1.3 breaks backwards compatibility, because
 before 1.3, Spark SQL was considered experimental where API changes were
 allowed.

 So, H2O and ADA compatible with 1.2.X might not work with 1.3.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote:

 Even if H2o and ADA are dependent on 1.2.1, it should be backward
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
 I tried sbt clean compile .. same errror
 Thanks
 _R

 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 What version of Spark do the other dependencies rely on (Adam and
 H2O?) - that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

 But then I started getting compilation errors, along with:
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) ->
 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty =
 joinRDD.map{case(word, (file1Counts, file2Counts)) => KmerIntesect(word,
 file1Counts,"xyz")}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := "1.0"

 scalaVersion := "2.10.4"

 resolvers += "Sonatype OSS Snapshots" at
 "https://oss.sonatype.org/content/repositories/snapshots"
 resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2"
 resolvers += "Maven Repo" at
 "https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/"

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
 libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
 libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" %
 "0.2.10"


 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

 object preDefKmerIntersection {
   def main(args: Array[String]) {

  val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
  val sc = new SparkContext(sparkConf)
 import 

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread DB Tsai
Hi Arunkumar,

I think L-BFGS will not work, since the L-BFGS algorithm assumes that the
objective function is always the same (i.e., the data is the
same) for the entire optimization process in order to construct the approximated
Hessian matrix. In the streaming case, the data will be changing, so
it will cause problems for the algorithm.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Mon, Mar 16, 2015 at 3:19 PM, EcoMotto Inc. ecomot...@gmail.com wrote:
 Hello,

 I am new to spark streaming API.

 I wanted to ask if I can apply LBFGS (with LeastSquaresGradient) on
 streaming data? Currently I am using foreachRDD for parsing through the DStream
 and I am generating a model based on each RDD. Am I doing anything logically
 wrong here?
 Thank you.

 Sample Code:

 val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
 var initialWeights =
   Vectors.dense(Array.fill(numFeatures)(scala.util.Random.nextDouble()))
 var isFirst = true
 var model = new LinearRegressionModel(null, 1.0)

 parsedData.foreachRDD{ rdd =>
   if(isFirst) {
     val weights = algorithm.optimize(rdd, initialWeights)
     val w = weights.toArray
     val intercept = w.head
     model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
     isFirst = false
   } else {
     var ab = ArrayBuffer[Double]()
     ab.insert(0, model.intercept)
     ab.appendAll(model.weights.toArray)
     print("Intercept = " + model.intercept + " :: modelWeights = " + model.weights)
     initialWeights = Vectors.dense(ab.toArray)
     print("Initial Weights: " + initialWeights)
     val weights = algorithm.optimize(rdd, initialWeights)
     val w = weights.toArray
     val intercept = w.head
     model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
   }
 }



 Best Regards,
 Arunkumar

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: trouble with jdbc df in python

2015-03-25 Thread Michael Armbrust
Try:

db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx",
dbtables="mstr.d_customer")


On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo elliottco...@gmail.com
wrote:

 if i run the following:

 db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx",
 dbtables="mstr.d_customer")

 i get the error:

 py4j.protocol.Py4JJavaError: An error occurred while calling o28.load.

 : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does
 not exist

 Seems to think i'm trying to load a file called jdbc?  i'm setting the
 postgres driver in the SPARK_CLASSPATH as well (that doesn't seem to be the
 problem)..



Re: How to specify the port for AM Actor ...

2015-03-25 Thread Shixiong Zhu
There is no configuration for it now.

Best Regards,
Shixiong Zhu

2015-03-26 7:13 GMT+08:00 Manoj Samel manojsamelt...@gmail.com:

 There may be firewall rules limiting the ports between the host running Spark
 and the Hadoop cluster. In that case, not all ports are allowed.

 Can it be a range of ports that can be specified ?

 On Wed, Mar 25, 2015 at 4:06 PM, Shixiong Zhu zsxw...@gmail.com wrote:

 It's a random port to avoid port conflicts, since multiple AMs can run in
 the same machine. Why do you need a fixed port?

 Best Regards,
 Shixiong Zhu

 2015-03-26 6:49 GMT+08:00 Manoj Samel manojsamelt...@gmail.com:

 Spark 1.3, Hadoop 2.5, Kerbeors

 When running spark-shell in yarn client mode, it shows the following message
 with a random port every time (44071 in the example below). Is there a way to
 set that port to a specific port? It does not seem to be part of the ports
 specified in http://spark.apache.org/docs/latest/configuration.html
 (spark.xxx.port ...)

 Thanks,

 15/03/25 22:27:10 INFO Client: Application report for
 application_1427316153428_0014 (state: ACCEPTED)
 15/03/25 22:27:10 INFO YarnClientSchedulerBackend: ApplicationMaster
 registered as Actor[akka.tcp://sparkYarnAM@xyz
 :44071/user/YarnAM#-1989273896]






Re: trouble with jdbc df in python

2015-03-25 Thread Michael Armbrust
Thanks for following up.  I'll fix the docs.

On Wed, Mar 25, 2015 at 4:04 PM, elliott cordo elliottco...@gmail.com
wrote:

 Thanks!.. the below worked:

 db = sqlCtx.load(source="jdbc",
 url="jdbc:postgresql://localhost/x?user=xpassword=x", dbtable="mstr.d_customer")

 Note that
 https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations
 needs to be updated:

 [image: Inline image 1]

 On Wed, Mar 25, 2015 at 6:12 PM, Michael Armbrust mich...@databricks.com
 wrote:

 Try:

 db = sqlContext.load(source=jdbc, url=jdbc:postgresql://localhost/xx,
 dbtables=mstr.d_customer)


 On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo elliottco...@gmail.com
 wrote:

 if i run the following:

 db = sqlContext.load(jdbc, url=jdbc:postgresql://localhost/xx,
 dbtables=mstr.d_customer)

 i get the error:

 py4j.protocol.Py4JJavaError: An error occurred while calling o28.load.

 : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does
 not exist

 Seems to think i'm trying to load a file called jdbc?  i'm setting the
 postgres driver in the SPARK_CLASSPATH as well (that doesn't seem to be the
 problem)..






Re: filter expression in API document for DataFrame

2015-03-25 Thread Michael Armbrust
Yeah sorry, this is already fixed but we need to republish the docs.  I'll
note that both of the following do work:

people.filter("age > 30")
people.filter(people("age") > 30)
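
For the column-equality question raised in the quoted mail below, the same
pattern applies (a quick illustration, not part of the original reply):

people.filter(people("age") === 30)   // === builds a Column predicate, not a Boolean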



On Tue, Mar 24, 2015 at 7:11 PM, SK skrishna...@gmail.com wrote:



 The following statement appears in the Scala API example at

 https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

 people.filter("age" > 30).

 I tried this example and it gave a compilation error. I think this needs to
 be changed to people.filter(people("age") > 30)

 Also, it would be good to add some examples for the new equality operator
 for columns (e.g. people("age") === 30). The programming guide does not
 have an example for this in the DataFrame Operations section and it was not
 very obvious that we need to be using a different equality operator for
 columns.


 thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/filter-expression-in-API-document-for-DataFrame-tp22213.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Is there any way that I can install the new one and remove the previous version?
I installed spark 1.3 on my EC2 master and set the spark home to the new
one.
But when I start the spark-shell I get -
 java.lang.UnsatisfiedLinkError:
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
at
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native
Method)

Is there no way to upgrade without creating a new cluster?
Thanks
Roni



On Wed, Mar 25, 2015 at 1:18 PM, Dean Wampler deanwamp...@gmail.com wrote:

 Yes, that's the problem. The RDD class exists in both binary jar files,
 but the signatures probably don't match. The bottom line, as always for
 tools like this, is that you can't mix versions.
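
 A build.sbt sketch of what keeping versions aligned looks like (the version
 number and the "provided" scope are the usual convention here, not something
 taken from this thread):

   // build.sbt - keep every Spark artifact on the cluster's version
   val sparkVersion = "1.3.0"

   libraryDependencies ++= Seq(
     "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
     "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
   )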

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 3:13 PM, roni roni.epi...@gmail.com wrote:

 My cluster is still on spark 1.2 and in SBT I am using 1.3.
 So probably it is compiling with 1.3 but running with 1.2 ?

 On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 Weird. Are you running using SBT console? It should have the spark-core
 jar on the classpath. Similarly, spark-shell or spark-submit should work,
 but be sure you're using the same version of Spark when running as when
 compiling. Also, you might need to add spark-sql to your SBT dependencies,
 but that shouldn't be this issue.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 12:09 PM, roni roni.epi...@gmail.com wrote:

 Thanks Dean and Nick.
 So, I removed the ADAM and H2o from my SBT as I was not using them.
  I got the code to compile - only to fail while running with -
 SparkContext: Created broadcast 1 from textFile at
 kmerIntersetion.scala:21
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/rdd/RDD$
 at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

 This line is where I do a JOIN operation.
  val hgPair = hgfasta.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
  val filtered = hgPair.filter(kv => kv._2 == 1)
  val bedPair = bedFile.map(_.split(",")).map(a => (a(0), a(1).trim().toInt))
  val joinRDD = bedPair.join(filtered)
 Any idea whats going on?
  I have data on the EC2 cluster, so I am avoiding creating a new cluster and am
  just upgrading and changing the code to use 1.3 and Spark SQL.
 Thanks
 Roni



 On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler deanwamp...@gmail.com
 wrote:

 For the Spark SQL parts, 1.3 breaks backwards compatibility, because
 before 1.3, Spark SQL was considered experimental where API changes were
 allowed.

  So, H2O and ADAM versions compatible with 1.2.X might not work with 1.3.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 9:39 AM, roni roni.epi...@gmail.com wrote:

  Even if H2O and ADAM depend on 1.2.1, it should be backward
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
  I tried sbt clean compile ... same error
 Thanks
 _R

 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:

 What version of Spark do the other dependencies rely on (Adam and
 H2O?) - that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote:

  I have an EC2 cluster created using spark version 1.2.1.
  And I have an SBT project.
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

  But then I started getting compilation errors, along with the following.
  Here are some of the libraries that were evicted:
  [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
  [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0)
  -> 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty = joinRDD.map{ case (word, (file1Counts, file2Counts)) =>
 KmerIntesect(word, file1Counts, "xyz") }
 [error]

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
Thanks Peter,



I ended up doing something similar. However, I consider both of the approaches
you mentioned bad practice, which is why I was looking for a solution
directly supported by the current code.




I can work with that now, but it does not seem to be the proper solution.




Regards,

Martin





-- Original message --
From: Peter Rudenko petro.rude...@gmail.com
To: zapletal-mar...@email.cz, Sean Owen so...@cloudera.com
Date: 25. 3. 2015 13:28:38
Subject: Re: Spark ML Pipeline inaccessible types




 Hi Martin, here are two possibilities to overcome this:

 1) Put your logic into the org.apache.spark package in your project - then
 everything becomes accessible (a small sketch of this is below, after option 2).
 2) Dirty trick:

 object SparkVector extends HashingTF {
   val VectorUDT: DataType = outputDataType
 }


 then you can do like this:

 StructType(vectorTypeColumn, SparkVector.VectorUDT, false))
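
 A small sketch of what option (1) above can look like: a helper object that
 is compiled into the org.apache.spark package tree of your own project, so
 that private[spark] types such as VectorUDT become reachable (the package
 and object names here are invented):

   package org.apache.spark.ml.helpers

   import org.apache.spark.mllib.linalg.VectorUDT
   import org.apache.spark.sql.types.{DataType, StructField}

   object VectorTypes {
     // visible here because this file lives under org.apache.spark
     val vectorType: DataType = new VectorUDT
     def featuresField(name: String): StructField =
       StructField(name, vectorType, nullable = false)
   }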


 Thanks,
 Peter Rudenko

 On 2015-03-25 13:14, zapletal-mar...@email.cz wrote:

 



Sean, 



thanks for your response. I am familiar with NoSuchMethodException in 
general, but I think it is not the case this time. The code actually 
attempts to get parameter by name using val m = this.getClass.getMethodName
(paramName).




This may be a bug, but it is only a side effect caused by the real problem I
am facing. My issue is that VectorUDT is not accessible by user code and 
therefore it is not possible to use custom ML pipeline with the existing 
Predictors (see the last two paragraphs in my first email).




Best Regards,

Martin



-- Original message --
From: Sean Owen so...@cloudera.com
To: zapletal-mar...@email.cz
Date: 25. 3. 2015 11:05:54
Subject: Re: Spark ML Pipeline inaccessible types

NoSuchMethodError in general means that your runtime and compile-time
environments are different. I think you need to first make sure you
don't have mismatching versions of Spark.

On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote:
 Hi,

 I have started implementing a machine learning pipeline using Spark 1.3.0
 and the new pipelining API and DataFrames. I got to a point where I have 
my
 training data set prepared using a sequence of Transformers, but I am
 struggling to actually train a model and use it for predictions.

 I am getting a java.lang.NoSuchMethodException:
 org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
 exception thrown at checkInputColumn method in Params trait when using a
 Predictor (LinearRegression in my case, but that should not matter). This
 looks like a bug - the exception is thrown when executing getParam
(colName)
 when the require(actualDataType.equals(datatype), ...) requirement is not
 met so the expected requirement failed exception is not thrown and is 
hidden
 by the unexpected NoSuchMethodException instead. I can raise a bug if this
 really is an issue and I am not using something incorrectly.

 The problem I am facing however is that the Predictor expects features to
 have VectorUDT type as defined in Predictor class (protected def
 featuresDataType: DataType = new VectorUDT). But since this type is
 private[spark] my Transformer can not prepare features with this type 
which
 then correctly results in the exception above when I use a different type.

 Is there a way to define a custom Pipeline that would be able to use the
 existing Predictors without having to bypass the access modifiers or
 reimplement something or is the pipelining API not yet expected to be used
 in this way?

 Thanks,
 Martin



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
 



 






Cross-compatibility of YARN shuffle service

2015-03-25 Thread Matt Cheah
Hi everyone,

I am considering moving from Spark-Standalone to YARN. The context is that
there are multiple Spark applications that are using different versions of
Spark that all want to use the same YARN cluster.

My question is: if I use a single Spark YARN shuffle service jar on the Node
Manager, will the service work properly with all of the Spark applications,
regardless of the specific versions of the applications? Or, is it the
case that, if I want to use the external shuffle service, I need to have all
of my applications using the same version of Spark?

Thanks,

-Matt Cheah
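
As background on the service itself: the NodeManager-side wiring is only a
couple of settings. A sketch of the usual setup (property names worth
double-checking against the docs for your Spark version):

  // application side: tell Spark to hand shuffle files to the external service
  val conf = new org.apache.spark.SparkConf()
    .set("spark.shuffle.service.enabled", "true")

  // NodeManager side (yarn-site.xml), noted as comments since it is XML config:
  //   yarn.nodemanager.aux-services                     -> add spark_shuffle
  //   yarn.nodemanager.aux-services.spark_shuffle.class -> org.apache.spark.network.yarn.YarnShuffleService
  // plus the spark-<version>-yarn-shuffle.jar on the NodeManager classpath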






  1   2   >