Re: Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-16 Thread Kevin Peng
Sean,

Thanks.  That worked.

Kevin

On Mon, Sep 15, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote:

 This is more of a Java / Maven issue than Spark per se. I would use
 the shade plugin to remove signature files in your final META-INF/
 dir. As Spark does, in its configuration:

 <filters>
   <filter>
     <artifact>*:*</artifact>
     <excludes>
       <exclude>org/datanucleus/**</exclude>
       <exclude>META-INF/*.SF</exclude>
       <exclude>META-INF/*.DSA</exclude>
       <exclude>META-INF/*.RSA</exclude>
     </excludes>
   </filter>
 </filters>

 On Mon, Sep 15, 2014 at 11:33 PM, kpeng1 kpe...@gmail.com wrote:
  Hi All,
 
  I am trying to submit a spark job that I have built in maven using the
  following command:
  /usr/bin/spark-submit --deploy-mode client --class com.spark.TheMain
  --master local[1] /home/cloudera/myjar.jar 100
 
  But I seem to be getting the following error:
  Exception in thread main java.lang.SecurityException: Invalid signature
  file digest for Manifest main attributes
  at
 
 sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:286)
  at
 
 sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:239)
  at java.util.jar.JarVerifier.processEntry(JarVerifier.java:307)
  at java.util.jar.JarVerifier.update(JarVerifier.java:218)
  at java.util.jar.JarFile.initializeVerifier(JarFile.java:345)
  at java.util.jar.JarFile.getInputStream(JarFile.java:412)
  at
 sun.misc.URLClassPath$JarLoader$2.getInputStream(URLClassPath.java:775)
  at sun.misc.Resource.cachedInputStream(Resource.java:77)
  at sun.misc.Resource.getByteBuffer(Resource.java:160)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:436)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:270)
  at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:289)
  at
 org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 
 
  Here is the pom file I am using to build the jar:
  <project xmlns="http://maven.apache.org/POM/4.0.0"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
           http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.spark</groupId>
    <artifactId>myjar</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>${project.artifactId}</name>
    <description>My wonderfull scala app</description>
    <inceptionYear>2010</inceptionYear>
    <licenses>
      <license>
        <name>My License</name>
        <url>http://</url>
        <distribution>repo</distribution>
      </license>
    </licenses>
 
    <properties>
      <cdh.version>cdh5.1.0</cdh.version>
      <maven.compiler.source>1.6</maven.compiler.source>
      <maven.compiler.target>1.6</maven.compiler.target>
      <encoding>UTF-8</encoding>
      <scala.tools.version>2.10</scala.tools.version>
      <scala.version>2.10.4</scala.version>
    </properties>
 
    <repositories>
      <repository>
        <id>scala-tools.org</id>
        <name>Scala-tools Maven2 Repository</name>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
      </repository>
      <repository>
        <id>maven-hadoop</id>
        <name>Hadoop Releases</name>
        <url>https://repository.cloudera.com/content/repositories/releases/</url>
      </repository>
      <repository>
        <id>cloudera-repos</id>
        <name>Cloudera Repos</name>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
      </repository>
    </repositories>
    <pluginRepositories>
      <pluginRepository>
        <id>scala-tools.org</id>
        <name>Scala-tools Maven2 Repository</name>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
      </pluginRepository>
    </pluginRepositories>
 
    <dependencies>
      <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.0.0-${cdh.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-tools_2.10</artifactId>
        <version>1.0.0-${cdh.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume_2.10</artifactId>

Re: spark and mesos issue

2014-09-16 Thread Gurvinder Singh
It might not be related only to the memory issue. The memory issue is also
there, as you mentioned; I have seen that one too. The fine-grained mode issue
is mainly Spark considering that it got two different block managers
for the same ID, whereas if I search for the ID in the mesos slave, it
exists only on one slave, not on multiple of them. This might be
due to the size of the ID, as Spark outputs the error as

14/09/16 08:04:29 ERROR BlockManagerMasterActor: Got two different
block manager registrations on 20140822-112818-711206558-5050-25951-0

whereas in the mesos slave I see logs as

I0915 20:55:18.293903 31434 containerizer.cpp:392] Starting container
'3aab2237-d32f-470d-a206-7bada454ad3f' for executor
'20140822-112818-711206558-5050-25951-0' of framework
'20140822-112818-711206558-5050-25951-0053'

I0915 20:53:28.039218 31437 containerizer.cpp:392] Starting container
'fe4b344f-16c9-484a-9c2f-92bd92b43f6d' for executor
'20140822-112818-711206558-5050-25951-0' of framework
'20140822-112818-711206558-5050-25951-0050'


You can see the last 3 digits of the ID are missing in Spark, whereas they are
different in the mesos slaves.

- Gurvinder
On 09/15/2014 11:13 PM, Brenden Matthews wrote:
 I started hitting a similar problem, and it seems to be related to 
 memory overhead and tasks getting OOM killed.  I filed a ticket
 here:
 
 https://issues.apache.org/jira/browse/SPARK-3535
 
 On Wed, Jul 16, 2014 at 5:27 AM, Ray Rodriguez
 rayrod2...@gmail.com mailto:rayrod2...@gmail.com wrote:
 
 I'll set some time aside today to gather and post some logs and 
 details about this issue from our end.
 
 
 On Wed, Jul 16, 2014 at 2:05 AM, Vinod Kone vinodk...@gmail.com 
 mailto:vinodk...@gmail.com wrote:
 
 
 
 
 On Tue, Jul 15, 2014 at 11:02 PM, Vinod Kone vi...@twitter.com 
 mailto:vi...@twitter.com wrote:
 
 
 On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh 
 gurvinder.si...@uninett.no mailto:gurvinder.si...@uninett.no
 wrote:
 
 ERROR storage.BlockManagerMasterActor: Got two different block
 manager registrations on 201407031041-1227224054-5050-24004-0
 
 Googling about it seems that mesos is starting slaves at the same
 time and giving them the same id. So may bug in mesos ?
 
 
 Has this issue been resolved? We need more information to triage
 this. Maybe some logs that show the lifecycle of the duplicate
 instances?
 
 
 @vinodkone
 
 
 
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread linkpatrickliu
Seems like the thriftServer cannot connect to Zookeeper, so it cannot get the
lock.

This is how the log looks when I run SparkSQL:
load data inpath 'kv1.txt' into table src;
log:
14/09/16 14:40:47 INFO Driver: PERFLOG method=acquireReadWriteLocks
14/09/16 14:40:47 INFO ClientCnxn: Opening socket connection to server
SVR4044HW2285.hadoop.lpt.qa.nt.ctripcorp.com/10.2.4.191:2181. Will not
attempt to authenticate using SASL (unknown error)
14/09/16 14:40:47 INFO ClientCnxn: Socket connection established to
SVR4044HW2285.hadoop.lpt.qa.nt.ctripcorp.com/10.2.4.191:2181, initiating
session
14/09/16 14:40:47 INFO ClientCnxn: Session establishment complete on server
SVR4044HW2285.hadoop.lpt.qa.nt.ctripcorp.com/10.2.4.191:2181, sessionid =
0x347c1b1f78d495e, negotiated timeout = 18
14/09/16 14:40:47 INFO Driver: /PERFLOG method=acquireReadWriteLocks
start=1410849647447 end=1410849647457 duration=10

You can see that, between the PERFLOG entries for acquireReadWriteLocks, the ClientCnxn
will try to connect to Zookeeper. After the connection has been successfully
established, the acquireReadWriteLocks phase can finish.

But, when I run the ThriftServer, and run the same SQL.
Here is the log:
14/09/16 14:40:09 INFO Driver: PERFLOG method=acquireReadWriteLocks

It will wait here.
So I suspect the reason why DROP or LOAD fails in thriftServer mode is
that the thriftServer cannot connect to Zookeeper.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14336.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
hi,

I am trying to write some unit test, following spark programming guide
http://spark.apache.org/docs/latest/programming-guide.html#unit-testing.
but I observed that the unit test runs very slowly (the code is just a SparkPi), so I
turned the log level to trace and looked through the log output, and found that
creation of the SparkContext seems to take most of the time.

there are some actions that take a lot of time:
1, seems starting jetty
14:04:55.038 [ScalaTest-run-running-MySparkPiSpec] DEBUG
o.e.jetty.servlet.ServletHandler -
servletNameMap={org.apache.spark.ui.JettyUtils$$anon$1-672f11c2=org.apache.spark.ui.JettyUtils$$anon$1-672f11c2}
14:05:25.121 [ScalaTest-run-running-MySparkPiSpec] DEBUG
o.e.jetty.util.component.Container - Container
org.eclipse.jetty.server.Server@e3cee7b +
SelectChannelConnector@0.0.0.0:4040 as connector

2, I don't know what this is:
14:05:25.202 [ScalaTest-run-running-MySparkPiSpec] DEBUG
org.apache.hadoop.security.Groups - Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
14:05:54.594 [spark-akka.actor.default-dispatcher-4] TRACE
o.a.s.s.BlockManagerMasterActor - Checking for hosts with no recent heart
beats in BlockManagerMaster.


is there any way to make this faster for unit tests?
I also notice in spark's unit tests that there exists a SharedSparkContext
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SharedSparkContext.scala,
is this the suggested way to work around this problem? if so, I would
suggest documenting this in the programming guide.
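
For reference, a minimal sketch of such a shared-context trait (assuming ScalaTest; the trait name and settings below are illustrative, not Spark's own SharedSparkContext):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

// Mix this trait into test suites so that a single SparkContext is created
// once per suite instead of once per test case.
trait SharedLocalSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient private var _sc: SparkContext = _
  def sc: SparkContext = _sc

  override def beforeAll(): Unit = {
    super.beforeAll()
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName(getClass.getSimpleName)
    _sc = new SparkContext(conf)
  }

  override def afterAll(): Unit = {
    try {
      if (_sc != null) _sc.stop()
      _sc = null
    } finally {
      super.afterAll()
    }
  }
}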


RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread linkpatrickliu
Hi, Hao Cheng.

I have done other tests. And the result shows the thriftServer can connect
to Zookeeper.

However, I found some more interesting things. And I think I have found a
bug!

Test procedure:
Test1:
(0) Use beeline to connect to thriftServer.
(1) Switch database: "use dw_op1;" (OK)
The logs show that the thriftServer connected with Zookeeper and acquired
locks.

(2) Drop table: "drop table src;" (Blocked)
The logs show that the thriftServer is stuck in acquireReadWriteLocks.

Doubt:
The reason why I cannot drop table src is because the first SQL ("use dw_op1")
left locks in Zookeeper that were never successfully released.
So when the second SQL is acquiring locks in Zookeeper, it will block.

Test2:
Restart thriftServer.
Instead of switching to another database, I just drop the table in the
default database:
(0) Restart thriftServer & use beeline to connect to thriftServer.
(1) Drop table: "drop table src;" (OK)
Amazing! It succeeds!
(2) Drop again: "drop table src2;" (Blocked)
Same error: the thriftServer is blocked in the acquireReadWriteLocks
phase.

As you can see, only the first SQL requiring locks can succeed.
So I think the reason is that the thriftServer cannot release locks
correctly in Zookeeper.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
Thanks Christian!  I tried compiling from source but am still getting the
same hadoop client version error when reading from HDFS.  Will have to poke
deeper... perhaps I've got some classpath issues.  FWIW I compiled using:

$ MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

and hadoop 2.3 / cdh5 from
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz





On Mon, Sep 15, 2014 at 6:49 PM, Christian Chua cc8...@icloud.com wrote:

 Hi Paul.

 I would recommend building your own 1.1.0 distribution.

 ./make-distribution.sh --name hadoop-personal-build-2.4 --tgz -Pyarn
 -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests



 I downloaded the Pre-build for Hadoop 2.4 binary, and it had this
 strange behavior where

 spark-submit --master yarn-cluster ...

 will work, but

 spark-submit --master yarn-client ...

 will fail.


 But on the personal build obtained from the command above, both will then
 work.


 -Christian




 On Sep 15, 2014, at 6:28 PM, Paul Wais pw...@yelp.com wrote:

 Dear List,

 I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
 reading SequenceFiles.  In particular, I'm seeing:

 Exception in thread main org.apache.hadoop.ipc.RemoteException:
 Server IPC version 7 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
...

 When invoking JavaSparkContext#newAPIHadoopFile().  (With args
 validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
 BytesWritable.class, new Job().getConfiguration() -- Pretty close to
 the unit test here:

 https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
 )


 This error indicates to me that Spark is using an old hadoop client to
 do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
 via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.


 Do I need to explicitly build spark for modern hadoop??  I previously
 had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
 error (server is using version 9, client is using version 4).


 I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
 on spark's site:
 * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
 * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz


 What distro of hadoop is used at Data Bricks?  Are there distros of
 Spark 1.1 and hadoop that should work together out-of-the-box?
 (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)

 Thanks for any help anybody can give me here!
 -Paul

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: SchemaRDD saveToCassandra

2014-09-16 Thread lmk
Hi Michael,
Please correct me if I am wrong. The error seems to originate from spark
only. Please have a look at the stack trace of the error which is as
follows:

[error] (run-main-0) java.lang.NoSuchMethodException: Cannot resolve any
suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row
java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for
class org.apache.spark.sql.catalyst.expressions.Row
at
com.datastax.spark.connector.rdd.reader.AnyObjectFactory$$anonfun$resolveConstructor$2.apply(AnyObjectFactory.scala:134)
at
com.datastax.spark.connector.rdd.reader.AnyObjectFactory$$anonfun$resolveConstructor$2.apply(AnyObjectFactory.scala:134)
at scala.util.Try.getOrElse(Try.scala:77)
at
com.datastax.spark.connector.rdd.reader.AnyObjectFactory$.resolveConstructor(AnyObjectFactory.scala:133)
at
com.datastax.spark.connector.mapper.ReflectionColumnMapper.columnMap(ReflectionColumnMapper.scala:46)
at
com.datastax.spark.connector.writer.DefaultRowWriter.init(DefaultRowWriter.scala:20)
at
com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:109)
at
com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:107)
at
com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:171)
at
com.datastax.spark.connector.RDDFunctions.tableWriter(RDDFunctions.scala:76)
at
com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:56)
at ScalaSparkCassandra$.main(ScalaSparkCassandra.scala:66)
at ScalaSparkCassandra.main(ScalaSparkCassandra.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)

Please clarify.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-saveToCassandra-tp13951p14340.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
I connected my sample project to a hosted CI service; it only takes 3 seconds
to run there... while the same tests take 2 minutes on my MacBook Pro. So
maybe this is a Mac OS specific problem?

On Tue, Sep 16, 2014 at 3:06 PM, 诺铁 noty...@gmail.com wrote:

 hi,

 I am trying to write some unit test, following spark programming guide
 http://spark.apache.org/docs/latest/programming-guide.html#unit-testing
 .
 but I observed unit test runs very slow(the code is just a SparkPi),  so I
 turn log level to trace and look through the log output.  and found
 creation of SparkContext seems to take most time.

 there are some action take a lot of time:
 1, seems starting jetty
 14:04:55.038 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 o.e.jetty.servlet.ServletHandler -
 servletNameMap={org.apache.spark.ui.JettyUtils$$anon$1-672f11c2=org.apache.spark.ui.JettyUtils$$anon$1-672f11c2}
 14:05:25.121 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 o.e.jetty.util.component.Container - Container
 org.eclipse.jetty.server.Server@e3cee7b +
 SelectChannelConnector@0.0.0.0:4040 as connector

 2, I don't know what's this
 14:05:25.202 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 org.apache.hadoop.security.Groups - Group mapping
 impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
 cacheTimeout=30
 14:05:54.594 [spark-akka.actor.default-dispatcher-4] TRACE
 o.a.s.s.BlockManagerMasterActor - Checking for hosts with no recent heart
 beats in BlockManagerMaster.


 are there any way to make this faster for unit test?
 I also notice in spark's unit test, that there exists a SharedSparkContext
 https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SharedSparkContext.scala,
 is this the suggested way to work around this problem? if so, I would
 suggest to document this in programming guide.



Re: Broadcast error

2014-09-16 Thread Chengi Liu
Cool.. while I try that.. any other suggestion(s) on things I can try?

On Mon, Sep 15, 2014 at 9:59 AM, Davies Liu dav...@databricks.com wrote:

 I think 1.1 will be really helpful for you; it's all compatible
 with 1.0, so it's
 not hard to upgrade to 1.1.

 On Mon, Sep 15, 2014 at 2:35 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:
  So.. same result with parallelize (matrix,1000)
  with broadcast.. seems like I got jvm core dump :-/
  4/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager
 host:47978
  with 19.2 GB RAM
  14/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager
  host:43360 with 19.2 GB RAM
  Unhandled exception
  Unhandled exception
  Type=Segmentation error vmState=0x
  J9Generic_Signal_Number=0004 Signal_Number=000b
 Error_Value=
  Signal_Code=0001
  Handler1=2BF53760 Handler2=2C3069D0
  InaccessibleAddress=
  RDI=2AB9505F2698 RSI=2AABAE2C54D8 RAX=2AB7CE6009A0
  RBX=2AB7CE6009C0
  RCX=FFC7FFE0 RDX=2AB8509726A8 R8=7FE41FF0
  R9=2000
  R10=2DA318A0 R11=2AB850959520 R12=2AB5EF97DD88
  R13=2AB5EF97BD88
  R14=2C0CE940 R15=2AB5EF97BD88
  RIP= GS= FS= RSP=007367A0
  EFlags=00210282 CS=0033 RBP=00BCDB00 ERR=0014
  TRAPNO=000E OLDMASK= CR2=
  xmm0 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm1 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm2 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm3 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm4 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm5 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm6 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm7 f180c714f8e2a139 (f: 4175601920.00, d: -5.462583e+238)
  xmm8 428e8000 (f: 1116635136.00, d: 5.516911e-315)
  xmm9  (f: 0.00, d: 0.00e+00)
  xmm10  (f: 0.00, d: 0.00e+00)
  xmm11  (f: 0.00, d: 0.00e+00)
  xmm12  (f: 0.00, d: 0.00e+00)
  xmm13  (f: 0.00, d: 0.00e+00)
  xmm14  (f: 0.00, d: 0.00e+00)
  xmm15  (f: 0.00, d: 0.00e+00)
  Target=2_60_20140106_181350 (Linux 3.0.93-0.8.2_1.0502.8048-cray_ari_c)
  CPU=amd64 (48 logical CPUs) (0xfc0c5b000 RAM)
  --- Stack Backtrace ---
  (0x2C2FA122 [libj9prt26.so+0x13122])
  (0x2C30779F [libj9prt26.so+0x2079f])
  (0x2C2F9E6B [libj9prt26.so+0x12e6b])
  (0x2C2F9F67 [libj9prt26.so+0x12f67])
  (0x2C30779F [libj9prt26.so+0x2079f])
  (0x2C2F9A8B [libj9prt26.so+0x12a8b])
  (0x2BF52C9D [libj9vm26.so+0x1ac9d])
  (0x2C30779F [libj9prt26.so+0x2079f])
  (0x2BF52F56 [libj9vm26.so+0x1af56])
  (0x2BF96CA0 [libj9vm26.so+0x5eca0])
  ---
  JVMDUMP039I
  JVMDUMP032I
 
 
  Note, this still is with the framesize I modified in the last email
 thread?
 
  On Mon, Sep 15, 2014 at 2:12 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:
 
  Try:
 
  rdd = sc.broadcast(matrix)
 
  Or
 
  rdd = sc.parallelize(matrix,100) // Just increase the number of slices,
  give it a try.
 
 
 
  Thanks
  Best Regards
 
  On Mon, Sep 15, 2014 at 2:18 PM, Chengi Liu chengi.liu...@gmail.com
  wrote:
 
  Hi Akhil,
So with your config (specifically with set(spark.akka.frameSize ,
  1000)) , I see the error:
  org.apache.spark.SparkException: Job aborted due to stage failure:
  Serialized task 0:0 was 401970046 bytes which exceeds
 spark.akka.frameSize
  (10485760 bytes). Consider using broadcast variables for large values.
  at
  org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
  at org.apache.spark
 
  So, I changed
  set(spark.akka.frameSize , 1000) to set(spark.akka.frameSize
 ,
  10)
  but now I get the same error?
 
  y4j.protocol.Py4JJavaError: An error occurred while calling
  o28.trainKMeansModel.
  : org.apache.spark.SparkException: Job aborted due to stage failure:
 All
  masters are unresponsive! Giving up.
  at
  org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
  at
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at
 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
  at
 
 

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Christian Chua
Is 1.0.8 working for you ?

You indicated your last known good version is 1.0.0

Maybe we can track down where it broke. 



 On Sep 16, 2014, at 12:25 AM, Paul Wais pw...@yelp.com wrote:
 
 Thanks Christian!  I tried compiling from source but am still getting the 
 same hadoop client version error when reading from HDFS.  Will have to poke 
 deeper... perhaps I've got some classpath issues.  FWIW I compiled using:
 
 $ MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m mvn 
 -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package 
 
 and hadoop 2.3 / cdh5 from 
 http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz
 
 
 
 
 
 On Mon, Sep 15, 2014 at 6:49 PM, Christian Chua cc8...@icloud.com wrote:
 Hi Paul.
 
 I would recommend building your own 1.1.0 distribution.
 
  ./make-distribution.sh --name hadoop-personal-build-2.4 --tgz -Pyarn 
 -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
 
 
 
 I downloaded the Pre-build for Hadoop 2.4 binary, and it had this strange 
 behavior where
 
  spark-submit --master yarn-cluster ...
 
 will work, but
 
  spark-submit --master yarn-client ...
 
 will fail.
 
 
 But on the personal build obtained from the command above, both will then 
 work.
 
 
 -Christian
 
 
 
 
 On Sep 15, 2014, at 6:28 PM, Paul Wais pw...@yelp.com wrote:
 
 Dear List,
 
 I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
 reading SequenceFiles.  In particular, I'm seeing:
 
 Exception in thread main org.apache.hadoop.ipc.RemoteException:
 Server IPC version 7 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
...
 
 When invoking JavaSparkContext#newAPIHadoopFile().  (With args
 validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
 BytesWritable.class, new Job().getConfiguration() -- Pretty close to
 the unit test here:
 https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
 )
 
 
 This error indicates to me that Spark is using an old hadoop client to
 do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
 via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.
 
 
 Do I need to explicitly build spark for modern hadoop??  I previously
 had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
 error (server is using version 9, client is using version 4).
 
 
 I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
 on spark's site:
 * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
 * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz
 
 
 What distro of hadoop is used at Data Bricks?  Are there distros of
 Spark 1.1 and hadoop that should work together out-of-the-box?
 (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)
 
 Thanks for any help anybody can give me here!
 -Paul
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


Re: Spark SQL Thrift JDBC server deployment for production

2014-09-16 Thread vasiliy
it works, thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thrift-JDBC-server-deployment-for-production-tp13947p14345.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
sorry for the disturbance, please ignore this mail.
in the end, I found it was slow because of a lack of memory on my machine..

sorry again.

On Tue, Sep 16, 2014 at 3:26 PM, 诺铁 noty...@gmail.com wrote:

 I connect my sample project to a hosted CI service, it only takes 3
 seconds to run there...while the same tests takes 2minutes on my macbook
 pro.  so maybe this is a mac os specific problem?

 On Tue, Sep 16, 2014 at 3:06 PM, 诺铁 noty...@gmail.com wrote:

 hi,

 I am trying to write some unit test, following spark programming guide
 http://spark.apache.org/docs/latest/programming-guide.html#unit-testing
 .
 but I observed unit test runs very slow(the code is just a SparkPi),  so
 I turn log level to trace and look through the log output.  and found
 creation of SparkContext seems to take most time.

 there are some action take a lot of time:
 1, seems starting jetty
 14:04:55.038 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 o.e.jetty.servlet.ServletHandler -
 servletNameMap={org.apache.spark.ui.JettyUtils$$anon$1-672f11c2=org.apache.spark.ui.JettyUtils$$anon$1-672f11c2}
 14:05:25.121 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 o.e.jetty.util.component.Container - Container
 org.eclipse.jetty.server.Server@e3cee7b +
 SelectChannelConnector@0.0.0.0:4040 as connector

 2, I don't know what's this
 14:05:25.202 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 org.apache.hadoop.security.Groups - Group mapping
 impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
 cacheTimeout=30
 14:05:54.594 [spark-akka.actor.default-dispatcher-4] TRACE
 o.a.s.s.BlockManagerMasterActor - Checking for hosts with no recent heart
 beats in BlockManagerMaster.


 are there any way to make this faster for unit test?
 I also notice in spark's unit test, that there exists a
 SharedSparkContext
 https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SharedSparkContext.scala,
 is this the suggested way to work around this problem? if so, I would
 suggest to document this in programming guide.





Re: Serving data

2014-09-16 Thread Marius Soutier
Writing to Parquet and querying the result via SparkSQL works great (except for 
some strange SQL parser errors). However, the problem remains: how do I get that 
data back to a dashboard? So I guess I'll have to use a database after all.


You can batch up data & store it into parquet partitions as well. Query it using 
another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I believe. 
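
For reference, a minimal Spark 1.1 sketch of that batch-to-Parquet-then-query pattern (the paths and the Event case class are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Event(user: String, amount: Double)

object ParquetServingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-serving-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD (Spark 1.1)

    // batch up data and store it as one Parquet "partition" (directory)
    val batch = sc.parallelize(Seq(Event("a", 1.0), Event("b", 2.5)))
    batch.saveAsParquetFile("hdfs:///data/events/batch-0001.parquet")

    // later -- possibly from another SparkSQL shell or the JDBC server --
    // read the Parquet data back and query it
    val events = sqlContext.parquetFile("hdfs:///data/events/batch-0001.parquet")
    events.registerTempTable("events")
    sqlContext.sql("SELECT user, SUM(amount) FROM events GROUP BY user")
      .collect()
      .foreach(println)

    sc.stop()
  }
}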


RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Cheng, Hao
Thank you for pasting the steps. I will look at this and hopefully come up with a 
solution soon.

-Original Message-
From: linkpatrickliu [mailto:linkpatrick...@live.com] 
Sent: Tuesday, September 16, 2014 3:17 PM
To: u...@spark.incubator.apache.org
Subject: RE: SparkSQL 1.1 hang when DROP or LOAD

Hi, Hao Cheng.

I have done other tests. And the result shows the thriftServer can connect to 
Zookeeper.

However, I found some more interesting things. And I think I have found a bug!

Test procedure:
Test1:
(0) Use beeline to connect to thriftServer.
(1) Switch database use dw_op1; (OK)
The logs show that the thriftServer connected with Zookeeper and acquired locks.

(2) Drop table drop table src; (Blocked) The logs show that the thriftServer 
is acquireReadWriteLocks.

Doubt:
The reason why I cannot drop table src is because the first SQL use dw_op1
have left locks in Zookeeper  unsuccessfully released.
So when the second SQL is acquiring locks in Zookeeper, it will block.

Test2:
Restart thriftServer.
Instead of switching to another database, I just drop the table in the default 
database;
(0) Restart thriftServer  use beeline to connect to thriftServer.
(1) Drop table drop table src; (OK)
Amazing! Succeed!
(2) Drop again!  drop table src2; (Blocked) Same error: the thriftServer is 
blocked in the acquireReadWriteLocks
phrase.

As you can see. 
Only the first SQL requiring locks can succeed.
So I think the reason is that the thriftServer cannot release locks correctly 
in Zookeeper.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Yifan LI
Dear Ankur,

Thanks! :)

- from [1], and my understanding, the existing inactive feature in graphx 
pregel api is “if there is no in-edges, from active vertex, to this vertex, 
then we will say this one is inactive”, right?

For instance, there is a graph in which every vertex has at least one in-edges, 
then we run static Pagerank on it for 10 iterations. During this calculation, 
is there any vertex would be set inactive?


- for more “explicit active vertex tracking”, e.g. vote to halt, how to achieve 
it in existing api?
(I am not sure I got the point of [2], that “vote” function has already been 
introduced in graphx pregel api? )


Best,
Yifan LI

On 15 Sep 2014, at 23:07, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-09-15 16:25:04 +0200, Yifan LI iamyifa...@gmail.com wrote:
 I am wondering if the vertex active/inactive(corresponding the change of its 
 value between two supersteps) feature is introduced in Pregel API of GraphX?
 
 Vertex activeness in Pregel is controlled by messages: if a vertex did not 
 receive a message in the previous iteration, its vertex program will not run 
 in the current iteration. Also, inactive vertices will not be able to send 
 messages because by default the sendMsg function will only be run on edges 
 where at least one of the adjacent vertices received a message. You can 
 change this behavior -- see the documentation for the activeDirection 
 parameter to Pregel.apply [1].
 
 There is also an open pull request to make active vertex tracking more 
 explicit by allowing vertices to vote to halt directly [2].
 
 Ankur
 
 [1] 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Pregel$
 [2] https://github.com/apache/spark/pull/1217



Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 10:55:37 +0200, Yifan LI iamyifa...@gmail.com wrote:
 - from [1], and my understanding, the existing inactive feature in graphx 
 pregel api is “if there is no in-edges, from active vertex, to this vertex, 
 then we will say this one is inactive”, right?

Well, that's true when messages are only sent forward along edges (from the 
source to the destination) and the activeDirection is EdgeDirection.Out. If 
both of these conditions are true, then a vertex without in-edges cannot 
receive a message, and therefore its vertex program will never run and a 
message will never be sent along its out-edges. PageRank is an application that 
satisfies both the conditions.
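
For illustration, a minimal sketch of a Pregel call where messages only flow forward along edges and activeDirection is EdgeDirection.Out (the vertex and message semantics here are placeholders, not the built-in PageRank implementation):

import org.apache.spark.graphx._

// Vertex values and messages are Doubles, messages flow src -> dst only, and
// activeDirection = EdgeDirection.Out means sendMsg is only run on out-edges
// of vertices that received a message in the previous round.
def propagate(graph: Graph[Double, Double], maxIters: Int): Graph[Double, Double] =
  Pregel(graph, initialMsg = 0.0, maxIterations = maxIters,
         activeDirection = EdgeDirection.Out)(
    (id, attr, msg) => attr + msg,                                        // vertex program
    triplet => Iterator((triplet.dstId, triplet.srcAttr * triplet.attr)), // sendMsg
    (a, b) => a + b                                                       // mergeMsg
  )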

 For instance, there is a graph in which every vertex has at least one 
 in-edges, then we run static Pagerank on it for 10 iterations. During this 
 calculation, is there any vertex would be set inactive?

No: since every vertex always sends a message in static PageRank, if every 
vertex has an in-edge, it will always receive a message and will remain active.

In fact, this is why I recently rewrote static PageRank not to use Pregel [3]. 
Assuming that most vertices do have in-edges, it's unnecessary to track active 
vertices, which can provide a big savings.

 - for more “explicit active vertex tracking”, e.g. vote to halt, how to 
 achieve it in existing api?
 (I am not sure I got the point of [2], that “vote” function has already been 
 introduced in graphx pregel api? )

The current Pregel API effectively makes every vertex vote to halt in every 
superstep. Therefore only vertices that receive messages get awoken in the next 
superstep.

Instead, [2] proposes to make every vertex run by default in every superstep 
unless it votes to halt *and* receives no messages. This allows a vertex to 
have more control over whether or not it will run, rather than leaving that 
entirely up to its neighbors.

Ankur

 [1] 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Pregel$
 [2] https://github.com/apache/spark/pull/1217
[3] https://github.com/apache/spark/pull/2308

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 12:23:10 +0200, Yifan LI iamyifa...@gmail.com wrote:
 but I am wondering whether a message (or none?) is sent to the target vertex 
 (when the rank change is less than the tolerance) in the dynamic PageRank implementation below,

  def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
    if (edge.srcAttr._2 > tol) {
      Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
    } else {
      Iterator.empty
    }
  }

 so, in this case, is a message, even an empty one, still sent? or not?

No, in that case no message is sent, and if all in-edges of a particular vertex 
return Iterator.empty, then the vertex will become inactive in the next 
iteration.

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Sean Owen
From the caller / application perspective, you don't care what version
of Hadoop Spark is running on on the cluster. The Spark API you
compile against is the same. When you spark-submit the app, at
runtime, Spark is using the Hadoop libraries from the cluster, which
are the right version.

So when you build your app, you mark Spark as a 'provided' dependency.
Therefore in general, no, you do not build Spark for yourself if you
are a Spark app creator.

(Of course, your app would care if it were also using Hadoop libraries
directly. In that case, you will want to depend on hadoop-client, and
the right version for your cluster, but still mark it as provided.)

The version Spark is built against only matters when you are deploying
Spark's artifacts on the cluster to set it up.

Your error suggests there is still a version mismatch. Either you
deployed a build that was not compatible, or, maybe you are packaging
a version of Spark with your app which is incompatible and
interfering.

For example, the artifacts you get via Maven depend on Hadoop 1.0.4. I
suspect that's what you're doing -- packaging Spark(+Hadoop1.0.4) with
your app, when it shouldn't be packaged.

Spark works out of the box with just about any modern combo of HDFS and YARN.
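
For example, a minimal build.sbt sketch of that arrangement (versions here are illustrative and should match your cluster):

// build.sbt -- the app compiles against Spark but does not package it
name := "my-spark-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark comes from the cluster at runtime, so mark it "provided"
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  // Only needed if the app uses Hadoop APIs directly; match the cluster's
  // version and mark it "provided" as well
  "org.apache.hadoop" % "hadoop-client" % "2.3.0" % "provided"
)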

On Tue, Sep 16, 2014 at 2:28 AM, Paul Wais pw...@yelp.com wrote:
 Dear List,

 I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
 reading SequenceFiles.  In particular, I'm seeing:

 Exception in thread main org.apache.hadoop.ipc.RemoteException:
 Server IPC version 7 cannot communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
 ...

 When invoking JavaSparkContext#newAPIHadoopFile().  (With args
 validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
 BytesWritable.class, new Job().getConfiguration() -- Pretty close to
 the unit test here:
 https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
 )


 This error indicates to me that Spark is using an old hadoop client to
 do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
 via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.


 Do I need to explicitly build spark for modern hadoop??  I previously
 had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
 error (server is using version 9, client is using version 4).


 I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
 on spark's site:
  * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
  * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz


 What distro of hadoop is used at Data Bricks?  Are there distros of
 Spark 1.1 and hadoop that should work together out-of-the-box?
 (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)

 Thanks for any help anybody can give me here!
 -Paul

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to set executor num on spark on yarn

2014-09-16 Thread Sean Owen
How many cores do your machines have? --executor-cores should be the
number of cores each executor uses. Fewer cores means more executors
in general. From your data, it sounds like, for example, there are 7
nodes with 4+ cores available to YARN, and 2 more nodes with 2-3 cores
available. Hence when you ask for fewer cores per executor, more can
fit.

You may need to increase the number of cores you are letting YARN
manage if this doesn't match your expectation. For example, I'm
guessing your machines have more than 4 cores in reality.

On Tue, Sep 16, 2014 at 6:08 AM, hequn cheng chenghe...@gmail.com wrote:
 hi~I want to set the executor number to 16, but it is very strange that
 executor cores may affect executor num on spark on yarn, i don't know why
 and how to set executor number.
 =
 ./bin/spark-submit --class com.hequn.spark.SparkJoins \
 --master yarn-cluster \
 --num-executors 16 \
 --driver-memory 2g \
 --executor-memory 10g \
 --executor-cores 4 \
 /home/sparkjoins-1.0-SNAPSHOT.jar

 The UI shows there are 7 executors
 =
 ./bin/spark-submit --class com.hequn.spark.SparkJoins \
 --master yarn-cluster \
 --num-executors 16 \
 --driver-memory 2g \
 --executor-memory 10g \
 --executor-cores 2 \
 /home/sparkjoins-1.0-SNAPSHOT.jar

 The UI shows there are 9 executors
 =
 ./bin/spark-submit --class com.hequn.spark.SparkJoins \
 --master yarn-cluster \
 --num-executors 16 \
 --driver-memory 2g \
 --executor-memory 10g \
 --executor-cores 1 \
 /home/sparkjoins-1.0-SNAPSHOT.jar

 The UI shows there are 9 executors
 ==
 The cluster contains 16 nodes. Each node 64G RAM.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I have a standalone spark cluster and from within the same scala
application I'm creating 2 different spark contexts to run two different
spark streaming jobs, as the SparkConf is different for each of them.

I'm getting this error that... I don't really understand:

14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
 java.util.NoSuchElementException: spark.httpBroadcast.uri
   at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
   at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
   at scala.collection.AbstractMap.getOrElse(Map.scala:58)
   at org.apache.spark.SparkConf.get(SparkConf.scala:149)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
   at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
   at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
   at 
 org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
   at org.apache.spark.executor.Executor.init(Executor.scala:85)
   at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
 failed: Duplicate executor ID: 10


 Both apps have different names, so I don't get how the executor ID is not
unique :-S

This happens on startup and most of the time only one job dies, but
sometimes both of them die.

Regards,

Luis


Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
It seems that, as I have a single scala application, the scheduler is the
same and there is a collision between executors of both spark contexts. Is
there a way to change how the executor ID is generated (maybe a uuid
instead of a sequential number..?)

2014-09-16 13:07 GMT+01:00 Luis Ángel Vicente Sánchez 
langel.gro...@gmail.com:

 I have a standalone spark cluster and from within the same scala
 application I'm creating 2 different spark context to run two different
 spark streaming jobs as SparkConf is different for each of them.

 I'm getting this error that... I don't really understand:

 14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
 java.util.NoSuchElementException: spark.httpBroadcast.uri
  at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
  at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
  at org.apache.spark.SparkConf.get(SparkConf.scala:149)
  at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
  at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
  at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
  at 
 org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
  at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
  at org.apache.spark.executor.Executor.init(Executor.scala:85)
  at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
 failed: Duplicate executor ID: 10


  Both apps has different names so I don't get how the executor ID is not
 unique :-S

 This happens on startup and most of the time only one job dies, but
 sometimes both of them die.

 Regards,

 Luis



Re: PySpark on Yarn - how group by data properly

2014-09-16 Thread Oleg Ruchovets
I am expanding my data set and executing pyspark on yarn:
   I noticed that only 2 processes processed the data:

14210 yarn  20   0 2463m 2.0g 9708 R 100.0  4.3   8:22.63 python2.7
32467 yarn  20   0 2519m 2.1g 9720 R 99.3  4.4   7:16.97   python2.7


*Question:*
   *how to configure pyspark to have more processes to process the data?*


Here is my command :

  [hdfs@UCS-MASTER cad]$
/usr/lib/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563/bin/spark-submit
--master yarn  --num-executors 12  --driver-memory 4g --executor-memory 2g
--py-files tad.zip --executor-cores 4   /usr/lib/cad/PrepareDataSetYarn.py
/input/tad/data.csv /output/cad_model_500_1

I tried to play with num-executors and executor-cores but there are still only 2
python processes doing the job. I have a 5-machine cluster with 32 GB RAM.

console output:


14/09/16 20:07:34 INFO spark.SecurityManager: Changing view acls to: hdfs
14/09/16 20:07:34 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(hdfs)
14/09/16 20:07:34 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/09/16 20:07:35 INFO Remoting: Starting remoting
14/09/16 20:07:35 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@UCS-MASTER.sms1.local:39379]
14/09/16 20:07:35 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@UCS-MASTER.sms1.local:39379]
14/09/16 20:07:35 INFO spark.SparkEnv: Registering MapOutputTracker
14/09/16 20:07:35 INFO spark.SparkEnv: Registering BlockManagerMaster
14/09/16 20:07:35 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140916200735-53f6
14/09/16 20:07:35 INFO storage.MemoryStore: MemoryStore started with
capacity 2.3 GB.
14/09/16 20:07:35 INFO network.ConnectionManager: Bound socket to port
37255 with id = ConnectionManagerId(UCS-MASTER.sms1.local,37255)
14/09/16 20:07:35 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/09/16 20:07:35 INFO storage.BlockManagerInfo: Registering block manager
UCS-MASTER.sms1.local:37255 with 2.3 GB RAM
14/09/16 20:07:35 INFO storage.BlockManagerMaster: Registered BlockManager
14/09/16 20:07:35 INFO spark.HttpServer: Starting HTTP Server
14/09/16 20:07:35 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/16 20:07:35 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:55286
14/09/16 20:07:35 INFO broadcast.HttpBroadcast: Broadcast server started at
http://10.193.218.2:55286
14/09/16 20:07:35 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-ca8193be-9148-4e7e-a0cc-4b6e7cb72172
14/09/16 20:07:35 INFO spark.HttpServer: Starting HTTP Server
14/09/16 20:07:35 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/16 20:07:35 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:38065
14/09/16 20:07:35 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/16 20:07:35 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/09/16 20:07:35 INFO ui.SparkUI: Started SparkUI at
http://UCS-MASTER.sms1.local:4040
14/09/16 20:07:36 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
--args is deprecated. Use --arg instead.
14/09/16 20:07:36 INFO impl.TimelineClientImpl: Timeline service address:
http://UCS-NODE1.sms1.local:8188/ws/v1/timeline/
14/09/16 20:07:37 INFO client.RMProxy: Connecting to ResourceManager at
UCS-NODE1.sms1.local/10.193.218.3:8050
14/09/16 20:07:37 INFO yarn.Client: Got Cluster metric info from
ApplicationsManager (ASM), number of NodeManagers: 5
14/09/16 20:07:37 INFO yarn.Client: Queue info ... queueName: default,
queueCurrentCapacity: 0.0, queueMaxCapacity: 1.0,
  queueApplicationCount = 0, queueChildQueueCount = 0
14/09/16 20:07:37 INFO yarn.Client: Max mem capabililty of a single
resource in this cluster 53248
14/09/16 20:07:37 INFO yarn.Client: Preparing Local resources
14/09/16 20:07:37 WARN hdfs.BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
14/09/16 20:07:37 INFO yarn.Client: Uploading
file:/usr/lib/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563/lib/spark-assembly-1.0.1.2.1.3.0-563-hadoop2.4.0.2.1.3.0-563.jar
to
hdfs://UCS-MASTER.sms1.local:8020/user/hdfs/.sparkStaging/application_1409564765875_0046/spark-assembly-1.0.1.2.1.3.0-563-hadoop2.4.0.2.1.3.0-563.jar
14/09/16 20:07:39 INFO yarn.Client: Uploading
file:/usr/lib/cad/PrepareDataSetYarn.py to
hdfs://UCS-MASTER.sms1.local:8020/user/hdfs/.sparkStaging/application_1409564765875_0046/PrepareDataSetYarn.py
14/09/16 20:07:39 INFO yarn.Client: Uploading file:/usr/lib/cad/tad.zip to
hdfs://UCS-MASTER.sms1.local:8020/user/hdfs/.sparkStaging/application_1409564765875_0046/tad.zip
14/09/16 20:07:39 INFO yarn.Client: Setting up the launch environment
14/09/16 20:07:39 INFO yarn.Client: Setting up container launch context
14/09/16 20:07:39 INFO yarn.Client: Command for starting the Spark
ApplicationMaster: 

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
When I said scheduler I meant executor backend.

2014-09-16 13:26 GMT+01:00 Luis Ángel Vicente Sánchez 
langel.gro...@gmail.com:

 It seems that, as I have a single scala application, the scheduler is the
 same and there is a collision between executors of both spark context. Is
 there a way to change how the executor ID is generated (maybe an uuid
 instead of a sequential number..?)

 2014-09-16 13:07 GMT+01:00 Luis Ángel Vicente Sánchez 
 langel.gro...@gmail.com:

 I have a standalone spark cluster and from within the same scala
 application I'm creating 2 different spark context to run two different
 spark streaming jobs as SparkConf is different for each of them.

 I'm getting this error that... I don't really understand:

 14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
 java.util.NoSuchElementException: spark.httpBroadcast.uri
 at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
 at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
 at scala.collection.AbstractMap.getOrElse(Map.scala:58)
 at org.apache.spark.SparkConf.get(SparkConf.scala:149)
 at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
 at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
 at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
 at 
 org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
 at org.apache.spark.executor.Executor.init(Executor.scala:85)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
 failed: Duplicate executor ID: 10


  Both apps has different names so I don't get how the executor ID is not
 unique :-S

 This happens on startup and most of the time only one job dies, but
 sometimes both of them die.

 Regards,

 Luis





Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I dug a bit more and the executor ID is a number, so it seems there is no
possible workaround.

Looking at the code of the CoarseGrainedSchedulerBackend.scala:

https://github.com/apache/spark/blob/6324eb7b5b0ae005cb2e913e36b1508bd6f1b9b8/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L86

It seems there is only one DriverActor and, as the RegisterExecutor message
only contains the executorID, there is a collision.

I wonder if what I'm doing is wrong... basically, from the same scala
application I...

1. create one actor per job.
2. send a message to each actor with configuration details to create a
SparkContext/StreamingContext.
3. send a message to each actor to start the job and streaming context.
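
Roughly, a sketch of that setup (host names, ports and app names are illustrative; this reproduces the reported scenario rather than a recommended pattern):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoStreamingJobs {
  // Each job gets its own SparkConf and StreamingContext inside the same JVM,
  // which is the situation that shows the duplicate executor IDs.
  def startJob(appName: String): StreamingContext = {
    val conf = new SparkConf()
      .setMaster("spark://spark-master:7077")
      .setAppName(appName)
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print() // some output operation so the streaming job can start
    ssc.start()
    ssc
  }

  def main(args: Array[String]): Unit = {
    val jobA = startJob("streaming-job-a")
    val jobB = startJob("streaming-job-b")
    jobA.awaitTermination()
  }
}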

2014-09-16 13:29 GMT+01:00 Luis Ángel Vicente Sánchez 
langel.gro...@gmail.com:

 When I said scheduler I meant executor backend.

 2014-09-16 13:26 GMT+01:00 Luis Ángel Vicente Sánchez 
 langel.gro...@gmail.com:

 It seems that, as I have a single scala application, the scheduler is the
 same and there is a collision between executors of both spark context. Is
 there a way to change how the executor ID is generated (maybe an uuid
 instead of a sequential number..?)

 2014-09-16 13:07 GMT+01:00 Luis Ángel Vicente Sánchez 
 langel.gro...@gmail.com:

 I have a standalone spark cluster and from within the same scala
 application I'm creating 2 different spark context to run two different
 spark streaming jobs as SparkConf is different for each of them.

 I'm getting this error that... I don't really understand:

 14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
 java.util.NoSuchElementException: spark.httpBroadcast.uri
at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:58)
at org.apache.spark.SparkConf.get(SparkConf.scala:149)
at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at 
 org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.executor.Executor.init(Executor.scala:85)
at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
 failed: Duplicate executor ID: 10


  Both apps have different names, so I don't get how the executor ID is not
 unique :-S

 This happens on startup and most of the time only one job dies, but
 sometimes both of them die.

 Regards,

 Luis






Reduce Tuple2&lt;Integer, Integer&gt; to Tuple2&lt;Integer, List&lt;Integer&gt;&gt;

2014-09-16 Thread Tom
From my map function I create Tuple2<Integer, Integer> pairs. Now I want to
reduce them, and get something like Tuple2<Integer, List<Integer>>.

The only way I found to do this was by treating all variables as String, and
in the reduceByKey do
/return a._2 + "," + b._2/ //in which both are numeric values saved in a
String
After which I do an Arrays.asList(string.split(",")) in mapValues. This
leaves me with <String, List<Integer>>. So now I am looking for either
- A function with which I can transform <String, List<Integer>> to
<Integer, List<Integer>>
or
- A way to reduce Tuple2<Integer, Integer> into a Tuple2<Integer,
List<Integer>> in the reduceByKey function so that I can use Integers all
the way

Of course option two would have preference.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reduce-Tuple2-Integer-Integer-to-Tuple2-Integer-List-Integer-tp14361.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Reduce Tuple2&lt;Integer, Integer&gt; to Tuple2&lt;Integer, List&lt;Integer&gt;&gt;

2014-09-16 Thread Sean Owen
If you mean you have (key, value) pairs, and want pairs of (key, all values
for that key), then you're looking for groupByKey.
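A minimal sketch of that in Scala, assuming an existing SparkContext sc (e.g.
the spark-shell one); the Java API's JavaPairRDD has an equivalent groupByKey:

import org.apache.spark.SparkContext._   // PairRDDFunctions implicits on 1.x

val pairs = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))
val grouped = pairs.groupByKey()                        // RDD[(Int, Iterable[Int])]
grouped.mapValues(_.toList).collect().foreach(println)
// (1,List(10, 20))
// (2,List(30))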

On Tue, Sep 16, 2014 at 2:42 PM, Tom thubregt...@gmail.com wrote:
 From my map function I create Tuple2<Integer, Integer> pairs. Now I want to
 reduce them, and get something like Tuple2<Integer, List<Integer>>.

 The only way I found to do this was by treating all variables as String, and
 in the reduceByKey do
 /return a._2 + "," + b._2/ //in which both are numeric values saved in a
 String
 After which I do an Arrays.asList(string.split(",")) in mapValues. This
 leaves me with <String, List<Integer>>. So now I am looking for either
 - A function with which I can transform <String, List<Integer>> to
 <Integer, List<Integer>>
 or
 - A way to reduce Tuple2<Integer, Integer> into a Tuple2<Integer,
 List<Integer>> in the reduceByKey function so that I can use Integers all
 the way

 Of course option two would have preference.

 Thanks!



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Reduce-Tuple2-Integer-Integer-to-Tuple2-Integer-List-Integer-tp14361.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Serving data

2014-09-16 Thread Yana Kadiyska
If your dashboard is doing ajax/pull requests against say a REST API you
can always create a Spark context in your rest service and use SparkSQL to
query over the parquet files. The parquet files are already on disk so it
seems silly to write both to parquet and to a DB...unless I'm missing
something in your setup.
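A minimal sketch of that approach with Spark 1.1-era SparkSQL (the parquet
path, table name and query below are illustrative, not from the original
message):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Create one long-lived context inside the REST service and reuse it per request.
val sc = new SparkContext(new SparkConf().setAppName("dashboard-backend"))
val sqlContext = new SQLContext(sc)

val metrics = sqlContext.parquetFile("hdfs:///data/metrics.parquet")  // illustrative path
metrics.registerTempTable("metrics")
val rows = sqlContext.sql("SELECT key, SUM(value) FROM metrics GROUP BY key").collect()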

On Tue, Sep 16, 2014 at 4:18 AM, Marius Soutier mps@gmail.com wrote:

 Writing to Parquet and querying the result via SparkSQL works great
 (except for some strange SQL parser errors). However the problem remains,
 how do I get that data back to a dashboard. So I guess I’ll have to use a
 database after all.


  You can batch up data & store it into parquet partitions as well, & query
  it using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1 I
  believe.




java.util.NoSuchElementException: key not found

2014-09-16 Thread Brad Miller
Hi All,

I suspect I am experiencing a bug. I've noticed that while running
larger jobs, they occasionally die with the exception
java.util.NoSuchElementException: key not found xyz, where xyz
denotes the ID of some particular task.  I've excerpted the log from
one job that died in this way below and attached the full log for
reference.

I suspect that my bug is the same as SPARK-2002 (linked below).  Is
there any reason to suspect otherwise?  Is there any known workaround
other than not coalescing?
https://issues.apache.org/jira/browse/SPARK-2002
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCAMwrk0=d1dww5fdbtpkefwokyozltosbbjqamsqqjowlzng...@mail.gmail.com%3E

Note that I have been coalescing SchemaRDDs using srdd =
SchemaRDD(srdd._jschema_rdd.coalesce(partitions, False, None),
sqlCtx), the workaround described in this thread.
http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3ccanr-kkciei17m43-yz5z-pj00zwpw3ka_u7zhve2y7ejw1v...@mail.gmail.com%3E

...
14/09/15 21:43:14 INFO scheduler.TaskSetManager: Starting task 78.0 in
stage 551.0 (TID 78738, bennett.research.intel-research.net,
PROCESS_LOCAL, 1056 bytes)
...
14/09/15 21:43:15 INFO storage.BlockManagerInfo: Added
taskresult_78738 in memory on
bennett.research.intel-research.net:38074 (size: 13.0 MB, free: 1560.8
MB)
...
14/09/15 21:43:15 ERROR scheduler.TaskResultGetter: Exception while
getting task result
java.util.NoSuchElementException: key not found: 78738
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.scheduler.TaskSetManager.handleTaskGettingResult(TaskSetManager.scala:500)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.handleTaskGettingResult(TaskSchedulerImpl.scala:348)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:52)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)


I am running the pre-compiled 1.1.0 binaries.

best,
-Brad

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist)

2014-09-16 Thread Hui Li
Hi,

I am new to SPARK. I just set up a small cluster and wanted to run some
simple MLLIB examples. By following the instructions of
https://spark.apache.org/docs/0.9.0/mllib-guide.html#binary-classification-1,
I could successfully run everything until the step of SVMWithSGD, where I got
the following error message. I don't know why
file:/root/test/sample_svm_data.txt does not exist, since I already read it
out, printed it, converted it into labeled data, and passed the parsed
data to the function SVMWithSGD.

Any one have the same issue with me?

Thanks,

Emily

 val model = SVMWithSGD.train(parsedData, numIterations)
14/09/16 10:55:21 INFO SparkContext: Starting job: first at
GeneralizedLinearAlgorithm.scala:121
14/09/16 10:55:21 INFO DAGScheduler: Got job 11 (first at
GeneralizedLinearAlgorithm.scala:121) with 1 output partitions
(allowLocal=true)
14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 11 (first at
GeneralizedLinearAlgorithm.scala:121)
14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
14/09/16 10:55:21 INFO DAGScheduler: Computing the requested partition
locally
14/09/16 10:55:21 INFO HadoopRDD: Input split:
file:/root/test/sample_svm_data.txt:0+19737
14/09/16 10:55:21 INFO SparkContext: Job finished: first at
GeneralizedLinearAlgorithm.scala:121, took 0.002697478 s
14/09/16 10:55:21 INFO SparkContext: Starting job: count at
DataValidators.scala:37
14/09/16 10:55:21 INFO DAGScheduler: Got job 12 (count at
DataValidators.scala:37) with 2 output partitions (allowLocal=false)
14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 12 (count at
DataValidators.scala:37)
14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
14/09/16 10:55:21 INFO DAGScheduler: Submitting Stage 12 (FilteredRDD[26]
at filter at DataValidators.scala:37), which has no missing parents
14/09/16 10:55:21 INFO DAGScheduler: Submitting 2 missing tasks from Stage
12 (FilteredRDD[26] at filter at DataValidators.scala:37)
14/09/16 10:55:21 INFO TaskSchedulerImpl: Adding task set 12.0 with 2 tasks
14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:0 as TID 24 on
executor 2: eecvm0206.demo.sas.com (PROCESS_LOCAL)
14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:0 as 1733 bytes
in 0 ms
14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:1 as TID 25 on
executor 5: eecvm0203.demo.sas.com (PROCESS_LOCAL)
14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:1 as 1733 bytes
in 0 ms
14/09/16 10:55:21 WARN TaskSetManager: Lost TID 24 (task 12.0:0)
14/09/16 10:55:21 WARN TaskSetManager: Loss was due to
java.io.FileNotFoundException
java.io.FileNotFoundException: File file:/root/test/sample_svm_data.txt
does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:137)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at
org.apache.hadoop.mapred.LineRecordReader.init(LineRecordReader.java:108)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at
org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at

Spark as a Library

2014-09-16 Thread Ruebenacker, Oliver A

 Hello,

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 Best, Oliver

Oliver Ruebenacker | Solutions Architect

Altisource(tm)
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617) 728-5582 | ext: 275585
oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com | 
www.Altisource.com

***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***


collect on hadoopFile RDD returns wrong results

2014-09-16 Thread vasiliy
Hello. I have a hadoopFile RDD and I tried to collect items to the driver
program, but it returns me an array of identical records (equal to the last
record of my file). My code is like this:

val rdd = sc.hadoopFile(
  "hdfs:///data.avro",
  classOf[org.apache.avro.mapred.AvroInputFormat[MyAvroRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[MyAvroRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  10)


val collectedData = rdd.collect()

for (s <- collectedData) {
  println(s)
}

it prints wrong data. But rdd.foreach(println) works as expected. 

What is wrong with my code, and how can I collect the hadoop RDD files (actually
I want to collect parts of it) to the driver program?
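For what it's worth, a common cause of this symptom is that Hadoop input
formats reuse a single wrapper object per record, so collecting the raw
AvroWrapper instances can yield many references to one mutated object. A
hedged sketch of copying a plain value out of each record before collecting,
reusing the rdd defined above (the copy step is the assumption here):

// Map each record to a plain value (a new object) before collecting:
val collectedStrings = rdd
  .map { case (wrapper, _) => wrapper.datum().toString }  // copy out of the reused wrapper
  .collect()

collectedStrings.foreach(println)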




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/collect-on-hadoopFile-RDD-returns-wrong-results-tp14368.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



HBase and non-existent TableInputFormat

2014-09-16 Thread Y. Dong
Hello,

I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read 
from hbase. The Java code is attached. However the problem is TableInputFormat 
does not even exist in hbase-client API, is there any other way I can read from
hbase? Thanks

SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);

Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");

JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);





Re: combineByKey throws ClassCastException

2014-09-16 Thread Tao Xiao
This problem was caused by the fact that I used a package jar with a Spark
version (0.9.1) different from that of the cluster (0.9.0). When I used the
correct package jar
(spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar) instead the
application can run as expected.



2014-09-15 14:57 GMT+08:00 x wasedax...@gmail.com:

 How about this.

 scala> val rdd2 = rdd.combineByKey(
      |   (v: Int) => v.toLong,
      |   (c: Long, v: Int) => c + v,
      |   (c1: Long, c2: Long) => c1 + c2)
 rdd2: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[9] at
 combineByKey at <console>:14

 xj @ Tokyo

 On Mon, Sep 15, 2014 at 3:06 PM, Tao Xiao xiaotao.cs@gmail.com
 wrote:

 I followed an example presented in the tutorial Learning Spark
 http://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
 to compute the per-key average as follows:


 val Array(appName) = args
 val sparkConf = new SparkConf()
   .setAppName(appName)
 val sc = new SparkContext(sparkConf)
 /*
  * compute the per-key average of values
  * results should be:
  *   A : 5.8
  *   B : 14
  *   C : 60.6
  */
 val rdd = sc.parallelize(List(
   ("A", 3), ("A", 9), ("A", 12), ("A", 0), ("A", 5),
   ("B", 4), ("B", 10), ("B", 11), ("B", 20), ("B", 25),
   ("C", 32), ("C", 91), ("C", 122), ("C", 3), ("C", 55)), 2)
 val avg = rdd.combineByKey(
   (x: Int) => (x, 1),  // java.lang.ClassCastException: scala.Tuple2$mcII$sp
                        // cannot be cast to java.lang.Integer
   (acc: (Int, Int), x) => (acc._1 + x, acc._2 + 1),
   (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
   .map { case (s, t) => (s, t._1 / t._2.toFloat) }
 avg.collect.foreach(t => println(t._1 + " - " + t._2))



 When I submitted the application, an exception of 
 *java.lang.ClassCastException:
 scala.Tuple2$mcII$sp cannot be cast to java.lang.Integer* was thrown
 out. The tutorial said that the first function of *combineByKey*, *(x: Int)
 => (x, 1)*, should take a single element in the source RDD and return an
 element of the desired type in the resulting RDD. In my application, we
 take a single element of type *Int *from the source RDD and return a
 tuple of type (*Int*, *Int*), which meets the requirements quite well.
 But why would such an exception be thrown?

 I'm using CDH 5.0 and Spark 0.9

 Thanks.






Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Ted Yu
bq. TableInputFormat does not even exist in hbase-client API

It is in hbase-server module.

Take a look at http://hbase.apache.org/book.html#mapreduce.example.read

On Tue, Sep 16, 2014 at 8:18 AM, Y. Dong tq00...@gmail.com wrote:

 Hello,

 I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply
 read from hbase. The Java code is attached. However the problem is
 TableInputFormat does not even exist in hbase-client API, is there any
 other way I can read from
 hbase? Thanks

 SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
 JavaSparkContext sc = new JavaSparkContext(sconf);


 Configuration conf = HBaseConfiguration.create();
 conf.set(TableInputFormat.INPUT_TABLE, "Article");


 JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
 sc.newAPIHadoopRDD(conf,
 TableInputFormat.class,
 org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
 org.apache.hadoop.hbase.client.Result.class);






Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Ted Yu
hbase-client module serves client facing APIs.
hbase-server module is supposed to host classes used on server side.

There is still some work to be done so that the above goal is achieved.

On Tue, Sep 16, 2014 at 9:06 AM, Y. Dong tq00...@gmail.com wrote:

 Thanks Ted. It is indeed in hbase-server. Just curious, what’s the
 difference between hbase-client and hbase-server?

 On 16 Sep 2014, at 17:01, Ted Yu yuzhih...@gmail.com wrote:

 bq. TableInputFormat does not even exist in hbase-client API

 It is in hbase-server module.

 Take a look at http://hbase.apache.org/book.html#mapreduce.example.read

 On Tue, Sep 16, 2014 at 8:18 AM, Y. Dong tq00...@gmail.com wrote:

 Hello,

 I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply
 read from hbase. The Java code is attached. However the problem is
 TableInputFormat does not even exist in hbase-client API, is there any
 other way I can read from
 hbase? Thanks

 SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
 JavaSparkContext sc = new JavaSparkContext(sconf);

 Configuration conf = HBaseConfiguration.create();
 conf.set(TableInputFormat.INPUT_TABLE, "Article");

 JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
 sc.newAPIHadoopRDD(conf,
 TableInputFormat.class,
 org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
 org.apache.hadoop.hbase.client.Result.class);








Re: scala 2.11?

2014-09-16 Thread Mohit Jaggi
Can I load that plugin in spark-shell? Or perhaps due to the 2-phase
compilation quasiquotes won't work in the shell?

On Mon, Sep 15, 2014 at 7:15 PM, Mark Hamstra m...@clearstorydata.com
wrote:

 Okay, that's consistent with what I was expecting.  Thanks, Matei.

 On Mon, Sep 15, 2014 at 5:20 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 I think the current plan is to put it in 1.2.0, so that's what I meant by
 soon. It might be possible to backport it too, but I'd be hesitant to do
 that as a maintenance release on 1.1.x and 1.0.x since it would require
 nontrivial changes to the build that could break things on Scala 2.10.

 Matei

 On September 15, 2014 at 12:19:04 PM, Mark Hamstra (
 m...@clearstorydata.com) wrote:

  Are we going to put 2.11 support into 1.1 or 1.0?  Else "will be in
  soon" applies to the master development branch, but actually being in the Spark
  1.2.0 release won't occur until the second half of November at the earliest.

 On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  Scala 2.11 work is under way in open pull requests though, so
 hopefully it will be in soon.

  Matei

 On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com)
 wrote:

  ah...thanks!

 On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com
 wrote:

 No, not yet.  Spark SQL is using org.scalamacros:quasiquotes_2.10.

 On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi mohitja...@gmail.com
 wrote:

 Folks,
 I understand Spark SQL uses quasiquotes. Does that mean Spark has now
 moved to Scala 2.11?

 Mohit.








RE: HBase and non-existent TableInputFormat

2014-09-16 Thread abraham.jacob
Hi,

I had a similar situation in which I needed to read data from HBase and work 
with the data inside of a spark context. After much googling, I finally got 
mine to work. There are a bunch of steps that you need to do to get this working -

The problem is that the spark context does not know anything about hbase, so 
you have to provide all the information about hbase classes to both the driver 
code and executor code...


SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);

sconf.set("spark.executor.extraClassPath", "$(hbase classpath)");
// <=== you will need to add this to tell the executor about the classpath
// for HBase.


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);


Then when you submit the spark job -


spark-submit --driver-class-path $(hbase classpath) --jars 
/usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar
 --class YourClassName --master local App.jar


Try this and see if it works for you.


From: Y. Dong [mailto:tq00...@gmail.com]
Sent: Tuesday, September 16, 2014 8:18 AM
To: user@spark.apache.org
Subject: HBase and non-existent TableInputFormat

Hello,

I'm currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read 
from hbase. The Java code is attached. However the problem is TableInputFormat 
does not even exist in hbase-client API, is there any other way I can read from
hbase? Thanks

SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);





Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Nicholas Chammas
Btw, there are some examples in the Spark GitHub repo that you may find
helpful. Here's one
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
related to HBase.

On Tue, Sep 16, 2014 at 1:22 PM, abraham.ja...@thomsonreuters.com wrote:

  *Hi, *



  *I had a similar situation in which I needed to read data from HBase and
  work with the data inside of a spark context. After much googling, I
  finally got mine to work. There are a bunch of steps that you need to do
  to get this working – *



 *The problem is that the spark context does not know anything about hbase,
 so you have to provide all the information about hbase classes to both the
 driver code and executor code…*





 SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");

 JavaSparkContext sc = new JavaSparkContext(sconf);


 sconf.set("spark.executor.extraClassPath", "$(hbase classpath)");
 // <=== you will need to add this to tell the executor about the classpath for
 // HBase.


 Configuration conf = HBaseConfiguration.create();

 conf.set(TableInputFormat.INPUT_TABLE, "Article");


 JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
 TableInputFormat.class,
 org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
 org.apache.hadoop.hbase.client.Result.class);





 *Then when you submit the spark job – *





 *spark-submit --driver-class-path $(hbase classpath) --jars
 /usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar
 --class YourClassName --master local App.jar *





 Try this and see if it works for you.





 *From:* Y. Dong [mailto:tq00...@gmail.com]
 *Sent:* Tuesday, September 16, 2014 8:18 AM
 *To:* user@spark.apache.org
 *Subject:* HBase and non-existent TableInputFormat



 Hello,



 I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply
 read from hbase. The Java code is attached. However the problem is
 TableInputFormat does not even exist in hbase-client API, is there any
 other way I can read from

 hbase? Thanks



 SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");

 JavaSparkContext sc = new JavaSparkContext(sconf);


 Configuration conf = HBaseConfiguration.create();

 conf.set(TableInputFormat.INPUT_TABLE, "Article");


 JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
 TableInputFormat.class,
 org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
 org.apache.hadoop.hbase.client.Result.class);









Re: Spark as a Library

2014-09-16 Thread Matei Zaharia
If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.

Matei

On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.com) wrote:

 

 Hello,

 

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 

 Best, Oliver

 

Oliver Ruebenacker | Solutions Architect

 

Altisource™

290 Congress St, 7th Floor | Boston, Massachusetts 02210

P: (617) 728-5582 | ext: 275585

oliver.ruebenac...@altisource.com | www.Altisource.com

 

***
This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***

Re: Spark as a Library

2014-09-16 Thread Soumya Simanta
It depends on what you want to do with Spark. The following has worked for
me.
Let the container handle the HTTP request and then talk to Spark using
another HTTP/REST interface. You can use the Spark Job Server for this.

Embedding Spark inside the container is not a great long term solution IMO
because you may see issues when you want to connect with a Spark cluster.



On Tue, Sep 16, 2014 at 11:16 AM, Ruebenacker, Oliver A 
oliver.ruebenac...@altisource.com wrote:



  Hello,



   Suppose I want to use Spark from an application that I already submit to
 run in another container (e.g. Tomcat). Is this at all possible? Or do I
 have to split the app into two components, and submit one to Spark and one
 to the other container? In that case, what is the preferred way for the two
 components to communicate with each other? Thanks!



  Best, Oliver



 Oliver Ruebenacker | Solutions Architect



 Altisource™

 290 Congress St, 7th Floor | Boston, Massachusetts 02210

 P: (617) 728-5582 | ext: 275585

 oliver.ruebenac...@altisource.com | www.Altisource.com




 ***

 This email message and any attachments are intended solely for the use of
 the addressee. If you are not the intended recipient, you are prohibited
 from reading, disclosing, reproducing, distributing, disseminating or
 otherwise using this transmission. If you have received this message in
 error, please promptly notify the sender by reply email and immediately
 delete this message from your system.
 This message and any attachments may contain information that is
 confidential, privileged or exempt from disclosure. Delivery of this
 message to any person other than the intended recipient is not intended to
 waive any right or privilege. Message transmission is not guaranteed to be
 secure or free of software viruses.

 ***



RE: HBase and non-existent TableInputFormat

2014-09-16 Thread abraham.jacob
Yes that was very helpful… ☺

Here are a few more I found on my quest to get HBase working with Spark –

This one details about Hbase dependencies and spark classpaths

http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html

This one has a code overview –

http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase

All of them were very helpful…



From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
Sent: Tuesday, September 16, 2014 10:30 AM
To: Jacob, Abraham (FinancialRisk)
Cc: tq00...@gmail.com; user
Subject: Re: HBase and non-existent TableInputFormat

Btw, there are some examples in the Spark GitHub repo that you may find 
helpful. Here's 
onehttps://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
 related to HBase.

On Tue, Sep 16, 2014 at 1:22 PM, 
abraham.ja...@thomsonreuters.commailto:abraham.ja...@thomsonreuters.com 
wrote:
Hi,

I had a similar situation in which I needed to read data from HBase and work 
with the data inside of a spark context. After much googling, I finally got 
mine to work. There are a bunch of steps that you need to do to get this working –

The problem is that the spark context does not know anything about hbase, so 
you have to provide all the information about hbase classes to both the driver 
code and executor code…


SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);

sconf.set("spark.executor.extraClassPath", "$(hbase classpath)");
// <=== you will need to add this to tell the executor about the classpath
// for HBase.


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);


Then when you submit the spark job –


spark-submit --driver-class-path $(hbase classpath) --jars 
/usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar
 --class YourClassName --master local App.jar


Try this and see if it works for you.


From: Y. Dong [mailto:tq00...@gmail.commailto:tq00...@gmail.com]
Sent: Tuesday, September 16, 2014 8:18 AM
To: user@spark.apache.orgmailto:user@spark.apache.org
Subject: HBase and non-existent TableInputFormat

Hello,

I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read 
from hbase. The Java code is attached. However the problem is TableInputFormat 
does not even exist in hbase-client API, is there any other way I can read from
hbase? Thanks

SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);






Re: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?

On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao hao.ch...@intel.com wrote:

 Thank you for pasting the steps, I will look at this, hopefully come out
 with a solution soon.

 -Original Message-
 From: linkpatrickliu [mailto:linkpatrick...@live.com]
 Sent: Tuesday, September 16, 2014 3:17 PM
 To: u...@spark.incubator.apache.org
 Subject: RE: SparkSQL 1.1 hang when DROP or LOAD

 Hi, Hao Cheng.

 I have done other tests. And the result shows the thriftServer can connect
 to Zookeeper.

 However, I found some more interesting things. And I think I have found a
 bug!

 Test procedure:
 Test1:
 (0) Use beeline to connect to thriftServer.
 (1) Switch database: "use dw_op1;" (OK)
 The logs show that the thriftServer connected with Zookeeper and acquired
 locks.

 (2) Drop table: "drop table src;" (Blocked) The logs show that the
 thriftServer is in acquireReadWriteLocks.

 Doubt:
 The reason why I cannot drop table src is that the first SQL ("use dw_op1;")
 left locks in Zookeeper that were not successfully released.
 So when the second SQL is acquiring locks in Zookeeper, it will block.

 Test2:
 Restart thriftServer.
 Instead of switching to another database, I just drop the table in the
 default database:
 (0) Restart thriftServer & use beeline to connect to thriftServer.
 (1) Drop table: "drop table src;" (OK)
 Amazing! Succeed!
 (2) Drop again! "drop table src2;" (Blocked) Same error: the thriftServer
 is blocked in the acquireReadWriteLocks
 phase.

 As you can see,
 only the first SQL requiring locks can succeed.
 So I think the reason is that the thriftServer cannot release locks
 correctly in Zookeeper.









 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Yin Huai
I meant it may be a Hive bug since we also call Hive's drop table
internally.

On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai huaiyin@gmail.com wrote:

 Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?

 On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao hao.ch...@intel.com wrote:

 Thank you for pasting the steps, I will look at this, hopefully come out
 with a solution soon.

 -Original Message-
 From: linkpatrickliu [mailto:linkpatrick...@live.com]
 Sent: Tuesday, September 16, 2014 3:17 PM
 To: u...@spark.incubator.apache.org
 Subject: RE: SparkSQL 1.1 hang when DROP or LOAD

 Hi, Hao Cheng.

 I have done other tests. And the result shows the thriftServer can
 connect to Zookeeper.

 However, I found some more interesting things. And I think I have found a
 bug!

 Test procedure:
 Test1:
 (0) Use beeline to connect to thriftServer.
 (1) Switch database: "use dw_op1;" (OK)
 The logs show that the thriftServer connected with Zookeeper and acquired
 locks.

 (2) Drop table: "drop table src;" (Blocked) The logs show that the
 thriftServer is in acquireReadWriteLocks.

 Doubt:
 The reason why I cannot drop table src is that the first SQL ("use dw_op1;")
 left locks in Zookeeper that were not successfully released.
 So when the second SQL is acquiring locks in Zookeeper, it will block.

 Test2:
 Restart thriftServer.
 Instead of switching to another database, I just drop the table in the
 default database:
 (0) Restart thriftServer & use beeline to connect to thriftServer.
 (1) Drop table: "drop table src;" (OK)
 Amazing! Succeed!
 (2) Drop again! "drop table src2;" (Blocked) Same error: the thriftServer
 is blocked in the acquireReadWriteLocks
 phase.

 As you can see,
 only the first SQL requiring locks can succeed.
 So I think the reason is that the thriftServer cannot release locks
 correctly in Zookeeper.









 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





RDD projection and sorting

2014-09-16 Thread Sameer Tilak

Hi All,

I have data in the following format: the 1st column is a userid and the
columns from the second onward are class ids for various products. I want to
save this in libsvm format, and an intermediate step is to sort the class ids
in ascending order. For example:

I/P:
uid1   1243 3580 2670 122 2593 1782 1526
uid2   121 285 2447 1516 343 385 1200 912 1430 5824 1451 8931 1271 1088 2584 1664 5481

Desired O/P:
uid1   122 1243 1526 1782 2593 2670 3580
uid2   121 285 343 385 912 1088 1200 1271 1430 1451 1516 1664 2447 2584 5481 5824 8931

Can someone please point me in the right direction? If I load the data with
val data = sc.textFile(...), how do I project column 1 to the end (not
including column 0) and then sort these projected columns?
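A minimal sketch of one way to do the projection and sort, assuming
whitespace-separated lines as in the example above and an existing
SparkContext sc (the paths are illustrative):

val data = sc.textFile("hdfs:///input/classids.txt")
val sortedLines = data.map { line =>
  val fields = line.split("\\s+")
  val uid = fields.head                          // column 0: the user id
  val classIds = fields.tail.map(_.toInt).sorted // columns 1..n, sorted ascending
  uid + " " + classIds.mkString(" ")
}
sortedLines.saveAsTextFile("hdfs:///output/sorted-classids")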
  

RE: Spark as a Library

2014-09-16 Thread Ruebenacker, Oliver A

 Hello,

  Thanks for the response and great to hear it is possible. But how do I 
connect to Spark without using the submit script?

  I know how to start up a master and some workers and then connect to the 
master by packaging the app that contains the SparkContext and then submitting 
the package with the spark-submit script in standalone-mode. But I don’t want 
to submit the app that contains the SparkContext via the script, because I want 
that app to be running on a web server. So, what are other ways to connect to 
Spark? I can’t find in the docs anything other than using the script. Thanks!

 Best, Oliver

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Tuesday, September 16, 2014 1:31 PM
To: Ruebenacker, Oliver A; user@spark.apache.org
Subject: Re: Spark as a Library

If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.

Matei


On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com) 
wrote:

 Hello,

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 Best, Oliver

Oliver Ruebenacker | Solutions Architect

Altisource™
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617) 728-5582 | ext: 275585
oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com | 
www.Altisource.com

***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***
***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***


Re: NullWritable not serializable

2014-09-16 Thread Du Li
Hi,

The test case is separated out as follows. The call to rdd2.first() breaks when 
spark version is changed to 1.1.0, reporting exception NullWritable not 
serializable. However, the same test passed with spark 1.0.2. The pom.xml file 
is attached. The test data README.md was copied from spark.

Thanks,
Du
-

package com.company.project.test

import org.scalatest._

class WritableTestSuite extends FunSuite {
  test("generated sequence file should be readable from spark") {
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.spark.{SparkContext, SparkConf}
    import org.apache.spark.SparkContext._

    val conf = new SparkConf(false).setMaster("local").setAppName("test data exchange with spark")
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("README.md")
    val res = rdd.map(x => (NullWritable.get(), new Text(x)))
    res.saveAsSequenceFile("./test_data")

    val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])

assert(rdd.first == rdd2.first._2.toString)
  }
}



From: Matei Zaharia matei.zaha...@gmail.commailto:matei.zaha...@gmail.com
Date: Monday, September 15, 2014 at 10:52 PM
To: Du Li l...@yahoo-inc.commailto:l...@yahoo-inc.com
Cc: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org, 
d...@spark.apache.orgmailto:d...@spark.apache.org 
d...@spark.apache.orgmailto:d...@spark.apache.org
Subject: Re: NullWritable not serializable

Can you post the exact code for the test that worked in 1.0? I can't think of 
much that could've changed. The one possibility is if  we had some operations 
that were computed locally on the driver (this happens with things like first() 
and take(), which will try to do the first partition locally). But generally 
speaking these operations should *not* work over a network, so you'll have to 
make sure that you only send serializable types through shuffles or collects, 
or use a serialization framework like Kryo that might be okay with Writables.

Matei


On September 15, 2014 at 9:13:13 PM, Du Li 
(l...@yahoo-inc.commailto:l...@yahoo-inc.com) wrote:

Hi Matei,

Thanks for your reply.

The Writable classes have never been serializable and this is why it is weird. 
I did try as you suggested to map the Writables to integers and strings. It 
didn’t pass, either. Similar exceptions were thrown except that the messages 
became IntWritable, Text are not serializable. The reason is in the implicits 
defined in the SparkContext object that convert those values into their 
corresponding Writable classes before saving the data in sequence file.

My original code was actual some test cases to try out SequenceFile related 
APIs. The tests all passed when the spark version was specified as 1.0.2. But 
this one failed after I changed the spark version to 1.1.0 the new release, 
nothing else changed. In addition, it failed when I called rdd2.collect(), 
take(1), and first(). But it worked fine when calling rdd2.count(). As you can 
see, count() does not need to serialize and ship data while the other three 
methods do.

Do you recall any difference between spark 1.0 and 1.1 that might cause this 
problem?

Thanks,
Du


From: Matei Zaharia matei.zaha...@gmail.commailto:matei.zaha...@gmail.com
Date: Friday, September 12, 2014 at 9:10 PM
To: Du Li l...@yahoo-inc.com.invalidmailto:l...@yahoo-inc.com.invalid, 
user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org, 
d...@spark.apache.orgmailto:d...@spark.apache.org 
d...@spark.apache.orgmailto:d...@spark.apache.org
Subject: Re: NullWritable not serializable

Hi Du,

I don't think NullWritable has ever been serializable, so you must be doing 
something differently from your previous program. In this case though, just use 
a map() to turn your Writables to serializable types (e.g. null and String).
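A minimal sketch of that map() approach, assuming an existing SparkContext sc
and the sequence file written earlier in this thread:

import org.apache.hadoop.io.{NullWritable, Text}

val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
// Convert the Writables to plain serializable types before collect()/first()/take():
val strings = rdd2.map { case (_, text) => text.toString }
strings.collect().foreach(println)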

Matei


On September 12, 2014 at 8:48:36 PM, Du Li 
(l...@yahoo-inc.com.invalidmailto:l...@yahoo-inc.com.invalid) wrote:

Hi,

I was trying the following on spark-shell (built with apache master and hadoop 
2.4.0). Both calling rdd2.collect and calling rdd3.collect threw 
java.io.NotSerializableException: org.apache.hadoop.io.NullWritable.

I got the same problem in similar code of my app which uses the newly released 
Spark 1.1.0 under hadoop 2.4.0. Previously it worked fine with spark 1.0.2 
under either hadoop 2.40 and 0.23.10.

Anybody knows what caused the problem?

Thanks,
Du


import org.apache.hadoop.io.{NullWritable, Text}
val rdd = sc.textFile("README.md")
val res = rdd.map(x => (NullWritable.get(), new Text(x)))
res.saveAsSequenceFile("./test_data")
val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
rdd2.collect
val rdd3 = sc.sequenceFile[NullWritable, Text]("./test_data")
rdd3.collect




pom.xml
Description: pom.xml

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark as a Library

2014-09-16 Thread Daniel Siegmann
You can create a new SparkContext inside your container pointed to your
master. However, for your script to run you must call addJars to put the
code on your workers' classpaths (except when running locally).

Hopefully your webapp has some lib folder which you can point to as a
source for the jars. In the Play Framework you can use
play.api.Play.application.getFile(lib) to get a path to the lib directory
and get the contents. Of course that only works on the packaged web app.
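A minimal sketch of that, assuming a standalone cluster and a jar packaged
under the webapp's lib directory (the master URL and jar path are
placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("webapp-embedded-spark")
  .setMaster("spark://master-host:7077")      // the cluster outside the container
  .setJars(Seq("lib/my-spark-jobs.jar"))      // ships the job classes to the workers
val sc = new SparkContext(conf)

// Any closures used below must resolve against the jars shipped above.
val count = sc.parallelize(1 to 1000).map(_ * 2).count()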

On Tue, Sep 16, 2014 at 3:17 PM, Ruebenacker, Oliver A 
oliver.ruebenac...@altisource.com wrote:



  Hello,



   Thanks for the response and great to hear it is possible. But how do I
 connect to Spark without using the submit script?



   I know how to start up a master and some workers and then connect to the
 master by packaging the app that contains the SparkContext and then
 submitting the package with the spark-submit script in standalone-mode. But
 I don’t want to submit the app that contains the SparkContext via the
 script, because I want that app to be running on a web server. So, what are
 other ways to connect to Spark? I can’t find in the docs anything other
 than using the script. Thanks!



  Best, Oliver



 *From:* Matei Zaharia [mailto:matei.zaha...@gmail.com]
 *Sent:* Tuesday, September 16, 2014 1:31 PM
 *To:* Ruebenacker, Oliver A; user@spark.apache.org
 *Subject:* Re: Spark as a Library



 If you want to run the computation on just one machine (using Spark's
 local mode), it can probably run in a container. Otherwise you can create a
 SparkContext there and connect it to a cluster outside. Note that I haven't
 tried this though, so the security policies of the container might be too
 restrictive. In that case you'd have to run the app outside and expose an
 RPC interface between them.



 Matei



 On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A (
 oliver.ruebenac...@altisource.com) wrote:



  Hello,



   Suppose I want to use Spark from an application that I already submit to
 run in another container (e.g. Tomcat). Is this at all possible? Or do I
 have to split the app into two components, and submit one to Spark and one
 to the other container? In that case, what is the preferred way for the two
 components to communicate with each other? Thanks!



  Best, Oliver



 Oliver Ruebenacker | Solutions Architect



 Altisource™

 290 Congress St, 7th Floor | Boston, Massachusetts 02210

 P: (617) 728-5582 | ext: 275585

 oliver.ruebenac...@altisource.com | www.Altisource.com



 ***


 This email message and any attachments are intended solely for the use of
 the addressee. If you are not the intended recipient, you are prohibited
 from reading, disclosing, reproducing, distributing, disseminating or
 otherwise using this transmission. If you have received this message in
 error, please promptly notify the sender by reply email and immediately
 delete this message from your system. This message and any attachments
 may contain information that is confidential, privileged or exempt from
 disclosure. Delivery of this message to any person other than the intended
 recipient is not intended to waive any right or privilege. Message
 transmission is not guaranteed to be secure or free of software viruses.

 ***


 ***

 This email message and any attachments are intended solely for the use of
 the addressee. If you are not the intended recipient, you are prohibited
 from reading, disclosing, reproducing, distributing, disseminating or
 otherwise using this transmission. If you have received this message in
 error, please promptly notify the sender by reply email and immediately
 delete this message from your system.
 This message and any attachments may contain information that is
 confidential, privileged or exempt from disclosure. Delivery of this
 message to any person other than the intended recipient is not intended to
 waive any right or privilege. Message transmission is not guaranteed to be
 secure or free of software viruses.

 ***




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


R: Spark as a Library

2014-09-16 Thread Paolo Platter
Hi,

Spark Job Server by Ooyala is the right tool for the job. It exposes a REST API, 
so calling it from a web app is suitable.
It is open source; you can find it on GitHub.

Best

Paolo Platter

From: Ruebenacker, Oliver A mailto:oliver.ruebenac...@altisource.com
Sent: 16/09/2014 21.18
To: Matei Zaharia mailto:matei.zaha...@gmail.com; 
user@spark.apache.org mailto:user@spark.apache.org
Subject: RE: Spark as a Library


 Hello,

  Thanks for the response and great to hear it is possible. But how do I 
connect to Spark without using the submit script?

  I know how to start up a master and some workers and then connect to the 
master by packaging the app that contains the SparkContext and then submitting 
the package with the spark-submit script in standalone-mode. But I don’t want 
to submit the app that contains the SparkContext via the script, because I want 
that app to be running on a web server. So, what are other ways to connect to 
Spark? I can’t find in the docs anything other than using the script. Thanks!

 Best, Oliver

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Tuesday, September 16, 2014 1:31 PM
To: Ruebenacker, Oliver A; user@spark.apache.org
Subject: Re: Spark as a Library

If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.

Matei


On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com) 
wrote:

 Hello,

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 Best, Oliver

Oliver Ruebenacker | Solutions Architect

Altisource™
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617) 728-5582 | ext: 275585
oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com | 
www.Altisource.com

***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***


Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
Hi Sean,

Great catch! Yes I was including Spark as a dependency and it was
making its way into my uber jar.  Following the advice I just found at
Stackoverflow[1],  I marked Spark as a provided dependency and that
appeared to fix my Hadoop client issue.  Thanks for your help!!!
Perhaps the maintainers might consider setting this in the Quickstart
guide pom.xml ( http://spark.apache.org/docs/latest/quick-start.html ).

In summary, here's what worked:
 * Hadoop 2.3 cdh5
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz
 * Spark 1.1 for Hadoop 2.3
http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.3.tgz

pom.xml snippets: https://gist.github.com/ypwais/ff188611d4806aa05ed9

[1] 
http://stackoverflow.com/questions/24747037/how-to-define-a-dependency-scope-in-maven-to-include-a-library-in-compile-run

Thanks everybody!!
-Paul
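
For readers following along, a hedged sketch of the 'provided' scope described
here; this is an illustration only, not the exact contents of the gist above:

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.0</version>
    <scope>provided</scope>
  </dependency>

With this, spark-core is available at compile time but is not bundled into the
uber jar, so the cluster's own Spark (and its Hadoop client) is used at runtime.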


On Tue, Sep 16, 2014 at 3:55 AM, Sean Owen so...@cloudera.com wrote:
 From the caller / application perspective, you don't care what version
 of Hadoop Spark is running on on the cluster. The Spark API you
 compile against is the same. When you spark-submit the app, at
 runtime, Spark is using the Hadoop libraries from the cluster, which
 are the right version.

 So when you build your app, you mark Spark as a 'provided' dependency.
 Therefore in general, no, you do not build Spark for yourself if you
 are a Spark app creator.

 (Of course, your app would care if it were also using Hadoop libraries
 directly. In that case, you will want to depend on hadoop-client, and
 the right version for your cluster, but still mark it as provided.)

 The version Spark is built against only matters when you are deploying
 Spark's artifacts on the cluster to set it up.

 Your error suggests there is still a version mismatch. Either you
 deployed a build that was not compatible, or, maybe you are packaging
 a version of Spark with your app which is incompatible and
 interfering.

 For example, the artifacts you get via Maven depend on Hadoop 1.0.4. I
 suspect that's what you're doing -- packaging Spark(+Hadoop1.0.4) with
 your app, when it shouldn't be packaged.

 Spark works out of the box with just about any modern combo of HDFS and YARN.

 On Tue, Sep 16, 2014 at 2:28 AM, Paul Wais pw...@yelp.com wrote:
 Dear List,

 I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
 reading SequenceFiles.  In particular, I'm seeing:

 Exception in thread main org.apache.hadoop.ipc.RemoteException:
 Server IPC version 7 cannot communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
 ...

 When invoking JavaSparkContext#newAPIHadoopFile().  (With args
 validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
 BytesWritable.class, new Job().getConfiguration() -- Pretty close to
 the unit test here:
 https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
 )


 This error indicates to me that Spark is using an old hadoop client to
 do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
 via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.


 Do I need to explicitly build spark for modern hadoop??  I previously
 had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
 error (server is using version 9, client is using version 4).


 I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
 on spark's site:
  * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
  * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz


 What distro of hadoop is used at Data Bricks?  Are there distros of
 Spark 1.1 and hadoop that should work together out-of-the-box?
 (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)

 Thanks for any help anybody can give me here!
 -Paul

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark processing small files.

2014-09-16 Thread cem
Hi all,

Spark is taking too much time to start the first stage with many small
files in HDFS.

I am reading a folder that contains RC files:

sc.hadoopFile("hdfs://hostname:8020/test_data2gb/",
  classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
  classOf[LongWritable], classOf[BytesRefArrayWritable])

And parse:
 val parsedData = file.map((tuple: (LongWritable, BytesRefArrayWritable))
   => RCFileUtil.getData(tuple._2))

620 3 MB files (2 GB total) take considerably more time to start the first
stage than 200 40 MB files (8 GB total).


Do you have any idea about the reason? Thanks!

Best Regards,
Cem Cayiroglu
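
A hedged sketch of one common mitigation, under the assumption that the overhead
comes from creating and scheduling one task per small file (names follow the
snippet above; the target partition count is arbitrary):

  val file = sc.hadoopFile("hdfs://hostname:8020/test_data2gb/",
    classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
    classOf[LongWritable], classOf[BytesRefArrayWritable])

  // Merge the many per-file partitions into far fewer (no shuffle involved),
  // so the first stage schedules ~48 tasks instead of ~620.
  val parsedData = file.coalesce(48)
    .map { case (_, bytes) => RCFileUtil.getData(bytes) }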


Indexed RDD

2014-09-16 Thread Akshat Aranya
Hi,

I'm trying to implement a custom RDD that essentially works as a
distributed hash table, i.e. the key space is split up into partitions and
within a partition, an element can be looked up efficiently by the key.
However, the RDD lookup() function (in PairRDDFunctions) is implemented in
a way that iterates through all elements of a partition and finds the matching
ones.  Is there a better way to do what I want to do, short of just
implementing new methods on the custom RDD?

Thanks,
Akshat
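
One workaround, as a minimal sketch (an assumption, not a reply from the thread):
build a hash map per partition with mapPartitions and cache it, so that the probe
inside each partition is a constant-time map lookup. Key/value types and the
'pairs' RDD are placeholders.

  import org.apache.spark.rdd.RDD

  // 'pairs' is a hypothetical RDD[(Long, String)], partitioned by key.
  val indexed: RDD[Map[Long, String]] = pairs
    .mapPartitions(iter => Iterator(iter.toMap))
    .cache()

  // A lookup now probes one map per partition instead of scanning every element.
  def lookup(key: Long): Array[String] =
    indexed.flatMap(m => m.get(key).toSeq).collect()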


Re: Categorical Features for K-Means Clustering

2014-09-16 Thread st553
Does MLlib provide utility functions to do this kind of encoding?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Sean Owen
I think it's on the table but not yet merged?
https://issues.apache.org/jira/browse/SPARK-1216
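
Until something like that lands, a hand-rolled one-hot encoding is straightforward.
A minimal sketch, with a hypothetical 'data' RDD of CSV lines whose first column is
the categorical feature:

  import org.apache.spark.rdd.RDD

  // Map each distinct category to an index...
  val categories: RDD[String] = data.map(_.split(",")(0))
  val index: Map[String, Int] = categories.distinct().collect().zipWithIndex.toMap

  // ...then expand each category into a 0/1 indicator vector.
  val encoded: RDD[Array[Double]] = categories.map { c =>
    val v = Array.fill(index.size)(0.0)
    v(index(c)) = 1.0
    v
  }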

On Tue, Sep 16, 2014 at 10:04 PM, st553 sthompson...@gmail.com wrote:
 Does MLlib provide utility functions to do this kind of encoding?



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Memory under-utilization

2014-09-16 Thread francisco
Hi, I'm a Spark newbie.

We had installed spark-1.0.2-bin-cdh4 on a 'super machine' with 256gb memory
and 48 cores. 

Tried to allocate a task with 64gb memory but for whatever reason Spark is
only using around 9gb max.

Submitted spark job with the following command:

/bin/spark-submit -class SimpleApp --master local[16] --executor-memory 64G
/var/tmp/simple-project_2.10-1.0.jar /data/lucene/ns.gz


When I run 'top' command I see only 9gb of memory is used by the spark
process

PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
3047005 fran  30  10 8785m 703m  18m S 112.9  0.3  48:19.63 java


Any idea why this is happening? I've also tried to set the memory
programmatically using
 new SparkConf().set("spark.executor.memory", "64g")  but that also didn't
do anything.

Is there some limitation when running in 'local' mode?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Memory under-utilization

2014-09-16 Thread Boromir Widas
Perhaps your job does not use more than 9g. Even though the dashboard shows
64g, the process only uses what's needed and grows to 64g max.

On Tue, Sep 16, 2014 at 5:40 PM, francisco ftanudj...@nextag.com wrote:

 Hi, I'm a Spark newbie.

 We had installed spark-1.0.2-bin-cdh4 on a 'super machine' with 256gb
 memory
 and 48 cores.

 Tried to allocate a task with 64gb memory but for whatever reason Spark is
 only using around 9gb max.

 Submitted spark job with the following command:
 
 /bin/spark-submit -class SimpleApp --master local[16] --executor-memory 64G
 /var/tmp/simple-project_2.10-1.0.jar /data/lucene/ns.gz
 

 When I run 'top' command I see only 9gb of memory is used by the spark
 process

 PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 3047005 fran  30  10 8785m 703m  18m S 112.9  0.3  48:19.63 java


 Any idea why this is happening? I've also tried to set the memory
 programatically using
  new SparkConf().set(spark.executor.memory, 64g)  but that also
 didn't
 do anything.

 Is there some limitation when running in 'local' mode?

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Questions about Spark speculation

2014-09-16 Thread Nicolas Mai
Hi, guys

My current project is using Spark 0.9.1, and after increasing the level of
parallelism and partitions in our RDDs, stages and tasks seem to complete
much faster. However it also seems that our cluster becomes more unstable
after some time:
- stalled stages still showing under active stages in the Spark app web
dashboard
- incomplete stages showing under completed stages
- stages with failures

I was thinking about reducing/tuning the level of parallelism, but I was
also considering using spark.speculation, which is currently turned off but
seems promising.

Questions about speculation:
- Just wondering why it is turned off by default?
- Are there any risks using speculation?
- Is it possible that a speculative task straggles, and would trigger
another new speculative task to finish the job... and so on... (some kind of
loop until there's no more executors available).
- What configuration do you guys usually use for spark.speculation?
(interval, quantile, multiplier) I guess it depends on the project, it may
give some ideas about how to use it properly.

Thank you! :)
Nicolas
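
For reference, a sketch of the four knobs involved; the values below are just
illustrative starting points, not recommendations from this thread:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.speculation", "true")
    .set("spark.speculation.interval", "100")    // how often (ms) to check for stragglers
    .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking
    .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median counts as slow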



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Spark-speculation-tp14398.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Memory under-utilization

2014-09-16 Thread francisco
Thanks for the reply.

I doubt that's the case though ...  the executor kept having to do a file
dump because memory is full.

...
14/09/16 15:00:18 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
MB to disk (668 times so far)
14/09/16 15:00:21 WARN ExternalAppendOnlyMap: Spilling in-memory map of 66
MB to disk (669 times so far)
14/09/16 15:00:24 WARN ExternalAppendOnlyMap: Spilling in-memory map of 70
MB to disk (670 times so far)
14/09/16 15:00:31 WARN ExternalAppendOnlyMap: Spilling in-memory map of 127
MB to disk (671 times so far)
14/09/16 15:00:43 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
MB to disk (672 times so far)
...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396p14399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How do I manipulate values outside of a GraphX loop?

2014-09-16 Thread crockpotveggies
Brand new to Apache Spark and I'm a little confused about how to make updates to a
value that sits outside of a .mapTriplets iteration in GraphX. I'm aware
mapTriplets is really only for modifying values inside the graph. What about
using it in conjunction with other computations? See below:

def mapTripletsMethod(edgeWeights: Graph[Int, Double],
    stationaryDistribution: Graph[Double, Double]) = {
  val tempMatrix: SparseDoubleMatrix2D = graphToSparseMatrix(edgeWeights)

  stationaryDistribution.mapTriplets { e =>
    val row = e.srcId.toInt
    val column = e.dstId.toInt
    var cellValue = -1 * tempMatrix.get(row, column) + e.dstAttr
    tempMatrix.set(row, column, cellValue) // this doesn't do anything to tempMatrix
    e
  }
}

My first guess is that since the input param for mapTriplets is a function,
then somehow the value for tempMatrix is getting lost or duplicated.

If this really isn't the proper design pattern to do this, does anyone have
any insight on the right way? Implementing some of the more complex
centrality algorithms is rather difficult until I can wrap my head around
this paradigm.

Thank you!
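
The guess above is essentially right: the closure passed to mapTriplets runs on
the executors against serialized copies, so writes to a driver-side tempMatrix are
lost. A minimal sketch of one workaround, reusing the names from the snippet above
as assumptions: compute the values on the cluster and apply them to the matrix back
on the driver.

  // Pull (row, column, dstAttr) triples back to the driver...
  val updates = stationaryDistribution.triplets
    .map(e => (e.srcId.toInt, e.dstId.toInt, e.dstAttr))
    .collect()

  // ...and mutate the local matrix there, where the change is actually visible.
  updates.foreach { case (row, column, dstAttr) =>
    tempMatrix.set(row, column, -1 * tempMatrix.get(row, column) + dstAttr)
  }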



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-manipulate-values-outside-of-a-GraphX-loop-tp14400.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Memory under-utilization

2014-09-16 Thread Boromir Widas
I see, what does http://localhost:4040/executors/ show for memory usage?

I personally find it easier to work with a standalone cluster with a single
worker by using the sbin/start-master.sh and then connecting to the master.

On Tue, Sep 16, 2014 at 6:04 PM, francisco ftanudj...@nextag.com wrote:

 Thanks for the reply.

 I doubt that's the case though ...  the executor kept having to do a file
 dump because memory is full.

 ...
 14/09/16 15:00:18 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
 MB to disk (668 times so far)
 14/09/16 15:00:21 WARN ExternalAppendOnlyMap: Spilling in-memory map of 66
 MB to disk (669 times so far)
 14/09/16 15:00:24 WARN ExternalAppendOnlyMap: Spilling in-memory map of 70
 MB to disk (670 times so far)
 14/09/16 15:00:31 WARN ExternalAppendOnlyMap: Spilling in-memory map of 127
 MB to disk (671 times so far)
 14/09/16 15:00:43 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
 MB to disk (672 times so far)
 ...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396p14399.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-16 Thread Aris
Hello Spark Community -

I am using the support vector machine / SVM implementation in MLlib with
the standard linear kernel; however, I noticed that the Spark documentation
for StandardScaler *specifically* mentions that SVMs which use the RBF
kernel work really well when you have standardized data...

which raises the question: is there some kind of support for RBF kernels
rather than linear kernels? In small data tests using R, the RBF kernel
worked really well, and the linear kernel never converged... so I would really
like to use RBF.

Thank you folks for any help!

Aris


Re: org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist)

2014-09-16 Thread Aris
This should be a really simple problem, but you haven't shared enough code
to determine what's going on here.
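
One common cause, offered here as an assumption since the full code isn't shown:
the job that fails runs its tasks on the executors (the earlier "first" job was
computed locally on the driver and succeeded), and a file:/ path is opened lazily
on each executor, so it must exist at that exact path on every worker node, not
just on the driver. Reading the data from HDFS sidesteps that; a one-line sketch
with a hypothetical HDFS URI:

  val data = sc.textFile("hdfs://namenode:8020/user/root/sample_svm_data.txt")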

On Tue, Sep 16, 2014 at 8:08 AM, Hui Li littleleave...@gmail.com wrote:

 Hi,

 I am new to Spark. I just set up a small cluster and wanted to run some
 simple MLlib examples. By following the instructions of
 https://spark.apache.org/docs/0.9.0/mllib-guide.html#binary-classification-1,
 I could successfully run everything until the step of SVMWithSGD, where I got
 the following error message. I don't know why
 file:/root/test/sample_svm_data.txt does not exist, since I already read it
 out, printed it, converted it into labeled data and passed the parsed
 data to the function SVMWithSGD.

 Any one have the same issue with me?

 Thanks,

 Emily

  val model = SVMWithSGD.train(parsedData, numIterations)
 14/09/16 10:55:21 INFO SparkContext: Starting job: first at
 GeneralizedLinearAlgorithm.scala:121
 14/09/16 10:55:21 INFO DAGScheduler: Got job 11 (first at
 GeneralizedLinearAlgorithm.scala:121) with 1 output partitions
 (allowLocal=true)
 14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 11 (first at
 GeneralizedLinearAlgorithm.scala:121)
 14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
 14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
 14/09/16 10:55:21 INFO DAGScheduler: Computing the requested partition
 locally
 14/09/16 10:55:21 INFO HadoopRDD: Input split:
 file:/root/test/sample_svm_data.txt:0+19737
 14/09/16 10:55:21 INFO SparkContext: Job finished: first at
 GeneralizedLinearAlgorithm.scala:121, took 0.002697478 s
 14/09/16 10:55:21 INFO SparkContext: Starting job: count at
 DataValidators.scala:37
 14/09/16 10:55:21 INFO DAGScheduler: Got job 12 (count at
 DataValidators.scala:37) with 2 output partitions (allowLocal=false)
 14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 12 (count at
 DataValidators.scala:37)
 14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
 14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
 14/09/16 10:55:21 INFO DAGScheduler: Submitting Stage 12 (FilteredRDD[26]
 at filter at DataValidators.scala:37), which has no missing parents
 14/09/16 10:55:21 INFO DAGScheduler: Submitting 2 missing tasks from Stage
 12 (FilteredRDD[26] at filter at DataValidators.scala:37)
 14/09/16 10:55:21 INFO TaskSchedulerImpl: Adding task set 12.0 with 2 tasks
 14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:0 as TID 24 on
 executor 2: eecvm0206.demo.sas.com (PROCESS_LOCAL)
 14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:0 as 1733
 bytes in 0 ms
 14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:1 as TID 25 on
 executor 5: eecvm0203.demo.sas.com (PROCESS_LOCAL)
 14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:1 as 1733
 bytes in 0 ms
 14/09/16 10:55:21 WARN TaskSetManager: Lost TID 24 (task 12.0:0)
 14/09/16 10:55:21 WARN TaskSetManager: Loss was due to
 java.io.FileNotFoundException
 java.io.FileNotFoundException: File file:/root/test/sample_svm_data.txt
 does not exist
 at
 org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
 at
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
 at
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:137)
 at
 org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
 at
 org.apache.hadoop.mapred.LineRecordReader.init(LineRecordReader.java:108)
 at
 org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
 at
 org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
 at 

Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.



Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
distribution. Everything went fine, and everything seems to work, but for
the following.

Following are two invocations of the 'pyspark' script, one with enclosing
quotes around the options passed to '--driver-java-options', and one without
them. I added the following one-liner to the 'pyspark' script to show my
problem...

ADDED: echo xxx${PYSPARK_SUBMIT_ARGS}xxx # Added after the line that
exports this variable.


=

FIRST:
[ without enclosing quotes ]:

user@linux$ pyspark --master yarn-client --driver-java-options 
-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1Gxxx --- the echo statement shows the option truncation.


While this succeeds in getting to a pyspark shell prompt (sc), the
context isn't set up properly because, as seen above and below, only the
first option took effect. (Note spark.executor.memory is correct, but that's
only because my spark defaults coincide with it.)

14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java 
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp 
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89' 
'-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G' 
'-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars=' 
'-Dspark.submit.pyFiles=' 
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' 
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress=' 
'-Dspark.app.name=PySparkShell' 
'-Dspark.driver.appUIAddress=dstorm:4040' 
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G' 
'-Dspark.fileserver.uri=http://192.168.0.16:60305' 
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client' 
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  
null  --arg  'dstorm:44616' --executor-memory 1024 --executor-cores 1 
--num-executors  2 1 LOG_DIR/stdout 2 LOG_DIR/stderr


(Note: I happen to notice that 'spark.driver.memory' is missing as well).

===

NEXT:

[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options 
'-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options 
-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jarxxx


While this does have all the options (shown in the red echo output above 
and the command executed below), pyspark invocation fails, indicating

that the application ended before I got to a shell prompt.
See below snippet.

14/09/16 17:44:12 INFO yarn.Client:   command: $JAVA_HOME/bin/java 
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp 
'-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada' 
'-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M' 
'-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' 
'-Dspark.serializer.objectStreamReset=100' 
'-Dspark.executor.instances=3' '-Dspark.rdd.compress=True' 
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles=' 
'-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm' 
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' 
'-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell' 
'-Dspark.driver.appUIAddress=dstorm:8468' 
'-Dspark.yarn.executor.memoryOverhead=512M' 
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G 
-Dspark.ui.port=8468 -Dspark.driver.memory=512M 
-Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' 
'-Dspark.fileserver.uri=http://192.168.0.16:54171' 
'-Dspark.master=yarn-client' '-Dspark.driver.port=58542' 
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  
null  --arg  'dstorm:58542' --executor-memory 1024 --executor-cores 1 
--num-executors  3 1 LOG_DIR/stdout 2 LOG_DIR/stderr



[ ... SNIP ... ]
14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:

 appMasterRpcPort: -1
 appStartTime: 1410903852044
 yarnAppState: ACCEPTED

14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:

 appMasterRpcPort: -1
 appStartTime: 

partitioned groupBy

2014-09-16 Thread Akshat Aranya
I have a use case where my RDD is set up such:

Partition 0:
K1 -> [V1, V2]
K2 -> [V2]

Partition 1:
K3 -> [V1]
K4 -> [V3]

I want to invert this RDD, but only within a partition, so that the
operation does not require a shuffle.  It doesn't matter if the partitions
of the inverted RDD have non unique keys across the partitions, for example:

Partition 0:
V1 -> [K1]
V2 -> [K1, K2]

Partition 1:
V1 -> [K3]
V3 -> [K4]

Is there a way to do only a per-partition groupBy, instead of shuffling the
entire data?


Re: partitioned groupBy

2014-09-16 Thread Patrick Wendell
If each partition can fit in memory, you can do this using
mapPartitions and then building an inverse mapping within each
partition. You'd need to construct a hash map within each partition
yourself.
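
A minimal sketch of that approach, assuming each partition's inverse map fits in
memory (key/value types are placeholders):

  import scala.collection.mutable
  import org.apache.spark.rdd.RDD

  // rdd: RDD[(String, Seq[String])], e.g. K1 -> [V1, V2] within a partition.
  def invertWithinPartitions(rdd: RDD[(String, Seq[String])]): RDD[(String, Seq[String])] =
    rdd.mapPartitions { iter =>
      val inverse = mutable.Map.empty[String, mutable.ArrayBuffer[String]]
      for ((k, vs) <- iter; v <- vs)
        inverse.getOrElseUpdate(v, mutable.ArrayBuffer.empty[String]) += k
      inverse.iterator.map { case (v, ks) => (v, ks.toSeq) }
    }

No shuffle is involved, so duplicate keys across partitions (V1 above) simply stay
in their own partitions.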

On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote:
 I have a use case where my RDD is set up such:

 Partition 0:
 K1 - [V1, V2]
 K2 - [V2]

 Partition 1:
 K3 - [V1]
 K4 - [V3]

 I want to invert this RDD, but only within a partition, so that the
 operation does not require a shuffle.  It doesn't matter if the partitions
 of the inverted RDD have non unique keys across the partitions, for example:

 Partition 0:
 V1 - [K1]
 V2 - [K1, K2]

 Partition 1:
 V1 - [K3]
 V3 - [K4]

 Is there a way to do only a per-partition groupBy, instead of shuffling the
 entire data?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Memory under-utilization

2014-09-16 Thread francisco
Thanks for the tip.

http://localhost:4040/executors/ is showing 
Executors(1)
Memory: 0.0 B used (294.9 MB Total)
Disk: 0.0 B Used

However, running as standalone cluster does resolve the problem.
I can see a worker process running w/ the allocated memory.

My conclusion (I may be wrong) is that in 'local' mode the 'executor-memory'
parameter is not honored.

Thanks again for the help!
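
That conclusion matches how local mode works: driver and executor share a single
JVM, so the heap is governed by the driver memory setting rather than
--executor-memory. A hedged example of the invocation that should give the process
a 64g heap, reusing the jar and arguments from the earlier post:

  /bin/spark-submit --class SimpleApp --master local[16] --driver-memory 64G \
    /var/tmp/simple-project_2.10-1.0.jar /data/lucene/ns.gz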






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396p14409.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



how to report documentation bug?

2014-09-16 Thread Andy Davidson

http://spark.apache.org/docs/latest/quick-start.html#standalone-applications

Click on the Java tab. There is a bug in the Maven section:

  <version>1.1.0-SNAPSHOT</version>

Should be:

  <version>1.1.0</version>

Hope this helps

Andy




RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Cheng, Hao
Thank you Yin Huai. This is probably true.

I saw in hive-site.xml that Liu has changed the entry, whose default should
be false.

  <property>
    <name>hive.support.concurrency</name>
    <description>Enable Hive's Table Lock Manager Service</description>
    <value>true</value>
  </property>

Someone is working on upgrading Hive to 0.13 for SparkSQL
(https://github.com/apache/spark/pull/2241); not sure if you can wait for this.
☺

From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Wednesday, September 17, 2014 1:50 AM
To: Cheng, Hao
Cc: linkpatrickliu; u...@spark.incubator.apache.org
Subject: Re: SparkSQL 1.1 hang when DROP or LOAD

I meant it may be a Hive bug since we also call Hive's drop table internally.

On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai 
huaiyin@gmail.commailto:huaiyin@gmail.com wrote:
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?

On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao 
hao.ch...@intel.commailto:hao.ch...@intel.com wrote:
Thank you for pasting the steps, I will look at this, hopefully come out with a 
solution soon.

-Original Message-
From: linkpatrickliu 
[mailto:linkpatrick...@live.commailto:linkpatrick...@live.com]
Sent: Tuesday, September 16, 2014 3:17 PM
To: u...@spark.incubator.apache.orgmailto:u...@spark.incubator.apache.org
Subject: RE: SparkSQL 1.1 hang when DROP or LOAD
Hi, Hao Cheng.

I have done other tests. And the result shows the thriftServer can connect to 
Zookeeper.

However, I found some more interesting things. And I think I have found a bug!

Test procedure:
Test1:
(0) Use beeline to connect to thriftServer.
(1) Switch database use dw_op1; (OK)
The logs show that the thriftServer connected with Zookeeper and acquired locks.

(2) Drop table: drop table src; (Blocked) The logs show that the thriftServer
is in acquireReadWriteLocks.

Doubt:
The reason why I cannot drop table src is that the first SQL, use dw_op1,
left locks in ZooKeeper that were not successfully released.
So when the second SQL tries to acquire locks in ZooKeeper, it blocks.

Test2:
Restart thriftServer.
Instead of switching to another database, I just drop the table in the default 
database;
(0) Restart thriftServer  use beeline to connect to thriftServer.
(1) Drop table drop table src; (OK)
Amazing! Succeed!
(2) Drop again!  drop table src2; (Blocked) Same error: the thriftServer is
blocked in the acquireReadWriteLocks
phase.

As you can see, only the first SQL requiring locks can succeed.
So I think the reason is that the thriftServer cannot release locks correctly
in ZooKeeper.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org For 
additional commands, e-mail: 
user-h...@spark.apache.orgmailto:user-h...@spark.apache.org


-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.orgmailto:user-h...@spark.apache.org




Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread daijia
Is there some way to ship textfile just like ship python libraries?

Thanks in advance
Daijia



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p14412.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to report documentation bug?

2014-09-16 Thread Nicholas Chammas
You can send an email like you just did or open an issue in the Spark issue
tracker http://issues.apache.org/jira/. This looks like a problem with
how the version is generated in this file
https://github.com/apache/spark/blob/branch-1.1/docs/quick-start.md.

On Tue, Sep 16, 2014 at 8:55 PM, Andy Davidson 
a...@santacruzintegration.com wrote:



 http://spark.apache.org/docs/latest/quick-start.html#standalone-applications

 Click on java tab There is a bug in the maven section

   version1.1.0-SNAPSHOT/version



 Should be
 version1.1.0/version

 Hope this helps

 Andy



Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.

Hi Sandy:

Thank you. I have not tried that mechanism (I wasn't aware of it). I will
try that instead.


Is it possible to also represent '--driver-memory' and '--executor-memory'
(and basically all properties) using the '--conf' directive?

The reason: I actually discovered the below issue while writing a custom
PYTHONSTARTUP script that I use to launch *bpython* or *python* or my *WING
python IDE* with. That script reads a python *dict* (from a file) containing
key/value pairs, from which it constructs the --driver-java-options ...; I
will now switch to generating '--conf key1=val1 --conf key2=val2 --conf
key3=val3' (and so on) instead.

If all of the properties could be represented in this way, it makes the code
cleaner (everything in the dict file, and no one-offs).

Either way, thank you. =:)

Noel,
team didata


On 09/16/2014 08:03 PM, Sandy Ryza wrote:

Hi team didata,

This doesn't directly answer your question, but with Spark 1.1,
instead of using the driver options, it's better to pass your Spark
properties using the --conf option.


E.g.
pyspark --master yarn-client --conf spark.shuffle.spill=true --conf 
spark.yarn.executor.memoryOverhead=512M


Additionally, executor and driver memory have dedicated options:

pyspark --master yarn-client --conf spark.shuffle.spill=true --conf 
spark.yarn.executor.memoryOverhead=512M --driver-memory 3G 
--executor-memory 5G


-Sandy


On Tue, Sep 16, 2014 at 6:22 PM, Dimension Data, LLC. 
subscripti...@didata.us mailto:subscripti...@didata.us wrote:




Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
distribution. Everything went fine, and everything seems
to work, but for the following.

Following are two invocations of the 'pyspark' script, one with
enclosing quotes around the options passed to
'--driver-java-options', and one without them. I added the
following one-line in the 'pyspark' script to
show my problem...

ADDED: echo xxx${PYSPARK_SUBMIT_ARGS}xxx # Added after the line
that exports this variable.

=

FIRST:
[ without enclosing quotes ]:

user@linux$ pyspark --master yarn-client --driver-java-options
-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3

-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1Gxxx  --- echo statement show option
truncation.

While this succeeds in getting to a pyspark shell prompt (sc), the
context isn't setup properly because, as seen
in red above and below, all but the first option took effect.
(Note spark.executor.memory is correct but that's only because
my spark defaults coincide with it.)

14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89'
'-Dspark.serializer.objectStreamReset=100'
'-Dspark.executor.memory=1G' '-Dspark.rdd.compress=True'
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress='
'-Dspark.app.name=PySparkShell'
'-Dspark.driver.appUIAddress=dstorm:4040'
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G'
'-Dspark.fileserver.uri=http://192.168.0.16:60305'
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client'
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused'
--jar  null  --arg 'dstorm:44616' --executor-memory 1024
--executor-cores 1 --num-executors 2 1 LOG_DIR/stdout 2
LOG_DIR/stderr

(Note: I happen to notice that 'spark.driver.memory' is missing as
well).

===

NEXT:

[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options
'-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3

-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3

-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jarxxx

While this does have all the options (shown in the red echo output
above and the command executed below), pyspark invocation fails,
indicating
that the application ended before I got 

CPU RAM

2014-09-16 Thread VJ Shalish
Hi

I need to get the CPU utilisation, RAM usage, network IO and other metrics
using a Java program. Can anyone help me with this?

Thanks
Shalish.


Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread Davies Liu
Yes, sc.addFile() is what you want:

 |  addFile(self, path)
 |  Add a file to be downloaded with this Spark job on every node.
 |  The C{path} passed can be either a local file, a file in HDFS
 |  (or other Hadoop-supported filesystems), or an HTTP, HTTPS or
 |  FTP URI.
 |
 |  To access the file in Spark jobs, use
 |  L{SparkFiles.get(fileName)pyspark.files.SparkFiles.get} with the
 |  filename to find its download location.
 |
 |  >>> from pyspark import SparkFiles
 |  >>> path = os.path.join(tempdir, "test.txt")
 |  >>> with open(path, "w") as testFile:
 |  ...    testFile.write("100")
 |  >>> sc.addFile(path)
 |  >>> def func(iterator):
 |  ...    with open(SparkFiles.get("test.txt")) as testFile:
 |  ...        fileVal = int(testFile.readline())
 |  ...    return [x * fileVal for x in iterator]
 |  >>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
 |  [100, 200, 300, 400]

On Tue, Sep 16, 2014 at 7:02 PM, daijia jia_...@intsig.com wrote:
 Is there some way to ship textfile just like ship python libraries?

 Thanks in advance
 Daijia



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p14412.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: CPU RAM

2014-09-16 Thread Amit
Not particularly related to Spark, but you can check out the SIGAR API. It lets
you get CPU, memory, network, filesystem and process-based metrics.

Amit
On Sep 16, 2014, at 20:14, VJ Shalish vjshal...@gmail.com wrote:

 Hi
  
 I need to get the CPU utilisation, RAM usage, Network IO and other metrics 
 using Java program. Can anyone help me on this?
  
 Thanks
 Shalish.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: CPU RAM

2014-09-16 Thread VJ Shalish
Thank you for the response, Amit.
So is it that we cannot measure the CPU consumption and RAM usage of a Spark
job through a Java program?

On Tue, Sep 16, 2014 at 11:23 PM, Amit kumarami...@gmail.com wrote:

 Not particularly related to Spark, but you can check out SIGAR API. It
 let's you get CPU, Memory, Network, Filesystem and process based metrics.

 Amit
 On Sep 16, 2014, at 20:14, VJ Shalish vjshal...@gmail.com wrote:

  Hi
 
  I need to get the CPU utilisation, RAM usage, Network IO and other
 metrics using Java program. Can anyone help me on this?
 
  Thanks
  Shalish.



Re: CPU RAM

2014-09-16 Thread VJ Shalish
Sorry for the confusion, team.
My requirement is to measure the CPU utilisation, RAM usage, network IO and
other metrics of a SPARK JOB using a Java program.
Please help with the same.

On Tue, Sep 16, 2014 at 11:23 PM, Amit kumarami...@gmail.com wrote:

 Not particularly related to Spark, but you can check out SIGAR API. It
 let's you get CPU, Memory, Network, Filesystem and process based metrics.

 Amit
 On Sep 16, 2014, at 20:14, VJ Shalish vjshal...@gmail.com wrote:

  Hi
 
  I need to get the CPU utilisation, RAM usage, Network IO and other
 metrics using Java program. Can anyone help me on this?
 
  Thanks
  Shalish.



YARN mode not available error

2014-09-16 Thread Barrington
Hi,

I am running Spark in cluster mode with Hadoop YARN as the underlying
cluster manager. I get this error when trying to initialize the
SparkContext. 


Exception in thread main org.apache.spark.SparkException: YARN mode not
available ?
at
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1586)
at org.apache.spark.SparkContext.init(SparkContext.scala:310)
at org.apache.spark.SparkContext.init(SparkContext.scala:86)
at LascoScript$.main(LascoScript.scala:24)
at LascoScript.main(LascoScript.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.scheduler.cluster.YarnClientClusterScheduler
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1580)




My build.sbt file  looks like this:



name := "LascoScript"

version := "1.0"

scalaVersion := "2.10.4"

val excludeJBossNetty = ExclusionRule(organization = "org.jboss.netty")
val excludeMortbayJetty = ExclusionRule(organization = "org.eclipse.jetty",
  artifact = "jetty-server")
val excludeAsm = ExclusionRule(organization = "org.ow2.asm")
val excludeCommonsLogging = ExclusionRule(organization = "commons-logging")
val excludeSLF4J = ExclusionRule(organization = "org.slf4j")
val excludeOldAsm = ExclusionRule(organization = "asm")
val excludeServletApi = ExclusionRule(organization = "javax.servlet",
  artifact = "servlet-api")

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" excludeAll(
  excludeServletApi, excludeMortbayJetty
)

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.5.1" excludeAll(
  excludeJBossNetty, excludeMortbayJetty, excludeAsm, excludeCommonsLogging,
  excludeSLF4J, excludeOldAsm, excludeServletApi
)

libraryDependencies += "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"

libraryDependencies += "org.eclipse.jetty" % "jetty-server" % "8.1.16.v20140903"

unmanagedJars in Compile ++= {
  val base = baseDirectory.value
  val baseDirectories = (base / "lib") +++ (base)
  val customJars = (baseDirectories ** "*.jar")
  customJars.classpath
}

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"



How can I fix this issue?

- Barrington
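
The ClassNotFoundException for YarnClientClusterScheduler suggests the YARN support
classes are not on the classpath when the SparkContext is created directly (here,
from IntelliJ's AppMain rather than spark-submit). One hedged fix, assuming the
spark-yarn artifact is published for your Spark/Scala version, is to add it to
build.sbt; the usual alternative is to launch through spark-submit against a
YARN-enabled Spark distribution.

  libraryDependencies += "org.apache.spark" %% "spark-yarn" % "1.1.0"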



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/YARN-mode-not-available-error-tp14420.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-16 Thread edmond_huo
Hi,

I am new to Spark. I tried to run a job like the wordcount example in
Python. But when I tried to get the top 10 popular words in the file, I got
the message: AttributeError: 'PipelinedRDD' object has no attribute
'sortByKey'.

So my question is what is the difference between PipelinedRDD and RDD? and
if I want to sort the data in PipelinedRDD, how can I do it?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-difference-between-pyspark-rdd-PipelinedRDD-and-pyspark-rdd-RDD-tp14421.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



permission denied on local dir

2014-09-16 Thread style95
I am running spark on shared yarn cluster.
My user ID is 'online', but I found that when I run my Spark application,
local directories are created by the 'yarn' user ID.
So I am unable to delete the local directories, and finally the application failed.

Please refer to my log below:

14/09/16 21:59:02 ERROR DiskBlockManager: Exception while deleting local
spark dir:
/hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7
java.io.IOException: Failed to list files for dir:
/hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7/3a
at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:580)
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:592)
at
org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:593)
at
org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:592)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:592)
at
org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:163)
at
org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:160)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:160)
at
org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:153)
at
org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:151)
at
org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:151)
at
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at
org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:151)


I am unable to access
/hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7
e.g.) ls
/hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7
does not work; permission denied occurs.

I am using spark-1.0.0 and yarn 2.4.0.

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/permission-denied-on-local-dir-tp14422.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org