Re: Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-16 Thread Kevin Peng
Sean,

Thanks.  That worked.

Kevin

On Mon, Sep 15, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote:

 This is more of a Java / Maven issue than Spark per se. I would use
 the shade plugin to remove signature files in your final META-INF/
 dir. As Spark does, in its configuration:

 <filters>
   <filter>
     <artifact>*:*</artifact>
     <excludes>
       <exclude>org/datanucleus/**</exclude>
       <exclude>META-INF/*.SF</exclude>
       <exclude>META-INF/*.DSA</exclude>
       <exclude>META-INF/*.RSA</exclude>
     </excludes>
   </filter>
 </filters>

 On Mon, Sep 15, 2014 at 11:33 PM, kpeng1 kpe...@gmail.com wrote:
  Hi All,
 
  I am trying to submit a spark job that I have built in maven using the
  following command:
  /usr/bin/spark-submit --deploy-mode client --class com.spark.TheMain
  --master local[1] /home/cloudera/myjar.jar 100
 
  But I seem to be getting the following error:
  Exception in thread main java.lang.SecurityException: Invalid signature
  file digest for Manifest main attributes
  at
 
 sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:286)
  at
 
 sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:239)
  at java.util.jar.JarVerifier.processEntry(JarVerifier.java:307)
  at java.util.jar.JarVerifier.update(JarVerifier.java:218)
  at java.util.jar.JarFile.initializeVerifier(JarFile.java:345)
  at java.util.jar.JarFile.getInputStream(JarFile.java:412)
  at
 sun.misc.URLClassPath$JarLoader$2.getInputStream(URLClassPath.java:775)
  at sun.misc.Resource.cachedInputStream(Resource.java:77)
  at sun.misc.Resource.getByteBuffer(Resource.java:160)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:436)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:270)
  at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:289)
  at
 org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 
 
  Here is the pom file I am using to build the jar:
  <project xmlns="http://maven.apache.org/POM/4.0.0"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
           http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.spark</groupId>
    <artifactId>myjar</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>${project.artifactId}</name>
    <description>My wonderfull scala app</description>
    <inceptionYear>2010</inceptionYear>
    <licenses>
      <license>
        <name>My License</name>
        <url>http://</url>
        <distribution>repo</distribution>
      </license>
    </licenses>
 
    <properties>
      <cdh.version>cdh5.1.0</cdh.version>
      <maven.compiler.source>1.6</maven.compiler.source>
      <maven.compiler.target>1.6</maven.compiler.target>
      <encoding>UTF-8</encoding>
      <scala.tools.version>2.10</scala.tools.version>
      <scala.version>2.10.4</scala.version>
    </properties>
 
    <repositories>
      <repository>
        <id>scala-tools.org</id>
        <name>Scala-tools Maven2 Repository</name>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
      </repository>
      <repository>
        <id>maven-hadoop</id>
        <name>Hadoop Releases</name>
        <url>https://repository.cloudera.com/content/repositories/releases/</url>
      </repository>
      <repository>
        <id>cloudera-repos</id>
        <name>Cloudera Repos</name>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
      </repository>
    </repositories>
    <pluginRepositories>
      <pluginRepository>
        <id>scala-tools.org</id>
        <name>Scala-tools Maven2 Repository</name>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
      </pluginRepository>
    </pluginRepositories>
 
    <dependencies>
      <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.0.0-${cdh.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-tools_2.10</artifactId>
        <version>1.0.0-${cdh.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-flume_2.10</artifactId>

Re: spark and mesos issue

2014-09-16 Thread Gurvinder Singh
It might not be related only to the memory issue. The memory issue is also
there, as you mentioned; I have seen that one too. The fine-grained mode issue
is mainly Spark considering that it got two different block managers
for the same ID, whereas if I search for the ID in the mesos slave, it
exists only on one slave, not on multiple of them. This might be
due to the size of the ID, as Spark outputs the error as

14/09/16 08:04:29 ERROR BlockManagerMasterActor: Got two different
block manager registrations on 20140822-112818-711206558-5050-25951-0

whereas in the mesos slave I see logs as

I0915 20:55:18.293903 31434 containerizer.cpp:392] Starting container
'3aab2237-d32f-470d-a206-7bada454ad3f' for executor
'20140822-112818-711206558-5050-25951-0' of framework
'20140822-112818-711206558-5050-25951-0053'

I0915 20:53:28.039218 31437 containerizer.cpp:392] Starting container
'fe4b344f-16c9-484a-9c2f-92bd92b43f6d' for executor
'20140822-112818-711206558-5050-25951-0' of framework
'20140822-112818-711206558-5050-25951-0050'


You can see the last 3 digits of the ID are missing in Spark, whereas they are
different in the mesos slaves.

- Gurvinder
On 09/15/2014 11:13 PM, Brenden Matthews wrote:
 I started hitting a similar problem, and it seems to be related to 
 memory overhead and tasks getting OOM killed.  I filed a ticket
 here:
 
 https://issues.apache.org/jira/browse/SPARK-3535
 
 On Wed, Jul 16, 2014 at 5:27 AM, Ray Rodriguez
 rayrod2...@gmail.com mailto:rayrod2...@gmail.com wrote:
 
 I'll set some time aside today to gather and post some logs and 
 details about this issue from our end.
 
 
 On Wed, Jul 16, 2014 at 2:05 AM, Vinod Kone vinodk...@gmail.com 
 mailto:vinodk...@gmail.com wrote:
 
 
 
 
 On Tue, Jul 15, 2014 at 11:02 PM, Vinod Kone vi...@twitter.com 
 mailto:vi...@twitter.com wrote:
 
 
 On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh 
 gurvinder.si...@uninett.no mailto:gurvinder.si...@uninett.no
 wrote:
 
 ERROR storage.BlockManagerMasterActor: Got two different block
 manager registrations on 201407031041-1227224054-5050-24004-0
 
 Googling about it seems that mesos is starting slaves at the same
 time and giving them the same id. So may bug in mesos ?
 
 
 Has this issue been resolved? We need more information to triage
 this. Maybe some logs that show the lifecycle of the duplicate
 instances?
 
 
 @vinodkone
 
 
 
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread linkpatrickliu
Seems like the thriftServer cannot connect to Zookeeper, so it cannot get the
lock.

This is how the log looks when I run SparkSQL:
load data inpath 'kv1.txt' into table src;
log:
14/09/16 14:40:47 INFO Driver: PERFLOG method=acquireReadWriteLocks
14/09/16 14:40:47 INFO ClientCnxn: Opening socket connection to server
SVR4044HW2285.hadoop.lpt.qa.nt.ctripcorp.com/10.2.4.191:2181. Will not
attempt to authenticate using SASL (unknown error)
14/09/16 14:40:47 INFO ClientCnxn: Socket connection established to
SVR4044HW2285.hadoop.lpt.qa.nt.ctripcorp.com/10.2.4.191:2181, initiating
session
14/09/16 14:40:47 INFO ClientCnxn: Session establishment complete on server
SVR4044HW2285.hadoop.lpt.qa.nt.ctripcorp.com/10.2.4.191:2181, sessionid =
0x347c1b1f78d495e, negotiated timeout = 18
14/09/16 14:40:47 INFO Driver: /PERFLOG method=acquireReadWriteLocks
start=1410849647447 end=1410849647457 duration=10

You can see that, between the PERFLOG entries for acquireReadWriteLocks, the ClientCnxn
will try to connect to Zookeeper. After the connection has been successfully
established, the acquireReadWriteLocks phase can finish.

But, when I run the ThriftServer, and run the same SQL.
Here is the log:
14/09/16 14:40:09 INFO Driver: PERFLOG method=acquireReadWriteLocks

It will wait here.
So I suspect the reason why DROP or LOAD fails in thriftServer mode is
that the thriftServer cannot connect to Zookeeper.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14336.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
hi,

I am trying to write some unit test, following spark programming guide
http://spark.apache.org/docs/latest/programming-guide.html#unit-testing.
but I observed that the unit test runs very slowly (the code is just a SparkPi), so I
turned the log level to trace and looked through the log output, and found that
creation of the SparkContext seems to take most of the time.

there are some actions that take a lot of time:
1, seems starting jetty
14:04:55.038 [ScalaTest-run-running-MySparkPiSpec] DEBUG
o.e.jetty.servlet.ServletHandler -
servletNameMap={org.apache.spark.ui.JettyUtils$$anon$1-672f11c2=org.apache.spark.ui.JettyUtils$$anon$1-672f11c2}
14:05:25.121 [ScalaTest-run-running-MySparkPiSpec] DEBUG
o.e.jetty.util.component.Container - Container
org.eclipse.jetty.server.Server@e3cee7b +
SelectChannelConnector@0.0.0.0:4040 as connector

2, I don't know what this is:
14:05:25.202 [ScalaTest-run-running-MySparkPiSpec] DEBUG
org.apache.hadoop.security.Groups - Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
14:05:54.594 [spark-akka.actor.default-dispatcher-4] TRACE
o.a.s.s.BlockManagerMasterActor - Checking for hosts with no recent heart
beats in BlockManagerMaster.


is there any way to make this faster for unit tests?
I also notice in spark's unit tests that there exists a SharedSparkContext
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SharedSparkContext.scala,
is this the suggested way to work around this problem? if so, I would
suggest documenting this in the programming guide.
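
For reference, a minimal sketch of such a shared-context trait (assuming ScalaTest; the trait name and settings below are illustrative, not Spark's own SharedSparkContext):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

// Mix this trait into test suites so that a single SparkContext is created
// once per suite instead of once per test case.
trait SharedLocalSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient private var _sc: SparkContext = _
  def sc: SparkContext = _sc

  override def beforeAll(): Unit = {
    super.beforeAll()
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName(getClass.getSimpleName)
    _sc = new SparkContext(conf)
  }

  override def afterAll(): Unit = {
    try {
      if (_sc != null) _sc.stop()
      _sc = null
    } finally {
      super.afterAll()
    }
  }
}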


RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread linkpatrickliu
Hi, Hao Cheng.

I have done other tests. And the result shows the thriftServer can connect
to Zookeeper.

However, I found some more interesting things. And I think I have found a
bug!

Test procedure:
Test1:
(0) Use beeline to connect to thriftServer.
(1) Switch database: "use dw_op1;" (OK)
The logs show that the thriftServer connected with Zookeeper and acquired
locks.

(2) Drop table: "drop table src;" (Blocked)
The logs show that the thriftServer is stuck in acquireReadWriteLocks.

Doubt:
The reason why I cannot drop table src is because the first SQL ("use dw_op1")
left locks in Zookeeper that were never successfully released.
So when the second SQL is acquiring locks in Zookeeper, it will block.

Test2:
Restart thriftServer.
Instead of switching to another database, I just drop the table in the
default database:
(0) Restart thriftServer & use beeline to connect to thriftServer.
(1) Drop table: "drop table src;" (OK)
Amazing! It succeeds!
(2) Drop again: "drop table src2;" (Blocked)
Same error: the thriftServer is blocked in the acquireReadWriteLocks
phase.

As you can see, only the first SQL requiring locks can succeed.
So I think the reason is that the thriftServer cannot release locks
correctly in Zookeeper.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
Thanks Christian!  I tried compiling from source but am still getting the
same hadoop client version error when reading from HDFS.  Will have to poke
deeper... perhaps I've got some classpath issues.  FWIW I compiled using:

$ MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

and hadoop 2.3 / cdh5 from
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz





On Mon, Sep 15, 2014 at 6:49 PM, Christian Chua cc8...@icloud.com wrote:

 Hi Paul.

 I would recommend building your own 1.1.0 distribution.

 ./make-distribution.sh --name hadoop-personal-build-2.4 --tgz -Pyarn
 -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests



 I downloaded the Pre-build for Hadoop 2.4 binary, and it had this
 strange behavior where

 spark-submit --master yarn-cluster ...

 will work, but

 spark-submit --master yarn-client ...

 will fail.


 But on the personal build obtained from the command above, both will then
 work.


 -Christian




 On Sep 15, 2014, at 6:28 PM, Paul Wais pw...@yelp.com wrote:

 Dear List,

 I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
 reading SequenceFiles.  In particular, I'm seeing:

 Exception in thread main org.apache.hadoop.ipc.RemoteException:
 Server IPC version 7 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
...

 When invoking JavaSparkContext#newAPIHadoopFile().  (With args
 validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
 BytesWritable.class, new Job().getConfiguration() -- Pretty close to
 the unit test here:

 https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
 )


 This error indicates to me that Spark is using an old hadoop client to
 do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
 via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.


 Do I need to explicitly build spark for modern hadoop??  I previously
 had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
 error (server is using version 9, client is using version 4).


 I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
 on spark's site:
 * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
 * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz


 What distro of hadoop is used at Data Bricks?  Are there distros of
 Spark 1.1 and hadoop that should work together out-of-the-box?
 (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)

 Thanks for any help anybody can give me here!
 -Paul

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: SchemaRDD saveToCassandra

2014-09-16 Thread lmk
Hi Michael,
Please correct me if I am wrong. The error seems to originate from spark
only. Please have a look at the stack trace of the error which is as
follows:

[error] (run-main-0) java.lang.NoSuchMethodException: Cannot resolve any
suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row
java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for
class org.apache.spark.sql.catalyst.expressions.Row
at
com.datastax.spark.connector.rdd.reader.AnyObjectFactory$$anonfun$resolveConstructor$2.apply(AnyObjectFactory.scala:134)
at
com.datastax.spark.connector.rdd.reader.AnyObjectFactory$$anonfun$resolveConstructor$2.apply(AnyObjectFactory.scala:134)
at scala.util.Try.getOrElse(Try.scala:77)
at
com.datastax.spark.connector.rdd.reader.AnyObjectFactory$.resolveConstructor(AnyObjectFactory.scala:133)
at
com.datastax.spark.connector.mapper.ReflectionColumnMapper.columnMap(ReflectionColumnMapper.scala:46)
at
com.datastax.spark.connector.writer.DefaultRowWriter.init(DefaultRowWriter.scala:20)
at
com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:109)
at
com.datastax.spark.connector.writer.DefaultRowWriter$$anon$2.rowWriter(DefaultRowWriter.scala:107)
at
com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:171)
at
com.datastax.spark.connector.RDDFunctions.tableWriter(RDDFunctions.scala:76)
at
com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:56)
at ScalaSparkCassandra$.main(ScalaSparkCassandra.scala:66)
at ScalaSparkCassandra.main(ScalaSparkCassandra.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)

Please clarify.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-saveToCassandra-tp13951p14340.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
I connected my sample project to a hosted CI service; it only takes 3 seconds
to run there... while the same tests take 2 minutes on my MacBook Pro. So
maybe this is a Mac OS specific problem?

On Tue, Sep 16, 2014 at 3:06 PM, 诺铁 noty...@gmail.com wrote:

 hi,

 I am trying to write some unit test, following spark programming guide
 http://spark.apache.org/docs/latest/programming-guide.html#unit-testing
 .
 but I observed unit test runs very slow(the code is just a SparkPi),  so I
 turn log level to trace and look through the log output.  and found
 creation of SparkContext seems to take most time.

 there are some action take a lot of time:
 1, seems starting jetty
 14:04:55.038 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 o.e.jetty.servlet.ServletHandler -
 servletNameMap={org.apache.spark.ui.JettyUtils$$anon$1-672f11c2=org.apache.spark.ui.JettyUtils$$anon$1-672f11c2}
 14:05:25.121 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 o.e.jetty.util.component.Container - Container
 org.eclipse.jetty.server.Server@e3cee7b +
 SelectChannelConnector@0.0.0.0:4040 as connector

 2, I don't know what's this
 14:05:25.202 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 org.apache.hadoop.security.Groups - Group mapping
 impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
 cacheTimeout=30
 14:05:54.594 [spark-akka.actor.default-dispatcher-4] TRACE
 o.a.s.s.BlockManagerMasterActor - Checking for hosts with no recent heart
 beats in BlockManagerMaster.


 are there any way to make this faster for unit test?
 I also notice in spark's unit test, that there exists a SharedSparkContext
 https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SharedSparkContext.scala,
 is this the suggested way to work around this problem? if so, I would
 suggest to document this in programming guide.



Re: Broadcast error

2014-09-16 Thread Chengi Liu
Cool.. while I try that.. any other suggestion(s) on things I can try?

On Mon, Sep 15, 2014 at 9:59 AM, Davies Liu dav...@databricks.com wrote:

 I think 1.1 will be really helpful for you; it's all compatible
 with 1.0, so it's
 not hard to upgrade to 1.1.

 On Mon, Sep 15, 2014 at 2:35 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:
  So.. same result with parallelize (matrix,1000)
  with broadcast.. seems like I got jvm core dump :-/
  4/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager
 host:47978
  with 19.2 GB RAM
  14/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager
  host:43360 with 19.2 GB RAM
  Unhandled exception
  Unhandled exception
  Type=Segmentation error vmState=0x
  J9Generic_Signal_Number=0004 Signal_Number=000b
 Error_Value=
  Signal_Code=0001
  Handler1=2BF53760 Handler2=2C3069D0
  InaccessibleAddress=
  RDI=2AB9505F2698 RSI=2AABAE2C54D8 RAX=2AB7CE6009A0
  RBX=2AB7CE6009C0
  RCX=FFC7FFE0 RDX=2AB8509726A8 R8=7FE41FF0
  R9=2000
  R10=2DA318A0 R11=2AB850959520 R12=2AB5EF97DD88
  R13=2AB5EF97BD88
  R14=2C0CE940 R15=2AB5EF97BD88
  RIP= GS= FS= RSP=007367A0
  EFlags=00210282 CS=0033 RBP=00BCDB00 ERR=0014
  TRAPNO=000E OLDMASK= CR2=
  xmm0 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm1 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm2 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm3 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm4 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm5 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm6 4141414141414141 (f: 1094795648.00, d: 2.261635e+06)
  xmm7 f180c714f8e2a139 (f: 4175601920.00, d: -5.462583e+238)
  xmm8 428e8000 (f: 1116635136.00, d: 5.516911e-315)
  xmm9  (f: 0.00, d: 0.00e+00)
  xmm10  (f: 0.00, d: 0.00e+00)
  xmm11  (f: 0.00, d: 0.00e+00)
  xmm12  (f: 0.00, d: 0.00e+00)
  xmm13  (f: 0.00, d: 0.00e+00)
  xmm14  (f: 0.00, d: 0.00e+00)
  xmm15  (f: 0.00, d: 0.00e+00)
  Target=2_60_20140106_181350 (Linux 3.0.93-0.8.2_1.0502.8048-cray_ari_c)
  CPU=amd64 (48 logical CPUs) (0xfc0c5b000 RAM)
  --- Stack Backtrace ---
  (0x2C2FA122 [libj9prt26.so+0x13122])
  (0x2C30779F [libj9prt26.so+0x2079f])
  (0x2C2F9E6B [libj9prt26.so+0x12e6b])
  (0x2C2F9F67 [libj9prt26.so+0x12f67])
  (0x2C30779F [libj9prt26.so+0x2079f])
  (0x2C2F9A8B [libj9prt26.so+0x12a8b])
  (0x2BF52C9D [libj9vm26.so+0x1ac9d])
  (0x2C30779F [libj9prt26.so+0x2079f])
  (0x2BF52F56 [libj9vm26.so+0x1af56])
  (0x2BF96CA0 [libj9vm26.so+0x5eca0])
  ---
  JVMDUMP039I
  JVMDUMP032I
 
 
  Note, this still is with the framesize I modified in the last email
 thread?
 
  On Mon, Sep 15, 2014 at 2:12 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:
 
  Try:
 
  rdd = sc.broadcast(matrix)
 
  Or
 
  rdd = sc.parallelize(matrix,100) // Just increase the number of slices,
  give it a try.
 
 
 
  Thanks
  Best Regards
 
  On Mon, Sep 15, 2014 at 2:18 PM, Chengi Liu chengi.liu...@gmail.com
  wrote:
 
  Hi Akhil,
So with your config (specifically with set(spark.akka.frameSize ,
  1000)) , I see the error:
  org.apache.spark.SparkException: Job aborted due to stage failure:
  Serialized task 0:0 was 401970046 bytes which exceeds
 spark.akka.frameSize
  (10485760 bytes). Consider using broadcast variables for large values.
  at
  org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
  at org.apache.spark
 
  So, I changed
  set(spark.akka.frameSize , 1000) to set(spark.akka.frameSize
 ,
  10)
  but now I get the same error?
 
  y4j.protocol.Py4JJavaError: An error occurred while calling
  o28.trainKMeansModel.
  : org.apache.spark.SparkException: Job aborted due to stage failure:
 All
  masters are unresponsive! Giving up.
  at
  org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
  at
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at
 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
  at
 
 

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Christian Chua
Is 1.0.8 working for you ?

You indicated your last known good version is 1.0.0

Maybe we can track down where it broke. 



 On Sep 16, 2014, at 12:25 AM, Paul Wais pw...@yelp.com wrote:
 
 Thanks Christian!  I tried compiling from source but am still getting the 
 same hadoop client version error when reading from HDFS.  Will have to poke 
 deeper... perhaps I've got some classpath issues.  FWIW I compiled using:
 
 $ MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m mvn 
 -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package 
 
 and hadoop 2.3 / cdh5 from 
 http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz
 
 
 
 
 
 On Mon, Sep 15, 2014 at 6:49 PM, Christian Chua cc8...@icloud.com wrote:
 Hi Paul.
 
 I would recommend building your own 1.1.0 distribution.
 
  ./make-distribution.sh --name hadoop-personal-build-2.4 --tgz -Pyarn 
 -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
 
 
 
 I downloaded the Pre-build for Hadoop 2.4 binary, and it had this strange 
 behavior where
 
  spark-submit --master yarn-cluster ...
 
 will work, but
 
  spark-submit --master yarn-client ...
 
 will fail.
 
 
 But on the personal build obtained from the command above, both will then 
 work.
 
 
 -Christian
 
 
 
 
 On Sep 15, 2014, at 6:28 PM, Paul Wais pw...@yelp.com wrote:
 
 Dear List,
 
 I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
 reading SequenceFiles.  In particular, I'm seeing:
 
 Exception in thread main org.apache.hadoop.ipc.RemoteException:
 Server IPC version 7 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
...
 
 When invoking JavaSparkContext#newAPIHadoopFile().  (With args
 validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
 BytesWritable.class, new Job().getConfiguration() -- Pretty close to
 the unit test here:
 https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
 )
 
 
 This error indicates to me that Spark is using an old hadoop client to
 do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
 via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.
 
 
 Do I need to explicitly build spark for modern hadoop??  I previously
 had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
 error (server is using version 9, client is using version 4).
 
 
 I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
 on spark's site:
 * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
 * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz
 
 
 What distro of hadoop is used at Data Bricks?  Are there distros of
 Spark 1.1 and hadoop that should work together out-of-the-box?
 (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)
 
 Thanks for any help anybody can give me here!
 -Paul
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


Re: Spark SQL Thrift JDBC server deployment for production

2014-09-16 Thread vasiliy
it works, thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thrift-JDBC-server-deployment-for-production-tp13947p14345.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
sorry for the disturbance, please ignore this mail.
in the end, I found it was slow because of a lack of memory on my machine..

sorry again.

On Tue, Sep 16, 2014 at 3:26 PM, 诺铁 noty...@gmail.com wrote:

 I connect my sample project to a hosted CI service, it only takes 3
 seconds to run there...while the same tests takes 2minutes on my macbook
 pro.  so maybe this is a mac os specific problem?

 On Tue, Sep 16, 2014 at 3:06 PM, 诺铁 noty...@gmail.com wrote:

 hi,

 I am trying to write some unit test, following spark programming guide
 http://spark.apache.org/docs/latest/programming-guide.html#unit-testing
 .
 but I observed unit test runs very slow(the code is just a SparkPi),  so
 I turn log level to trace and look through the log output.  and found
 creation of SparkContext seems to take most time.

 there are some action take a lot of time:
 1, seems starting jetty
 14:04:55.038 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 o.e.jetty.servlet.ServletHandler -
 servletNameMap={org.apache.spark.ui.JettyUtils$$anon$1-672f11c2=org.apache.spark.ui.JettyUtils$$anon$1-672f11c2}
 14:05:25.121 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 o.e.jetty.util.component.Container - Container
 org.eclipse.jetty.server.Server@e3cee7b +
 SelectChannelConnector@0.0.0.0:4040 as connector

 2, I don't know what's this
 14:05:25.202 [ScalaTest-run-running-MySparkPiSpec] DEBUG
 org.apache.hadoop.security.Groups - Group mapping
 impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
 cacheTimeout=30
 14:05:54.594 [spark-akka.actor.default-dispatcher-4] TRACE
 o.a.s.s.BlockManagerMasterActor - Checking for hosts with no recent heart
 beats in BlockManagerMaster.


 are there any way to make this faster for unit test?
 I also notice in spark's unit test, that there exists a
 SharedSparkContext
 https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SharedSparkContext.scala,
 is this the suggested way to work around this problem? if so, I would
 suggest to document this in programming guide.





Re: Serving data

2014-09-16 Thread Marius Soutier
Writing to Parquet and querying the result via SparkSQL works great (except for 
some strange SQL parser errors). However, the problem remains: how do I get that 
data back to a dashboard? So I guess I'll have to use a database after all.


You can batch up data & store it into parquet partitions as well. Query it using 
another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I believe. 
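
For reference, a minimal Spark 1.1 sketch of that batch-to-Parquet-then-query pattern (the paths and the Event case class are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Event(user: String, amount: Double)

object ParquetServingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-serving-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD (Spark 1.1)

    // batch up data and store it as one Parquet "partition" (directory)
    val batch = sc.parallelize(Seq(Event("a", 1.0), Event("b", 2.5)))
    batch.saveAsParquetFile("hdfs:///data/events/batch-0001.parquet")

    // later -- possibly from another SparkSQL shell or the JDBC server --
    // read the Parquet data back and query it
    val events = sqlContext.parquetFile("hdfs:///data/events/batch-0001.parquet")
    events.registerTempTable("events")
    sqlContext.sql("SELECT user, SUM(amount) FROM events GROUP BY user")
      .collect()
      .foreach(println)

    sc.stop()
  }
}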


RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Cheng, Hao
Thank you for pasting the steps. I will look at this and hopefully come up with a 
solution soon.

-Original Message-
From: linkpatrickliu [mailto:linkpatrick...@live.com] 
Sent: Tuesday, September 16, 2014 3:17 PM
To: u...@spark.incubator.apache.org
Subject: RE: SparkSQL 1.1 hang when DROP or LOAD

Hi, Hao Cheng.

I have done other tests. And the result shows the thriftServer can connect to 
Zookeeper.

However, I found some more interesting things. And I think I have found a bug!

Test procedure:
Test1:
(0) Use beeline to connect to thriftServer.
(1) Switch database use dw_op1; (OK)
The logs show that the thriftServer connected with Zookeeper and acquired locks.

(2) Drop table drop table src; (Blocked) The logs show that the thriftServer 
is acquireReadWriteLocks.

Doubt:
The reason why I cannot drop table src is because the first SQL use dw_op1
have left locks in Zookeeper  unsuccessfully released.
So when the second SQL is acquiring locks in Zookeeper, it will block.

Test2:
Restart thriftServer.
Instead of switching to another database, I just drop the table in the default 
database;
(0) Restart thriftServer  use beeline to connect to thriftServer.
(1) Drop table drop table src; (OK)
Amazing! Succeed!
(2) Drop again!  drop table src2; (Blocked) Same error: the thriftServer is 
blocked in the acquireReadWriteLocks
phrase.

As you can see. 
Only the first SQL requiring locks can succeed.
So I think the reason is that the thriftServer cannot release locks correctly 
in Zookeeper.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Yifan LI
Dear Ankur,

Thanks! :)

- from [1], and my understanding, the existing inactive feature in graphx 
pregel api is “if there is no in-edges, from active vertex, to this vertex, 
then we will say this one is inactive”, right?

For instance, there is a graph in which every vertex has at least one in-edges, 
then we run static Pagerank on it for 10 iterations. During this calculation, 
is there any vertex would be set inactive?


- for more “explicit active vertex tracking”, e.g. vote to halt, how to achieve 
it in existing api?
(I am not sure I got the point of [2], that “vote” function has already been 
introduced in graphx pregel api? )


Best,
Yifan LI

On 15 Sep 2014, at 23:07, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-09-15 16:25:04 +0200, Yifan LI iamyifa...@gmail.com wrote:
 I am wondering if the vertex active/inactive(corresponding the change of its 
 value between two supersteps) feature is introduced in Pregel API of GraphX?
 
 Vertex activeness in Pregel is controlled by messages: if a vertex did not 
 receive a message in the previous iteration, its vertex program will not run 
 in the current iteration. Also, inactive vertices will not be able to send 
 messages because by default the sendMsg function will only be run on edges 
 where at least one of the adjacent vertices received a message. You can 
 change this behavior -- see the documentation for the activeDirection 
 parameter to Pregel.apply [1].
 
 There is also an open pull request to make active vertex tracking more 
 explicit by allowing vertices to vote to halt directly [2].
 
 Ankur
 
 [1] 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Pregel$
 [2] https://github.com/apache/spark/pull/1217



Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 10:55:37 +0200, Yifan LI iamyifa...@gmail.com wrote:
 - from [1], and my understanding, the existing inactive feature in graphx 
 pregel api is “if there is no in-edges, from active vertex, to this vertex, 
 then we will say this one is inactive”, right?

Well, that's true when messages are only sent forward along edges (from the 
source to the destination) and the activeDirection is EdgeDirection.Out. If 
both of these conditions are true, then a vertex without in-edges cannot 
receive a message, and therefore its vertex program will never run and a 
message will never be sent along its out-edges. PageRank is an application that 
satisfies both the conditions.
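
For illustration, a minimal sketch of a Pregel call where messages only flow forward along edges and activeDirection is EdgeDirection.Out (the vertex and message semantics here are placeholders, not the built-in PageRank implementation):

import org.apache.spark.graphx._

// Vertex values and messages are Doubles, messages flow src -> dst only, and
// activeDirection = EdgeDirection.Out means sendMsg is only run on out-edges
// of vertices that received a message in the previous round.
def propagate(graph: Graph[Double, Double], maxIters: Int): Graph[Double, Double] =
  Pregel(graph, initialMsg = 0.0, maxIterations = maxIters,
         activeDirection = EdgeDirection.Out)(
    (id, attr, msg) => attr + msg,                                        // vertex program
    triplet => Iterator((triplet.dstId, triplet.srcAttr * triplet.attr)), // sendMsg
    (a, b) => a + b                                                       // mergeMsg
  )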

 For instance, there is a graph in which every vertex has at least one 
 in-edges, then we run static Pagerank on it for 10 iterations. During this 
 calculation, is there any vertex would be set inactive?

No: since every vertex always sends a message in static PageRank, if every 
vertex has an in-edge, it will always receive a message and will remain active.

In fact, this is why I recently rewrote static PageRank not to use Pregel [3]. 
Assuming that most vertices do have in-edges, it's unnecessary to track active 
vertices, which can provide a big savings.

 - for more “explicit active vertex tracking”, e.g. vote to halt, how to 
 achieve it in existing api?
 (I am not sure I got the point of [2], that “vote” function has already been 
 introduced in graphx pregel api? )

The current Pregel API effectively makes every vertex vote to halt in every 
superstep. Therefore only vertices that receive messages get awoken in the next 
superstep.

Instead, [2] proposes to make every vertex run by default in every superstep 
unless it votes to halt *and* receives no messages. This allows a vertex to 
have more control over whether or not it will run, rather than leaving that 
entirely up to its neighbors.

Ankur

 [1] 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Pregel$
 [2] https://github.com/apache/spark/pull/1217
[3] https://github.com/apache/spark/pull/2308

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 12:23:10 +0200, Yifan LI iamyifa...@gmail.com wrote:
 but I am wondering whether a message (or none?) is sent to the target vertex 
 (when the rank change is less than the tolerance) in the dynamic PageRank implementation below,

  def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
    if (edge.srcAttr._2 > tol) {
      Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
    } else {
      Iterator.empty
    }
  }

 so, in this case, is a message, even an empty one, still sent? or not?

No, in that case no message is sent, and if all in-edges of a particular vertex 
return Iterator.empty, then the vertex will become inactive in the next 
iteration.

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Sean Owen
From the caller / application perspective, you don't care what version
of Hadoop Spark is running on on the cluster. The Spark API you
compile against is the same. When you spark-submit the app, at
runtime, Spark is using the Hadoop libraries from the cluster, which
are the right version.

So when you build your app, you mark Spark as a 'provided' dependency.
Therefore in general, no, you do not build Spark for yourself if you
are a Spark app creator.

(Of course, your app would care if it were also using Hadoop libraries
directly. In that case, you will want to depend on hadoop-client, and
the right version for your cluster, but still mark it as provided.)

The version Spark is built against only matters when you are deploying
Spark's artifacts on the cluster to set it up.

Your error suggests there is still a version mismatch. Either you
deployed a build that was not compatible, or, maybe you are packaging
a version of Spark with your app which is incompatible and
interfering.

For example, the artifacts you get via Maven depend on Hadoop 1.0.4. I
suspect that's what you're doing -- packaging Spark(+Hadoop1.0.4) with
your app, when it shouldn't be packaged.

Spark works out of the box with just about any modern combo of HDFS and YARN.
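
For example, a minimal build.sbt sketch of that arrangement (versions here are illustrative and should match your cluster):

// build.sbt -- the app compiles against Spark but does not package it
name := "my-spark-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark comes from the cluster at runtime, so mark it "provided"
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  // Only needed if the app uses Hadoop APIs directly; match the cluster's
  // version and mark it "provided" as well
  "org.apache.hadoop" % "hadoop-client" % "2.3.0" % "provided"
)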

On Tue, Sep 16, 2014 at 2:28 AM, Paul Wais pw...@yelp.com wrote:
 Dear List,

 I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
 reading SequenceFiles.  In particular, I'm seeing:

 Exception in thread main org.apache.hadoop.ipc.RemoteException:
 Server IPC version 7 cannot communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
 ...

 When invoking JavaSparkContext#newAPIHadoopFile().  (With args
 validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
 BytesWritable.class, new Job().getConfiguration() -- Pretty close to
 the unit test here:
 https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
 )


 This error indicates to me that Spark is using an old hadoop client to
 do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
 via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.


 Do I need to explicitly build spark for modern hadoop??  I previously
 had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
 error (server is using version 9, client is using version 4).


 I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
 on spark's site:
  * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
  * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz


 What distro of hadoop is used at Data Bricks?  Are there distros of
 Spark 1.1 and hadoop that should work together out-of-the-box?
 (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)

 Thanks for any help anybody can give me here!
 -Paul

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to set executor num on spark on yarn

2014-09-16 Thread Sean Owen
How many cores do your machines have? --executor-cores should be the
number of cores each executor uses. Fewer cores means more executors
in general. From your data, it sounds like, for example, there are 7
nodes with 4+ cores available to YARN, and 2 more nodes with 2-3 cores
available. Hence when you ask for fewer cores per executor, more can
fit.

You may need to increase the number of cores you are letting YARN
manage if this doesn't match your expectation. For example, I'm
guessing your machines have more than 4 cores in reality.

On Tue, Sep 16, 2014 at 6:08 AM, hequn cheng chenghe...@gmail.com wrote:
 hi~I want to set the executor number to 16, but it is very strange that
 executor cores may affect executor num on spark on yarn, i don't know why
 and how to set executor number.
 =
 ./bin/spark-submit --class com.hequn.spark.SparkJoins \
 --master yarn-cluster \
 --num-executors 16 \
 --driver-memory 2g \
 --executor-memory 10g \
 --executor-cores 4 \
 /home/sparkjoins-1.0-SNAPSHOT.jar

 The UI shows there are 7 executors
 =
 ./bin/spark-submit --class com.hequn.spark.SparkJoins \
 --master yarn-cluster \
 --num-executors 16 \
 --driver-memory 2g \
 --executor-memory 10g \
 --executor-cores 2 \
 /home/sparkjoins-1.0-SNAPSHOT.jar

 The UI shows there are 9 executors
 =
 ./bin/spark-submit --class com.hequn.spark.SparkJoins \
 --master yarn-cluster \
 --num-executors 16 \
 --driver-memory 2g \
 --executor-memory 10g \
 --executor-cores 1 \
 /home/sparkjoins-1.0-SNAPSHOT.jar

 The UI shows there are 9 executors
 ==
 The cluster contains 16 nodes. Each node 64G RAM.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I have a standalone spark cluster and from within the same scala
application I'm creating 2 different spark contexts to run two different
spark streaming jobs, as the SparkConf is different for each of them.

I'm getting this error that... I don't really understand:

14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
 java.util.NoSuchElementException: spark.httpBroadcast.uri
   at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
   at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
   at scala.collection.AbstractMap.getOrElse(Map.scala:58)
   at org.apache.spark.SparkConf.get(SparkConf.scala:149)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
   at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
   at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
   at 
 org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
   at org.apache.spark.executor.Executor.init(Executor.scala:85)
   at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
 failed: Duplicate executor ID: 10


 Both apps have different names, so I don't get how the executor ID is not
unique :-S

This happens on startup and most of the time only one job dies, but
sometimes both of them die.

Regards,

Luis


Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
It seems that, as I have a single scala application, the scheduler is the
same and there is a collision between executors of both spark contexts. Is
there a way to change how the executor ID is generated (maybe a uuid
instead of a sequential number..?)

2014-09-16 13:07 GMT+01:00 Luis Ángel Vicente Sánchez 
langel.gro...@gmail.com:

 I have a standalone spark cluster and from within the same scala
 application I'm creating 2 different spark context to run two different
 spark streaming jobs as SparkConf is different for each of them.

 I'm getting this error that... I don't really understand:

 14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
 java.util.NoSuchElementException: spark.httpBroadcast.uri
  at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
  at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
  at org.apache.spark.SparkConf.get(SparkConf.scala:149)
  at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
  at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
  at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
  at 
 org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
  at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
  at org.apache.spark.executor.Executor.init(Executor.scala:85)
  at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
 failed: Duplicate executor ID: 10


  Both apps has different names so I don't get how the executor ID is not
 unique :-S

 This happens on startup and most of the time only one job dies, but
 sometimes both of them die.

 Regards,

 Luis



Re: PySpark on Yarn - how group by data properly

2014-09-16 Thread Oleg Ruchovets
I am expanding my data set and executing pyspark on yarn:
   I noticed that only 2 processes processed the data:

14210 yarn  20   0 2463m 2.0g 9708 R 100.0  4.3   8:22.63 python2.7
32467 yarn  20   0 2519m 2.1g 9720 R 99.3  4.4   7:16.97   python2.7


*Question:*
   *how to configure pyspark to have more processes to process the data?*


Here is my command :

  [hdfs@UCS-MASTER cad]$
/usr/lib/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563/bin/spark-submit
--master yarn  --num-executors 12  --driver-memory 4g --executor-memory 2g
--py-files tad.zip --executor-cores 4   /usr/lib/cad/PrepareDataSetYarn.py
/input/tad/data.csv /output/cad_model_500_1

I tried to play with num-executors and executor-cores but there are still only 2
python processes doing the job. I have a 5-machine cluster with 32 GB RAM.

console output:


14/09/16 20:07:34 INFO spark.SecurityManager: Changing view acls to: hdfs
14/09/16 20:07:34 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(hdfs)
14/09/16 20:07:34 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/09/16 20:07:35 INFO Remoting: Starting remoting
14/09/16 20:07:35 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@UCS-MASTER.sms1.local:39379]
14/09/16 20:07:35 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@UCS-MASTER.sms1.local:39379]
14/09/16 20:07:35 INFO spark.SparkEnv: Registering MapOutputTracker
14/09/16 20:07:35 INFO spark.SparkEnv: Registering BlockManagerMaster
14/09/16 20:07:35 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140916200735-53f6
14/09/16 20:07:35 INFO storage.MemoryStore: MemoryStore started with
capacity 2.3 GB.
14/09/16 20:07:35 INFO network.ConnectionManager: Bound socket to port
37255 with id = ConnectionManagerId(UCS-MASTER.sms1.local,37255)
14/09/16 20:07:35 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/09/16 20:07:35 INFO storage.BlockManagerInfo: Registering block manager
UCS-MASTER.sms1.local:37255 with 2.3 GB RAM
14/09/16 20:07:35 INFO storage.BlockManagerMaster: Registered BlockManager
14/09/16 20:07:35 INFO spark.HttpServer: Starting HTTP Server
14/09/16 20:07:35 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/16 20:07:35 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:55286
14/09/16 20:07:35 INFO broadcast.HttpBroadcast: Broadcast server started at
http://10.193.218.2:55286
14/09/16 20:07:35 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-ca8193be-9148-4e7e-a0cc-4b6e7cb72172
14/09/16 20:07:35 INFO spark.HttpServer: Starting HTTP Server
14/09/16 20:07:35 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/16 20:07:35 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:38065
14/09/16 20:07:35 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/16 20:07:35 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/09/16 20:07:35 INFO ui.SparkUI: Started SparkUI at
http://UCS-MASTER.sms1.local:4040
14/09/16 20:07:36 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
--args is deprecated. Use --arg instead.
14/09/16 20:07:36 INFO impl.TimelineClientImpl: Timeline service address:
http://UCS-NODE1.sms1.local:8188/ws/v1/timeline/
14/09/16 20:07:37 INFO client.RMProxy: Connecting to ResourceManager at
UCS-NODE1.sms1.local/10.193.218.3:8050
14/09/16 20:07:37 INFO yarn.Client: Got Cluster metric info from
ApplicationsManager (ASM), number of NodeManagers: 5
14/09/16 20:07:37 INFO yarn.Client: Queue info ... queueName: default,
queueCurrentCapacity: 0.0, queueMaxCapacity: 1.0,
  queueApplicationCount = 0, queueChildQueueCount = 0
14/09/16 20:07:37 INFO yarn.Client: Max mem capabililty of a single
resource in this cluster 53248
14/09/16 20:07:37 INFO yarn.Client: Preparing Local resources
14/09/16 20:07:37 WARN hdfs.BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
14/09/16 20:07:37 INFO yarn.Client: Uploading
file:/usr/lib/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563/lib/spark-assembly-1.0.1.2.1.3.0-563-hadoop2.4.0.2.1.3.0-563.jar
to
hdfs://UCS-MASTER.sms1.local:8020/user/hdfs/.sparkStaging/application_1409564765875_0046/spark-assembly-1.0.1.2.1.3.0-563-hadoop2.4.0.2.1.3.0-563.jar
14/09/16 20:07:39 INFO yarn.Client: Uploading
file:/usr/lib/cad/PrepareDataSetYarn.py to
hdfs://UCS-MASTER.sms1.local:8020/user/hdfs/.sparkStaging/application_1409564765875_0046/PrepareDataSetYarn.py
14/09/16 20:07:39 INFO yarn.Client: Uploading file:/usr/lib/cad/tad.zip to
hdfs://UCS-MASTER.sms1.local:8020/user/hdfs/.sparkStaging/application_1409564765875_0046/tad.zip
14/09/16 20:07:39 INFO yarn.Client: Setting up the launch environment
14/09/16 20:07:39 INFO yarn.Client: Setting up container launch context
14/09/16 20:07:39 INFO yarn.Client: Command for starting the Spark
ApplicationMaster: 

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
When I said scheduler I meant executor backend.

2014-09-16 13:26 GMT+01:00 Luis Ángel Vicente Sánchez 
langel.gro...@gmail.com:

 It seems that, as I have a single scala application, the scheduler is the
 same and there is a collision between executors of both spark context. Is
 there a way to change how the executor ID is generated (maybe an uuid
 instead of a sequential number..?)

 2014-09-16 13:07 GMT+01:00 Luis Ángel Vicente Sánchez 
 langel.gro...@gmail.com:

 I have a standalone spark cluster and from within the same scala
 application I'm creating 2 different spark context to run two different
 spark streaming jobs as SparkConf is different for each of them.

 I'm getting this error that... I don't really understand:

 14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
 java.util.NoSuchElementException: spark.httpBroadcast.uri
 at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
 at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
 at scala.collection.AbstractMap.getOrElse(Map.scala:58)
 at org.apache.spark.SparkConf.get(SparkConf.scala:149)
 at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
 at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
 at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
 at 
 org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
 at org.apache.spark.executor.Executor.init(Executor.scala:85)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
 failed: Duplicate executor ID: 10


  Both apps has different names so I don't get how the executor ID is not
 unique :-S

 This happens on startup and most of the time only one job dies, but
 sometimes both of them die.

 Regards,

 Luis





Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I dug a bit more and the executor ID is a number, so it seems there is no
possible workaround.

Looking at the code of the CoarseGrainedSchedulerBackend.scala:

https://github.com/apache/spark/blob/6324eb7b5b0ae005cb2e913e36b1508bd6f1b9b8/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L86

It seems there is only one DriverActor and, as the RegisterExecutor message
only contains the executorID, there is a collision.

I wonder if what I'm doing is wrong... basically, from the same scala
application I...

1. create one actor per job.
2. send a message to each actor with configuration details to create a
SparkContext/StreamingContext.
3. send a message to each actor to start the job and streaming context.
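
Roughly, a sketch of that setup (host names, ports and app names are illustrative; this reproduces the reported scenario rather than a recommended pattern):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoStreamingJobs {
  // Each job gets its own SparkConf and StreamingContext inside the same JVM,
  // which is the situation that shows the duplicate executor IDs.
  def startJob(appName: String): StreamingContext = {
    val conf = new SparkConf()
      .setMaster("spark://spark-master:7077")
      .setAppName(appName)
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print() // some output operation so the streaming job can start
    ssc.start()
    ssc
  }

  def main(args: Array[String]): Unit = {
    val jobA = startJob("streaming-job-a")
    val jobB = startJob("streaming-job-b")
    jobA.awaitTermination()
  }
}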

2014-09-16 13:29 GMT+01:00 Luis Ángel Vicente Sánchez 
langel.gro...@gmail.com:

 When I said scheduler I meant executor backend.

 2014-09-16 13:26 GMT+01:00 Luis Ángel Vicente Sánchez 
 langel.gro...@gmail.com:

 It seems that, as I have a single scala application, the scheduler is the
 same and there is a collision between executors of both spark context. Is
 there a way to change how the executor ID is generated (maybe an uuid
 instead of a sequential number..?)

 2014-09-16 13:07 GMT+01:00 Luis Ángel Vicente Sánchez 
 langel.gro...@gmail.com:

 I have a standalone spark cluster and from within the same scala
 application I'm creating 2 different spark context to run two different
 spark streaming jobs as SparkConf is different for each of them.

 I'm getting this error that... I don't really understand:

 14/09/16 11:51:35 ERROR OneForOneStrategy: spark.httpBroadcast.uri
 java.util.NoSuchElementException: spark.httpBroadcast.uri
at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
at org.apache.spark.SparkConf$$anonfun$get$1.apply(SparkConf.scala:149)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:58)
at org.apache.spark.SparkConf.get(SparkConf.scala:149)
at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:130)
at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at 
 org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.executor.Executor.init(Executor.scala:85)
at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:59)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/09/16 11:51:35 ERROR CoarseGrainedExecutorBackend: Slave registration 
 failed: Duplicate executor ID: 10


  Both apps have different names, so I don't get how the executor ID is not
 unique :-S

 This happens on startup and most of the time only one job dies, but
 sometimes both of them die.

 Regards,

 Luis






Reduce Tuple2&lt;Integer, Integer&gt; to Tuple2&lt;Integer, List&lt;Integer&gt;&gt;

2014-09-16 Thread Tom
From my map function I create Tuple2<Integer, Integer> pairs. Now I want to
reduce them, and get something like Tuple2<Integer, List<Integer>>.

The only way I found to do this was by treating all variables as String, and
in the reduceByKey do
/return a._2 + "," + b._2/ //in which both are numeric values saved in a
String
After which I do an Arrays.asList(string.split(",")) in mapValues. This
leaves me with <String, List<Integer>>. So now I am looking for either
- A function with which I can transform <String, List<Integer>> to
<Integer, List<Integer>>
or
- A way to reduce Tuple2<Integer, Integer> into a Tuple2<Integer,
List<Integer>> in the reduceByKey function so that I can use Integers all
the way

Of course option two would have preference.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reduce-Tuple2-Integer-Integer-to-Tuple2-Integer-List-Integer-tp14361.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Reduce Tuple2&lt;Integer, Integer&gt; to Tuple2&lt;Integer, List&lt;Integer&gt;&gt;

2014-09-16 Thread Sean Owen
If you mean you have (key, value) pairs, and want pairs of (key, all values
for that key), then you're looking for groupByKey.
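A minimal sketch of that in Scala, assuming an existing SparkContext sc (e.g.
the spark-shell one); the Java API's JavaPairRDD has an equivalent groupByKey:

import org.apache.spark.SparkContext._   // PairRDDFunctions implicits on 1.x

val pairs = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))
val grouped = pairs.groupByKey()                        // RDD[(Int, Iterable[Int])]
grouped.mapValues(_.toList).collect().foreach(println)
// (1,List(10, 20))
// (2,List(30))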

On Tue, Sep 16, 2014 at 2:42 PM, Tom thubregt...@gmail.com wrote:
 From my map function I create Tuple2<Integer, Integer> pairs. Now I want to
 reduce them, and get something like Tuple2<Integer, List<Integer>>.

 The only way I found to do this was by treating all variables as String, and
 in the reduceByKey do
 /return a._2 + "," + b._2/ //in which both are numeric values saved in a
 String
 After which I do an Arrays.asList(string.split(",")) in mapValues. This
 leaves me with <String, List<Integer>>. So now I am looking for either
 - A function with which I can transform <String, List<Integer>> to
 <Integer, List<Integer>>
 or
 - A way to reduce Tuple2<Integer, Integer> into a Tuple2<Integer,
 List<Integer>> in the reduceByKey function so that I can use Integers all
 the way

 Of course option two would have preference.

 Thanks!



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Reduce-Tuple2-Integer-Integer-to-Tuple2-Integer-List-Integer-tp14361.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Serving data

2014-09-16 Thread Yana Kadiyska
If your dashboard is doing ajax/pull requests against say a REST API you
can always create a Spark context in your rest service and use SparkSQL to
query over the parquet files. The parquet files are already on disk so it
seems silly to write both to parquet and to a DB...unless I'm missing
something in your setup.
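A minimal sketch of that approach with Spark 1.1-era SparkSQL (the parquet
path, table name and query below are illustrative, not from the original
message):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Create one long-lived context inside the REST service and reuse it per request.
val sc = new SparkContext(new SparkConf().setAppName("dashboard-backend"))
val sqlContext = new SQLContext(sc)

val metrics = sqlContext.parquetFile("hdfs:///data/metrics.parquet")  // illustrative path
metrics.registerTempTable("metrics")
val rows = sqlContext.sql("SELECT key, SUM(value) FROM metrics GROUP BY key").collect()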

On Tue, Sep 16, 2014 at 4:18 AM, Marius Soutier mps@gmail.com wrote:

 Writing to Parquet and querying the result via SparkSQL works great
 (except for some strange SQL parser errors). However the problem remains,
 how do I get that data back to a dashboard. So I guess I’ll have to use a
 database after all.


  You can batch up data & store it into parquet partitions as well, & query
  it using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1 I
  believe.




java.util.NoSuchElementException: key not found

2014-09-16 Thread Brad Miller
Hi All,

I suspect I am experiencing a bug. I've noticed that while running
larger jobs, they occasionally die with the exception
java.util.NoSuchElementException: key not found xyz, where xyz
denotes the ID of some particular task.  I've excerpted the log from
one job that died in this way below and attached the full log for
reference.

I suspect that my bug is the same as SPARK-2002 (linked below).  Is
there any reason to suspect otherwise?  Is there any known workaround
other than not coalescing?
https://issues.apache.org/jira/browse/SPARK-2002
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCAMwrk0=d1dww5fdbtpkefwokyozltosbbjqamsqqjowlzng...@mail.gmail.com%3E

Note that I have been coalescing SchemaRDDs using srdd =
SchemaRDD(srdd._jschema_rdd.coalesce(partitions, False, None),
sqlCtx), the workaround described in this thread.
http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3ccanr-kkciei17m43-yz5z-pj00zwpw3ka_u7zhve2y7ejw1v...@mail.gmail.com%3E

...
14/09/15 21:43:14 INFO scheduler.TaskSetManager: Starting task 78.0 in
stage 551.0 (TID 78738, bennett.research.intel-research.net,
PROCESS_LOCAL, 1056 bytes)
...
14/09/15 21:43:15 INFO storage.BlockManagerInfo: Added
taskresult_78738 in memory on
bennett.research.intel-research.net:38074 (size: 13.0 MB, free: 1560.8
MB)
...
14/09/15 21:43:15 ERROR scheduler.TaskResultGetter: Exception while
getting task result
java.util.NoSuchElementException: key not found: 78738
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.scheduler.TaskSetManager.handleTaskGettingResult(TaskSetManager.scala:500)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.handleTaskGettingResult(TaskSchedulerImpl.scala:348)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:52)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)


I am running the pre-compiled 1.1.0 binaries.

best,
-Brad

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist)

2014-09-16 Thread Hui Li
Hi,

I am new to SPARK. I just set up a small cluster and wanted to run some
simple MLLIB examples. By following the instructions of
https://spark.apache.org/docs/0.9.0/mllib-guide.html#binary-classification-1,
I could successfully run everything until the step of SVMWithSGD, where I got
the following error message. I don't know why
file:/root/test/sample_svm_data.txt does not exist, since I already read it
out, printed it, converted it into labeled data, and passed the parsed
data to the function SVMWithSGD.

Any one have the same issue with me?

Thanks,

Emily

 val model = SVMWithSGD.train(parsedData, numIterations)
14/09/16 10:55:21 INFO SparkContext: Starting job: first at
GeneralizedLinearAlgorithm.scala:121
14/09/16 10:55:21 INFO DAGScheduler: Got job 11 (first at
GeneralizedLinearAlgorithm.scala:121) with 1 output partitions
(allowLocal=true)
14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 11 (first at
GeneralizedLinearAlgorithm.scala:121)
14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
14/09/16 10:55:21 INFO DAGScheduler: Computing the requested partition
locally
14/09/16 10:55:21 INFO HadoopRDD: Input split:
file:/root/test/sample_svm_data.txt:0+19737
14/09/16 10:55:21 INFO SparkContext: Job finished: first at
GeneralizedLinearAlgorithm.scala:121, took 0.002697478 s
14/09/16 10:55:21 INFO SparkContext: Starting job: count at
DataValidators.scala:37
14/09/16 10:55:21 INFO DAGScheduler: Got job 12 (count at
DataValidators.scala:37) with 2 output partitions (allowLocal=false)
14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 12 (count at
DataValidators.scala:37)
14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
14/09/16 10:55:21 INFO DAGScheduler: Submitting Stage 12 (FilteredRDD[26]
at filter at DataValidators.scala:37), which has no missing parents
14/09/16 10:55:21 INFO DAGScheduler: Submitting 2 missing tasks from Stage
12 (FilteredRDD[26] at filter at DataValidators.scala:37)
14/09/16 10:55:21 INFO TaskSchedulerImpl: Adding task set 12.0 with 2 tasks
14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:0 as TID 24 on
executor 2: eecvm0206.demo.sas.com (PROCESS_LOCAL)
14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:0 as 1733 bytes
in 0 ms
14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:1 as TID 25 on
executor 5: eecvm0203.demo.sas.com (PROCESS_LOCAL)
14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:1 as 1733 bytes
in 0 ms
14/09/16 10:55:21 WARN TaskSetManager: Lost TID 24 (task 12.0:0)
14/09/16 10:55:21 WARN TaskSetManager: Loss was due to
java.io.FileNotFoundException
java.io.FileNotFoundException: File file:/root/test/sample_svm_data.txt
does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:137)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at
org.apache.hadoop.mapred.LineRecordReader.init(LineRecordReader.java:108)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at
org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at

Spark as a Library

2014-09-16 Thread Ruebenacker, Oliver A

 Hello,

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 Best, Oliver

Oliver Ruebenacker | Solutions Architect

Altisource(tm)
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617) 728-5582 | ext: 275585
oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com | 
www.Altisource.com

***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***


collect on hadoopFile RDD returns wrong results

2014-09-16 Thread vasiliy
Hello. I have a hadoopFile RDD and I tried to collect items to the driver
program, but it returns me an array of identical records (equal to the last
record of my file). My code is like this:

val rdd = sc.hadoopFile(
  "hdfs:///data.avro",
  classOf[org.apache.avro.mapred.AvroInputFormat[MyAvroRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[MyAvroRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  10)


val collectedData = rdd.collect()

for (s <- collectedData) {
  println(s)
}

it prints wrong data. But rdd.foreach(println) works as expected. 

What is wrong with my code, and how can I collect the hadoop RDD files (actually
I want to collect parts of it) to the driver program?
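For what it's worth, a common cause of this symptom is that Hadoop input
formats reuse a single wrapper object per record, so collecting the raw
AvroWrapper instances can yield many references to one mutated object. A
hedged sketch of copying a plain value out of each record before collecting,
reusing the rdd defined above (the copy step is the assumption here):

// Map each record to a plain value (a new object) before collecting:
val collectedStrings = rdd
  .map { case (wrapper, _) => wrapper.datum().toString }  // copy out of the reused wrapper
  .collect()

collectedStrings.foreach(println)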




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/collect-on-hadoopFile-RDD-returns-wrong-results-tp14368.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



HBase and non-existent TableInputFormat

2014-09-16 Thread Y. Dong
Hello,

I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read 
from hbase. The Java code is attached. However the problem is TableInputFormat 
does not even exist in hbase-client API, is there any other way I can read from
hbase? Thanks

SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);

Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");

JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);





Re: combineByKey throws ClassCastException

2014-09-16 Thread Tao Xiao
This problem was caused by the fact that I used a package jar with a Spark
version (0.9.1) different from that of the cluster (0.9.0). When I used the
correct package jar
(spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar) instead the
application can run as expected.



2014-09-15 14:57 GMT+08:00 x wasedax...@gmail.com:

 How about this.

 scala> val rdd2 = rdd.combineByKey(
      |   (v: Int) => v.toLong,
      |   (c: Long, v: Int) => c + v,
      |   (c1: Long, c2: Long) => c1 + c2)
 rdd2: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[9] at
 combineByKey at <console>:14

 xj @ Tokyo

 On Mon, Sep 15, 2014 at 3:06 PM, Tao Xiao xiaotao.cs@gmail.com
 wrote:

 I followed an example presented in the tutorial Learning Spark
 http://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
 to compute the per-key average as follows:


 val Array(appName) = args
 val sparkConf = new SparkConf()
   .setAppName(appName)
 val sc = new SparkContext(sparkConf)
 /*
  * compute the per-key average of values
  * results should be:
  *   A : 5.8
  *   B : 14
  *   C : 60.6
  */
 val rdd = sc.parallelize(List(
   ("A", 3), ("A", 9), ("A", 12), ("A", 0), ("A", 5),
   ("B", 4), ("B", 10), ("B", 11), ("B", 20), ("B", 25),
   ("C", 32), ("C", 91), ("C", 122), ("C", 3), ("C", 55)), 2)
 val avg = rdd.combineByKey(
   (x: Int) => (x, 1),  // java.lang.ClassCastException: scala.Tuple2$mcII$sp
                        // cannot be cast to java.lang.Integer
   (acc: (Int, Int), x) => (acc._1 + x, acc._2 + 1),
   (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
   .map { case (s, t) => (s, t._1 / t._2.toFloat) }
 avg.collect.foreach(t => println(t._1 + " - " + t._2))



 When I submitted the application, an exception of 
 *java.lang.ClassCastException:
 scala.Tuple2$mcII$sp cannot be cast to java.lang.Integer* was thrown
 out. The tutorial said that the first function of *combineByKey*, *(x: Int)
 => (x, 1)*, should take a single element in the source RDD and return an
 element of the desired type in the resulting RDD. In my application, we
 take a single element of type *Int *from the source RDD and return a
 tuple of type (*Int*, *Int*), which meets the requirements quite well.
 But why would such an exception be thrown?

 I'm using CDH 5.0 and Spark 0.9

 Thanks.






Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Ted Yu
bq. TableInputFormat does not even exist in hbase-client API

It is in hbase-server module.

Take a look at http://hbase.apache.org/book.html#mapreduce.example.read

On Tue, Sep 16, 2014 at 8:18 AM, Y. Dong tq00...@gmail.com wrote:

 Hello,

 I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply
 read from hbase. The Java code is attached. However the problem is
 TableInputFormat does not even exist in hbase-client API, is there any
 other way I can read from
 hbase? Thanks

 SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
 JavaSparkContext sc = new JavaSparkContext(sconf);


 Configuration conf = HBaseConfiguration.create();
 conf.set(TableInputFormat.INPUT_TABLE, "Article");


 JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
 sc.newAPIHadoopRDD(conf,
 TableInputFormat.class,
 org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
 org.apache.hadoop.hbase.client.Result.class);






Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Ted Yu
hbase-client module serves client facing APIs.
hbase-server module is supposed to host classes used on server side.

There is still some work to be done so that the above goal is achieved.

On Tue, Sep 16, 2014 at 9:06 AM, Y. Dong tq00...@gmail.com wrote:

 Thanks Ted. It is indeed in hbase-server. Just curious, what’s the
 difference between hbase-client and hbase-server?

 On 16 Sep 2014, at 17:01, Ted Yu yuzhih...@gmail.com wrote:

 bq. TableInputFormat does not even exist in hbase-client API

 It is in hbase-server module.

 Take a look at http://hbase.apache.org/book.html#mapreduce.example.read

 On Tue, Sep 16, 2014 at 8:18 AM, Y. Dong tq00...@gmail.com wrote:

 Hello,

 I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply
 read from hbase. The Java code is attached. However the problem is
 TableInputFormat does not even exist in hbase-client API, is there any
 other way I can read from
 hbase? Thanks

 SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
 JavaSparkContext sc = new JavaSparkContext(sconf);

 Configuration conf = HBaseConfiguration.create();
 conf.set(TableInputFormat.INPUT_TABLE, "Article");

 JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
 sc.newAPIHadoopRDD(conf,
 TableInputFormat.class,
 org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
 org.apache.hadoop.hbase.client.Result.class);








Re: scala 2.11?

2014-09-16 Thread Mohit Jaggi
Can I load that plugin in spark-shell? Or perhaps due to the 2-phase
compilation quasiquotes won't work in the shell?

On Mon, Sep 15, 2014 at 7:15 PM, Mark Hamstra m...@clearstorydata.com
wrote:

 Okay, that's consistent with what I was expecting.  Thanks, Matei.

 On Mon, Sep 15, 2014 at 5:20 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 I think the current plan is to put it in 1.2.0, so that's what I meant by
 soon. It might be possible to backport it too, but I'd be hesitant to do
 that as a maintenance release on 1.1.x and 1.0.x since it would require
 nontrivial changes to the build that could break things on Scala 2.10.

 Matei

 On September 15, 2014 at 12:19:04 PM, Mark Hamstra (
 m...@clearstorydata.com) wrote:

  Are we going to put 2.11 support into 1.1 or 1.0?  Else "will be in
  soon" applies to the master development branch, but actually being in the Spark
  1.2.0 release won't occur until the second half of November at the earliest.

 On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  Scala 2.11 work is under way in open pull requests though, so
 hopefully it will be in soon.

  Matei

 On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com)
 wrote:

  ah...thanks!

 On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com
 wrote:

 No, not yet.  Spark SQL is using org.scalamacros:quasiquotes_2.10.

 On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi mohitja...@gmail.com
 wrote:

 Folks,
 I understand Spark SQL uses quasiquotes. Does that mean Spark has now
 moved to Scala 2.11?

 Mohit.








RE: HBase and non-existent TableInputFormat

2014-09-16 Thread abraham.jacob
Hi,

I had a similar situation in which I needed to read data from HBase and work 
with the data inside of a spark context. After much googling, I finally got 
mine to work. There are a bunch of steps that you need to do to get this working -

The problem is that the spark context does not know anything about hbase, so 
you have to provide all the information about hbase classes to both the driver 
code and executor code...


SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);

sconf.set("spark.executor.extraClassPath", "$(hbase classpath)");
// <=== you will need to add this to tell the executor about the classpath
// for HBase.


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);


Then when you submit the spark job -


spark-submit --driver-class-path $(hbase classpath) --jars 
/usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar
 --class YourClassName --master local App.jar


Try this and see if it works for you.


From: Y. Dong [mailto:tq00...@gmail.com]
Sent: Tuesday, September 16, 2014 8:18 AM
To: user@spark.apache.org
Subject: HBase and non-existent TableInputFormat

Hello,

I'm currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read 
from hbase. The Java code is attached. However the problem is TableInputFormat 
does not even exist in hbase-client API, is there any other way I can read from
hbase? Thanks

SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);





Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Nicholas Chammas
Btw, there are some examples in the Spark GitHub repo that you may find
helpful. Here's one
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
related to HBase.

On Tue, Sep 16, 2014 at 1:22 PM, abraham.ja...@thomsonreuters.com wrote:

  *Hi, *



  *I had a similar situation in which I needed to read data from HBase and
  work with the data inside of a spark context. After much googling, I
  finally got mine to work. There are a bunch of steps that you need to do
  to get this working – *



 *The problem is that the spark context does not know anything about hbase,
 so you have to provide all the information about hbase classes to both the
 driver code and executor code…*





 SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");

 JavaSparkContext sc = new JavaSparkContext(sconf);


 sconf.set("spark.executor.extraClassPath", "$(hbase classpath)");
 // <=== you will need to add this to tell the executor about the classpath for
 // HBase.


 Configuration conf = HBaseConfiguration.create();

 conf.set(TableInputFormat.INPUT_TABLE, "Article");


 JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
 TableInputFormat.class,
 org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
 org.apache.hadoop.hbase.client.Result.class);





 *Then when you submit the spark job – *





 *spark-submit --driver-class-path $(hbase classpath) --jars
 /usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar
 --class YourClassName --master local App.jar *





 Try this and see if it works for you.





 *From:* Y. Dong [mailto:tq00...@gmail.com]
 *Sent:* Tuesday, September 16, 2014 8:18 AM
 *To:* user@spark.apache.org
 *Subject:* HBase and non-existent TableInputFormat



 Hello,



 I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply
 read from hbase. The Java code is attached. However the problem is
 TableInputFormat does not even exist in hbase-client API, is there any
 other way I can read from

 hbase? Thanks



 SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");

 JavaSparkContext sc = new JavaSparkContext(sconf);


 Configuration conf = HBaseConfiguration.create();

 conf.set(TableInputFormat.INPUT_TABLE, "Article");


 JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
 TableInputFormat.class,
 org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
 org.apache.hadoop.hbase.client.Result.class);









Re: Spark as a Library

2014-09-16 Thread Matei Zaharia
If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.

Matei

On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.com) wrote:

 

 Hello,

 

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 

 Best, Oliver

 

Oliver Ruebenacker | Solutions Architect

 

Altisource™

290 Congress St, 7th Floor | Boston, Massachusetts 02210

P: (617) 728-5582 | ext: 275585

oliver.ruebenac...@altisource.com | www.Altisource.com

 

***
This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***

Re: Spark as a Library

2014-09-16 Thread Soumya Simanta
It depends on what you want to do with Spark. The following has worked for
me.
Let the container handle the HTTP request and then talk to Spark using
another HTTP/REST interface. You can use the Spark Job Server for this.

Embedding Spark inside the container is not a great long term solution IMO
because you may see issues when you want to connect with a Spark cluster.



On Tue, Sep 16, 2014 at 11:16 AM, Ruebenacker, Oliver A 
oliver.ruebenac...@altisource.com wrote:



  Hello,



   Suppose I want to use Spark from an application that I already submit to
 run in another container (e.g. Tomcat). Is this at all possible? Or do I
 have to split the app into two components, and submit one to Spark and one
 to the other container? In that case, what is the preferred way for the two
 components to communicate with each other? Thanks!



  Best, Oliver



 Oliver Ruebenacker | Solutions Architect



 Altisource™

 290 Congress St, 7th Floor | Boston, Massachusetts 02210

 P: (617) 728-5582 | ext: 275585

 oliver.ruebenac...@altisource.com | www.Altisource.com




 ***

 This email message and any attachments are intended solely for the use of
 the addressee. If you are not the intended recipient, you are prohibited
 from reading, disclosing, reproducing, distributing, disseminating or
 otherwise using this transmission. If you have received this message in
 error, please promptly notify the sender by reply email and immediately
 delete this message from your system.
 This message and any attachments may contain information that is
 confidential, privileged or exempt from disclosure. Delivery of this
 message to any person other than the intended recipient is not intended to
 waive any right or privilege. Message transmission is not guaranteed to be
 secure or free of software viruses.

 ***



RE: HBase and non-existent TableInputFormat

2014-09-16 Thread abraham.jacob
Yes that was very helpful… ☺

Here are a few more I found on my quest to get HBase working with Spark –

This one details about Hbase dependencies and spark classpaths

http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html

This one has a code overview –

http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase

All of them were very helpful…



From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
Sent: Tuesday, September 16, 2014 10:30 AM
To: Jacob, Abraham (FinancialRisk)
Cc: tq00...@gmail.com; user
Subject: Re: HBase and non-existent TableInputFormat

Btw, there are some examples in the Spark GitHub repo that you may find 
helpful. Here's 
onehttps://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
 related to HBase.

On Tue, Sep 16, 2014 at 1:22 PM, 
abraham.ja...@thomsonreuters.commailto:abraham.ja...@thomsonreuters.com 
wrote:
Hi,

I had a similar situation in which I needed to read data from HBase and work 
with the data inside of a spark context. After much googling, I finally got 
mine to work. There are a bunch of steps that you need to do to get this working –

The problem is that the spark context does not know anything about hbase, so 
you have to provide all the information about hbase classes to both the driver 
code and executor code…


SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);

sconf.set("spark.executor.extraClassPath", "$(hbase classpath)");
// <=== you will need to add this to tell the executor about the classpath
// for HBase.


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);


Then when you submit the spark job –


spark-submit --driver-class-path $(hbase classpath) --jars 
/usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar
 --class YourClassName --master local App.jar


Try this and see if it works for you.


From: Y. Dong [mailto:tq00...@gmail.commailto:tq00...@gmail.com]
Sent: Tuesday, September 16, 2014 8:18 AM
To: user@spark.apache.orgmailto:user@spark.apache.org
Subject: HBase and non-existent TableInputFormat

Hello,

I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read 
from hbase. The Java code is attached. However the problem is TableInputFormat 
does not even exist in hbase-client API, is there any other way I can read from
hbase? Thanks

SparkConf sconf = new SparkConf().setAppName("App").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sconf);


Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "Article");


JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf,
TableInputFormat.class, org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);






Re: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?

On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao hao.ch...@intel.com wrote:

 Thank you for pasting the steps, I will look at this, hopefully come out
 with a solution soon.

 -Original Message-
 From: linkpatrickliu [mailto:linkpatrick...@live.com]
 Sent: Tuesday, September 16, 2014 3:17 PM
 To: u...@spark.incubator.apache.org
 Subject: RE: SparkSQL 1.1 hang when DROP or LOAD

 Hi, Hao Cheng.

 I have done other tests. And the result shows the thriftServer can connect
 to Zookeeper.

 However, I found some more interesting things. And I think I have found a
 bug!

 Test procedure:
 Test1:
 (0) Use beeline to connect to thriftServer.
 (1) Switch database: "use dw_op1;" (OK)
 The logs show that the thriftServer connected with Zookeeper and acquired
 locks.

 (2) Drop table: "drop table src;" (Blocked) The logs show that the
 thriftServer is in acquireReadWriteLocks.

 Doubt:
 The reason why I cannot drop table src is that the first SQL ("use dw_op1;")
 left locks in Zookeeper that were not successfully released.
 So when the second SQL is acquiring locks in Zookeeper, it will block.

 Test2:
 Restart thriftServer.
 Instead of switching to another database, I just drop the table in the
 default database:
 (0) Restart thriftServer & use beeline to connect to thriftServer.
 (1) Drop table: "drop table src;" (OK)
 Amazing! Succeed!
 (2) Drop again! "drop table src2;" (Blocked) Same error: the thriftServer
 is blocked in the acquireReadWriteLocks
 phase.

 As you can see,
 only the first SQL requiring locks can succeed.
 So I think the reason is that the thriftServer cannot release locks
 correctly in Zookeeper.









 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Yin Huai
I meant it may be a Hive bug since we also call Hive's drop table
internally.

On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai huaiyin@gmail.com wrote:

 Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?

 On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao hao.ch...@intel.com wrote:

 Thank you for pasting the steps, I will look at this, hopefully come out
 with a solution soon.

 -Original Message-
 From: linkpatrickliu [mailto:linkpatrick...@live.com]
 Sent: Tuesday, September 16, 2014 3:17 PM
 To: u...@spark.incubator.apache.org
 Subject: RE: SparkSQL 1.1 hang when DROP or LOAD

 Hi, Hao Cheng.

 I have done other tests. And the result shows the thriftServer can
 connect to Zookeeper.

 However, I found some more interesting things. And I think I have found a
 bug!

 Test procedure:
 Test1:
 (0) Use beeline to connect to thriftServer.
 (1) Switch database: "use dw_op1;" (OK)
 The logs show that the thriftServer connected with Zookeeper and acquired
 locks.

 (2) Drop table: "drop table src;" (Blocked) The logs show that the
 thriftServer is in acquireReadWriteLocks.

 Doubt:
 The reason why I cannot drop table src is that the first SQL ("use dw_op1;")
 left locks in Zookeeper that were not successfully released.
 So when the second SQL is acquiring locks in Zookeeper, it will block.

 Test2:
 Restart thriftServer.
 Instead of switching to another database, I just drop the table in the
 default database:
 (0) Restart thriftServer & use beeline to connect to thriftServer.
 (1) Drop table: "drop table src;" (OK)
 Amazing! Succeed!
 (2) Drop again! "drop table src2;" (Blocked) Same error: the thriftServer
 is blocked in the acquireReadWriteLocks
 phase.

 As you can see,
 only the first SQL requiring locks can succeed.
 So I think the reason is that the thriftServer cannot release locks
 correctly in Zookeeper.









 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





RDD projection and sorting

2014-09-16 Thread Sameer Tilak

Hi All,

I have data in the following format: the 1st column is a userid and the
columns from the second onward are class ids for various products. I want to
save this in libsvm format, and an intermediate step is to sort the class ids
in ascending order. For example:

I/P:
uid1   1243 3580 2670 122 2593 1782 1526
uid2   121 285 2447 1516 343 385 1200 912 1430 5824 1451 8931 1271 1088 2584 1664 5481

Desired O/P:
uid1   122 1243 1526 1782 2593 2670 3580
uid2   121 285 343 385 912 1088 1200 1271 1430 1451 1516 1664 2447 2584 5481 5824 8931

Can someone please point me in the right direction? If I load the data with
val data = sc.textFile(...), how do I project column 1 to the end (not
including column 0) and then sort these projected columns?
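A minimal sketch of one way to do the projection and sort, assuming
whitespace-separated lines as in the example above and an existing
SparkContext sc (the paths are illustrative):

val data = sc.textFile("hdfs:///input/classids.txt")
val sortedLines = data.map { line =>
  val fields = line.split("\\s+")
  val uid = fields.head                          // column 0: the user id
  val classIds = fields.tail.map(_.toInt).sorted // columns 1..n, sorted ascending
  uid + " " + classIds.mkString(" ")
}
sortedLines.saveAsTextFile("hdfs:///output/sorted-classids")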
  

RE: Spark as a Library

2014-09-16 Thread Ruebenacker, Oliver A

 Hello,

  Thanks for the response and great to hear it is possible. But how do I 
connect to Spark without using the submit script?

  I know how to start up a master and some workers and then connect to the 
master by packaging the app that contains the SparkContext and then submitting 
the package with the spark-submit script in standalone-mode. But I don’t want 
to submit the app that contains the SparkContext via the script, because I want 
that app to be running on a web server. So, what are other ways to connect to 
Spark? I can’t find in the docs anything other than using the script. Thanks!

 Best, Oliver

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Tuesday, September 16, 2014 1:31 PM
To: Ruebenacker, Oliver A; user@spark.apache.org
Subject: Re: Spark as a Library

If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.

Matei


On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com) 
wrote:

 Hello,

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 Best, Oliver

Oliver Ruebenacker | Solutions Architect

Altisource™
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617) 728-5582 | ext: 275585
oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com | 
www.Altisource.com

***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***
***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***


Re: NullWritable not serializable

2014-09-16 Thread Du Li
Hi,

The test case is separated out as follows. The call to rdd2.first() breaks when 
spark version is changed to 1.1.0, reporting exception NullWritable not 
serializable. However, the same test passed with spark 1.0.2. The pom.xml file 
is attached. The test data README.md was copied from spark.

Thanks,
Du
-

package com.company.project.test

import org.scalatest._

class WritableTestSuite extends FunSuite {
  test("generated sequence file should be readable from spark") {
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.spark.{SparkContext, SparkConf}
    import org.apache.spark.SparkContext._

    val conf = new SparkConf(false).setMaster("local").setAppName("test data exchange with spark")
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("README.md")
    val res = rdd.map(x => (NullWritable.get(), new Text(x)))
    res.saveAsSequenceFile("./test_data")

    val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])

assert(rdd.first == rdd2.first._2.toString)
  }
}



From: Matei Zaharia matei.zaha...@gmail.commailto:matei.zaha...@gmail.com
Date: Monday, September 15, 2014 at 10:52 PM
To: Du Li l...@yahoo-inc.commailto:l...@yahoo-inc.com
Cc: user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org, 
d...@spark.apache.orgmailto:d...@spark.apache.org 
d...@spark.apache.orgmailto:d...@spark.apache.org
Subject: Re: NullWritable not serializable

Can you post the exact code for the test that worked in 1.0? I can't think of 
much that could've changed. The one possibility is if  we had some operations 
that were computed locally on the driver (this happens with things like first() 
and take(), which will try to do the first partition locally). But generally 
speaking these operations should *not* work over a network, so you'll have to 
make sure that you only send serializable types through shuffles or collects, 
or use a serialization framework like Kryo that might be okay with Writables.

Matei


On September 15, 2014 at 9:13:13 PM, Du Li 
(l...@yahoo-inc.commailto:l...@yahoo-inc.com) wrote:

Hi Matei,

Thanks for your reply.

The Writable classes have never been serializable and this is why it is weird. 
I did try as you suggested to map the Writables to integers and strings. It 
didn’t pass, either. Similar exceptions were thrown except that the messages 
became IntWritable, Text are not serializable. The reason is in the implicits 
defined in the SparkContext object that convert those values into their 
corresponding Writable classes before saving the data in sequence file.

My original code was actual some test cases to try out SequenceFile related 
APIs. The tests all passed when the spark version was specified as 1.0.2. But 
this one failed after I changed the spark version to 1.1.0 the new release, 
nothing else changed. In addition, it failed when I called rdd2.collect(), 
take(1), and first(). But it worked fine when calling rdd2.count(). As you can 
see, count() does not need to serialize and ship data while the other three 
methods do.

Do you recall any difference between spark 1.0 and 1.1 that might cause this 
problem?

Thanks,
Du


From: Matei Zaharia matei.zaha...@gmail.commailto:matei.zaha...@gmail.com
Date: Friday, September 12, 2014 at 9:10 PM
To: Du Li l...@yahoo-inc.com.invalidmailto:l...@yahoo-inc.com.invalid, 
user@spark.apache.orgmailto:user@spark.apache.org 
user@spark.apache.orgmailto:user@spark.apache.org, 
d...@spark.apache.orgmailto:d...@spark.apache.org 
d...@spark.apache.orgmailto:d...@spark.apache.org
Subject: Re: NullWritable not serializable

Hi Du,

I don't think NullWritable has ever been serializable, so you must be doing 
something differently from your previous program. In this case though, just use 
a map() to turn your Writables to serializable types (e.g. null and String).
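A minimal sketch of that map() approach, assuming an existing SparkContext sc
and the sequence file written earlier in this thread:

import org.apache.hadoop.io.{NullWritable, Text}

val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
// Convert the Writables to plain serializable types before collect()/first()/take():
val strings = rdd2.map { case (_, text) => text.toString }
strings.collect().foreach(println)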

Matei


On September 12, 2014 at 8:48:36 PM, Du Li 
(l...@yahoo-inc.com.invalidmailto:l...@yahoo-inc.com.invalid) wrote:

Hi,

I was trying the following on spark-shell (built with apache master and hadoop 
2.4.0). Both calling rdd2.collect and calling rdd3.collect threw 
java.io.NotSerializableException: org.apache.hadoop.io.NullWritable.

I got the same problem in similar code of my app which uses the newly released 
Spark 1.1.0 under hadoop 2.4.0. Previously it worked fine with spark 1.0.2 
under either hadoop 2.40 and 0.23.10.

Anybody knows what caused the problem?

Thanks,
Du


import org.apache.hadoop.io.{NullWritable, Text}
val rdd = sc.textFile("README.md")
val res = rdd.map(x => (NullWritable.get(), new Text(x)))
res.saveAsSequenceFile("./test_data")
val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
rdd2.collect
val rdd3 = sc.sequenceFile[NullWritable, Text]("./test_data")
rdd3.collect




pom.xml
Description: pom.xml

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark as a Library

2014-09-16 Thread Daniel Siegmann
You can create a new SparkContext inside your container pointed to your
master. However, for your script to run you must call addJars to put the
code on your workers' classpaths (except when running locally).

Hopefully your webapp has some lib folder which you can point to as a
source for the jars. In the Play Framework you can use
play.api.Play.application.getFile(lib) to get a path to the lib directory
and get the contents. Of course that only works on the packaged web app.
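A minimal sketch of that, assuming a standalone cluster and a jar packaged
under the webapp's lib directory (the master URL and jar path are
placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("webapp-embedded-spark")
  .setMaster("spark://master-host:7077")      // the cluster outside the container
  .setJars(Seq("lib/my-spark-jobs.jar"))      // ships the job classes to the workers
val sc = new SparkContext(conf)

// Any closures used below must resolve against the jars shipped above.
val count = sc.parallelize(1 to 1000).map(_ * 2).count()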

On Tue, Sep 16, 2014 at 3:17 PM, Ruebenacker, Oliver A 
oliver.ruebenac...@altisource.com wrote:



  Hello,



   Thanks for the response and great to hear it is possible. But how do I
 connect to Spark without using the submit script?



   I know how to start up a master and some workers and then connect to the
 master by packaging the app that contains the SparkContext and then
 submitting the package with the spark-submit script in standalone-mode. But
 I don’t want to submit the app that contains the SparkContext via the
 script, because I want that app to be running on a web server. So, what are
 other ways to connect to Spark? I can’t find in the docs anything other
 than using the script. Thanks!



  Best, Oliver



 *From:* Matei Zaharia [mailto:matei.zaha...@gmail.com]
 *Sent:* Tuesday, September 16, 2014 1:31 PM
 *To:* Ruebenacker, Oliver A; user@spark.apache.org
 *Subject:* Re: Spark as a Library



 If you want to run the computation on just one machine (using Spark's
 local mode), it can probably run in a container. Otherwise you can create a
 SparkContext there and connect it to a cluster outside. Note that I haven't
 tried this though, so the security policies of the container might be too
 restrictive. In that case you'd have to run the app outside and expose an
 RPC interface between them.



 Matei



 On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A (
 oliver.ruebenac...@altisource.com) wrote:



  Hello,



   Suppose I want to use Spark from an application that I already submit to
 run in another container (e.g. Tomcat). Is this at all possible? Or do I
 have to split the app into two components, and submit one to Spark and one
 to the other container? In that case, what is the preferred way for the two
 components to communicate with each other? Thanks!



  Best, Oliver



 Oliver Ruebenacker | Solutions Architect



 Altisource™

 290 Congress St, 7th Floor | Boston, Massachusetts 02210

 P: (617) 728-5582 | ext: 275585

 oliver.ruebenac...@altisource.com | www.Altisource.com



 ***


 This email message and any attachments are intended solely for the use of
 the addressee. If you are not the intended recipient, you are prohibited
 from reading, disclosing, reproducing, distributing, disseminating or
 otherwise using this transmission. If you have received this message in
 error, please promptly notify the sender by reply email and immediately
 delete this message from your system. This message and any attachments
 may contain information that is confidential, privileged or exempt from
 disclosure. Delivery of this message to any person other than the intended
 recipient is not intended to waive any right or privilege. Message
 transmission is not guaranteed to be secure or free of software viruses.

 ***


 ***

 This email message and any attachments are intended solely for the use of
 the addressee. If you are not the intended recipient, you are prohibited
 from reading, disclosing, reproducing, distributing, disseminating or
 otherwise using this transmission. If you have received this message in
 error, please promptly notify the sender by reply email and immediately
 delete this message from your system.
 This message and any attachments may contain information that is
 confidential, privileged or exempt from disclosure. Delivery of this
 message to any person other than the intended recipient is not intended to
 waive any right or privilege. Message transmission is not guaranteed to be
 secure or free of software viruses.

 ***




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


R: Spark as a Library

2014-09-16 Thread Paolo Platter
Hi,

Spark Job Server by Ooyala is the right tool for the job. It exposes a REST API, 
so calling it from a web app is suitable.
It is open source; you can find it on GitHub.

Best

Paolo Platter

From: Ruebenacker, Oliver A mailto:oliver.ruebenac...@altisource.com
Sent: 16/09/2014 21.18
To: Matei Zaharia mailto:matei.zaha...@gmail.com; 
user@spark.apache.org mailto:user@spark.apache.org
Subject: RE: Spark as a Library


 Hello,

  Thanks for the response and great to hear it is possible. But how do I 
connect to Spark without using the submit script?

  I know how to start up a master and some workers and then connect to the 
master by packaging the app that contains the SparkContext and then submitting 
the package with the spark-submit script in standalone-mode. But I don’t want 
to submit the app that contains the SparkContext via the script, because I want 
that app to be running on a web server. So, what are other ways to connect to 
Spark? I can’t find in the docs anything other than using the script. Thanks!

 Best, Oliver

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Tuesday, September 16, 2014 1:31 PM
To: Ruebenacker, Oliver A; user@spark.apache.org
Subject: Re: Spark as a Library

If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.

Matei


On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com) 
wrote:

 Hello,

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 Best, Oliver

Oliver Ruebenacker | Solutions Architect

Altisource™
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617) 728-5582 | ext: 275585
oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com | 
www.Altisource.com

***

This email message and any attachments are intended solely for the use of the 
addressee. If you are not the intended recipient, you are prohibited from 
reading, disclosing, reproducing, distributing, disseminating or otherwise 
using this transmission. If you have received this message in error, please 
promptly notify the sender by reply email and immediately delete this message 
from your system. This message and any attachments may contain information that 
is confidential, privileged or exempt from disclosure. Delivery of this message 
to any person other than the intended recipient is not intended to waive any 
right or privilege. Message transmission is not guaranteed to be secure or free 
of software viruses.
***


Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
Hi Sean,

Great catch! Yes I was including Spark as a dependency and it was
making its way into my uber jar.  Following the advice I just found at
Stackoverflow[1],  I marked Spark as a provided dependency and that
appeared to fix my Hadoop client issue.  Thanks for your help!!!
Perhaps the maintainers might consider setting this in the Quickstart
guide pom.xml ( http://spark.apache.org/docs/latest/quick-start.html ).

In summary, here's what worked:
 * Hadoop 2.3 cdh5
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz
 * Spark 1.1 for Hadoop 2.3
http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.3.tgz

pom.xml snippets: https://gist.github.com/ypwais/ff188611d4806aa05ed9

[1] 
http://stackoverflow.com/questions/24747037/how-to-define-a-dependency-scope-in-maven-to-include-a-library-in-compile-run

Thanks everybody!!
-Paul
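
For readers following along, a hedged sketch of the 'provided' scope described
here; this is an illustration only, not the exact contents of the gist above:

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.0</version>
    <scope>provided</scope>
  </dependency>

With this, spark-core is available at compile time but is not bundled into the
uber jar, so the cluster's own Spark (and its Hadoop client) is used at runtime.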


On Tue, Sep 16, 2014 at 3:55 AM, Sean Owen so...@cloudera.com wrote:
 From the caller / application perspective, you don't care what version
 of Hadoop Spark is running on on the cluster. The Spark API you
 compile against is the same. When you spark-submit the app, at
 runtime, Spark is using the Hadoop libraries from the cluster, which
 are the right version.

 So when you build your app, you mark Spark as a 'provided' dependency.
 Therefore in general, no, you do not build Spark for yourself if you
 are a Spark app creator.

 (Of course, your app would care if it were also using Hadoop libraries
 directly. In that case, you will want to depend on hadoop-client, and
 the right version for your cluster, but still mark it as provided.)

 The version Spark is built against only matters when you are deploying
 Spark's artifacts on the cluster to set it up.

 Your error suggests there is still a version mismatch. Either you
 deployed a build that was not compatible, or, maybe you are packaging
 a version of Spark with your app which is incompatible and
 interfering.

 For example, the artifacts you get via Maven depend on Hadoop 1.0.4. I
 suspect that's what you're doing -- packaging Spark(+Hadoop1.0.4) with
 your app, when it shouldn't be packaged.

 Spark works out of the box with just about any modern combo of HDFS and YARN.

 On Tue, Sep 16, 2014 at 2:28 AM, Paul Wais pw...@yelp.com wrote:
 Dear List,

 I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
 reading SequenceFiles.  In particular, I'm seeing:

 Exception in thread main org.apache.hadoop.ipc.RemoteException:
 Server IPC version 7 cannot communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
 ...

 When invoking JavaSparkContext#newAPIHadoopFile().  (With args
 validSequenceFileURI, SequenceFileInputFormat.class, Text.class,
 BytesWritable.class, new Job().getConfiguration() -- Pretty close to
 the unit test here:
 https://github.com/apache/spark/blob/f0f1ba09b195f23f0c89af6fa040c9e01dfa8951/core/src/test/java/org/apache/spark/JavaAPISuite.java#L916
 )


 This error indicates to me that Spark is using an old hadoop client to
 do reads.  Oddly I'm able to do /writes/ ok, i.e. I'm able to write
 via JavaPairRdd#saveAsNewAPIHadoopFile() to my hdfs cluster.


 Do I need to explicitly build spark for modern hadoop??  I previously
 had an hdfs cluster running hadoop 2.3.0 and I was getting a similar
 error (server is using version 9, client is using version 4).


 I'm using Spark 1.1 cdh4 as well as hadoop cdh4 from the links posted
 on spark's site:
  * http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz
  * http://d3kbcqa49mib13.cloudfront.net/hadoop-2.0.0-cdh4.2.0.tar.gz


 What distro of hadoop is used at Data Bricks?  Are there distros of
 Spark 1.1 and hadoop that should work together out-of-the-box?
 (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..)

 Thanks for any help anybody can give me here!
 -Paul

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark processing small files.

2014-09-16 Thread cem
Hi all,

Spark is taking too much time to start the first stage with many small
files in HDFS.

I am reading a folder that contains RC files:

sc.hadoopFile("hdfs://hostname:8020/test_data2gb/",
  classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
  classOf[LongWritable], classOf[BytesRefArrayWritable])

And parse:
 val parsedData = file.map((tuple: (LongWritable, BytesRefArrayWritable))
   => RCFileUtil.getData(tuple._2))

620 3 MB files (2 GB total) take considerably more time to start the first
stage than 200 40 MB files (8 GB total).


Do you have any idea about the reason? Thanks!

Best Regards,
Cem Cayiroglu
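
A hedged sketch of one common mitigation, under the assumption that the overhead
comes from creating and scheduling one task per small file (names follow the
snippet above; the target partition count is arbitrary):

  val file = sc.hadoopFile("hdfs://hostname:8020/test_data2gb/",
    classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
    classOf[LongWritable], classOf[BytesRefArrayWritable])

  // Merge the many per-file partitions into far fewer (no shuffle involved),
  // so the first stage schedules ~48 tasks instead of ~620.
  val parsedData = file.coalesce(48)
    .map { case (_, bytes) => RCFileUtil.getData(bytes) }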


Indexed RDD

2014-09-16 Thread Akshat Aranya
Hi,

I'm trying to implement a custom RDD that essentially works as a
distributed hash table, i.e. the key space is split up into partitions and
within a partition, an element can be looked up efficiently by the key.
However, the RDD lookup() function (in PairRDDFunctions) is implemented in
a way that iterates through all elements of a partition and finds the matching
ones.  Is there a better way to do what I want to do, short of just
implementing new methods on the custom RDD?

Thanks,
Akshat
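
One workaround, as a minimal sketch (an assumption, not a reply from the thread):
build a hash map per partition with mapPartitions and cache it, so that the probe
inside each partition is a constant-time map lookup. Key/value types and the
'pairs' RDD are placeholders.

  import org.apache.spark.rdd.RDD

  // 'pairs' is a hypothetical RDD[(Long, String)], partitioned by key.
  val indexed: RDD[Map[Long, String]] = pairs
    .mapPartitions(iter => Iterator(iter.toMap))
    .cache()

  // A lookup now probes one map per partition instead of scanning every element.
  def lookup(key: Long): Array[String] =
    indexed.flatMap(m => m.get(key).toSeq).collect()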


Re: Categorical Features for K-Means Clustering

2014-09-16 Thread st553
Does MLlib provide utility functions to do this kind of encoding?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Sean Owen
I think it's on the table but not yet merged?
https://issues.apache.org/jira/browse/SPARK-1216
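
Until something like that lands, a hand-rolled one-hot encoding is straightforward.
A minimal sketch, with a hypothetical 'data' RDD of CSV lines whose first column is
the categorical feature:

  import org.apache.spark.rdd.RDD

  // Map each distinct category to an index...
  val categories: RDD[String] = data.map(_.split(",")(0))
  val index: Map[String, Int] = categories.distinct().collect().zipWithIndex.toMap

  // ...then expand each category into a 0/1 indicator vector.
  val encoded: RDD[Array[Double]] = categories.map { c =>
    val v = Array.fill(index.size)(0.0)
    v(index(c)) = 1.0
    v
  }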

On Tue, Sep 16, 2014 at 10:04 PM, st553 sthompson...@gmail.com wrote:
 Does MLlib provide utility functions to do this kind of encoding?



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Memory under-utilization

2014-09-16 Thread francisco
Hi, I'm a Spark newbie.

We had installed spark-1.0.2-bin-cdh4 on a 'super machine' with 256gb memory
and 48 cores. 

Tried to allocate a task with 64gb memory but for whatever reason Spark is
only using around 9gb max.

Submitted spark job with the following command:

/bin/spark-submit -class SimpleApp --master local[16] --executor-memory 64G
/var/tmp/simple-project_2.10-1.0.jar /data/lucene/ns.gz


When I run 'top' command I see only 9gb of memory is used by the spark
process

PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
3047005 fran  30  10 8785m 703m  18m S 112.9  0.3  48:19.63 java


Any idea why this is happening? I've also tried to set the memory
programmatically using
 new SparkConf().set("spark.executor.memory", "64g")  but that also didn't
do anything.

Is there some limitation when running in 'local' mode?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Memory under-utilization

2014-09-16 Thread Boromir Widas
Perhaps your job does not use more than 9g. Even though the dashboard shows
64g, the process only uses what's needed and grows to 64g max.

On Tue, Sep 16, 2014 at 5:40 PM, francisco ftanudj...@nextag.com wrote:

 Hi, I'm a Spark newbie.

 We had installed spark-1.0.2-bin-cdh4 on a 'super machine' with 256gb
 memory
 and 48 cores.

 Tried to allocate a task with 64gb memory but for whatever reason Spark is
 only using around 9gb max.

 Submitted spark job with the following command:
 
 /bin/spark-submit -class SimpleApp --master local[16] --executor-memory 64G
 /var/tmp/simple-project_2.10-1.0.jar /data/lucene/ns.gz
 

 When I run 'top' command I see only 9gb of memory is used by the spark
 process

 PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 3047005 fran  30  10 8785m 703m  18m S 112.9  0.3  48:19.63 java


 Any idea why this is happening? I've also tried to set the memory
 programatically using
  new SparkConf().set(spark.executor.memory, 64g)  but that also
 didn't
 do anything.

 Is there some limitation when running in 'local' mode?

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Questions about Spark speculation

2014-09-16 Thread Nicolas Mai
Hi, guys

My current project is using Spark 0.9.1, and after increasing the level of
parallelism and partitions in our RDDs, stages and tasks seem to complete
much faster. However it also seems that our cluster becomes more unstable
after some time:
- stalled stages still showing under active stages in the Spark app web
dashboard
- incomplete stages showing under completed stages
- stages with failures

I was thinking about reducing/tuning the level of parallelism, but I was
also considering using spark.speculation, which is currently turned off but
seems promising.

Questions about speculation:
- Just wondering why it is turned off by default?
- Are there any risks using speculation?
- Is it possible that a speculative task straggles, and would trigger
another new speculative task to finish the job... and so on... (some kind of
loop until there's no more executors available).
- What configuration do you guys usually use for spark.speculation?
(interval, quantile, multiplier) I guess it depends on the project, it may
give some ideas about how to use it properly.

Thank you! :)
Nicolas
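
For reference, a sketch of the four knobs involved; the values below are just
illustrative starting points, not recommendations from this thread:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.speculation", "true")
    .set("spark.speculation.interval", "100")    // how often (ms) to check for stragglers
    .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking
    .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median counts as slow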



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Spark-speculation-tp14398.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Memory under-utilization

2014-09-16 Thread francisco
Thanks for the reply.

I doubt that's the case though ...  the executor kept having to do a file
dump because memory is full.

...
14/09/16 15:00:18 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
MB to disk (668 times so far)
14/09/16 15:00:21 WARN ExternalAppendOnlyMap: Spilling in-memory map of 66
MB to disk (669 times so far)
14/09/16 15:00:24 WARN ExternalAppendOnlyMap: Spilling in-memory map of 70
MB to disk (670 times so far)
14/09/16 15:00:31 WARN ExternalAppendOnlyMap: Spilling in-memory map of 127
MB to disk (671 times so far)
14/09/16 15:00:43 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
MB to disk (672 times so far)
...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396p14399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How do I manipulate values outside of a GraphX loop?

2014-09-16 Thread crockpotveggies
Brand new to Apache Spark and I'm a little confused about how to make updates to a
value that sits outside of a .mapTriplets iteration in GraphX. I'm aware
mapTriplets is really only for modifying values inside the graph. What about
using it in conjunction with other computations? See below:

def mapTripletsMethod(edgeWeights: Graph[Int, Double],
    stationaryDistribution: Graph[Double, Double]) = {
  val tempMatrix: SparseDoubleMatrix2D = graphToSparseMatrix(edgeWeights)

  stationaryDistribution.mapTriplets { e =>
    val row = e.srcId.toInt
    val column = e.dstId.toInt
    var cellValue = -1 * tempMatrix.get(row, column) + e.dstAttr
    tempMatrix.set(row, column, cellValue) // this doesn't do anything to tempMatrix
    e
  }
}

My first guess is that since the input param for mapTriplets is a function,
then somehow the value for tempMatrix is getting lost or duplicated.

If this really isn't the proper design pattern to do this, does anyone have
any insight on the right way? Implementing some of the more complex
centrality algorithms is rather difficult until I can wrap my head around
this paradigm.

Thank you!
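
The guess above is essentially right: the closure passed to mapTriplets runs on
the executors against serialized copies, so writes to a driver-side tempMatrix are
lost. A minimal sketch of one workaround, reusing the names from the snippet above
as assumptions: compute the values on the cluster and apply them to the matrix back
on the driver.

  // Pull (row, column, dstAttr) triples back to the driver...
  val updates = stationaryDistribution.triplets
    .map(e => (e.srcId.toInt, e.dstId.toInt, e.dstAttr))
    .collect()

  // ...and mutate the local matrix there, where the change is actually visible.
  updates.foreach { case (row, column, dstAttr) =>
    tempMatrix.set(row, column, -1 * tempMatrix.get(row, column) + dstAttr)
  }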



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-manipulate-values-outside-of-a-GraphX-loop-tp14400.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Memory under-utilization

2014-09-16 Thread Boromir Widas
I see, what does http://localhost:4040/executors/ show for memory usage?

I personally find it easier to work with a standalone cluster with a single
worker by using the sbin/start-master.sh and then connecting to the master.

On Tue, Sep 16, 2014 at 6:04 PM, francisco ftanudj...@nextag.com wrote:

 Thanks for the reply.

 I doubt that's the case though ...  the executor kept having to do a file
 dump because memory is full.

 ...
 14/09/16 15:00:18 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
 MB to disk (668 times so far)
 14/09/16 15:00:21 WARN ExternalAppendOnlyMap: Spilling in-memory map of 66
 MB to disk (669 times so far)
 14/09/16 15:00:24 WARN ExternalAppendOnlyMap: Spilling in-memory map of 70
 MB to disk (670 times so far)
 14/09/16 15:00:31 WARN ExternalAppendOnlyMap: Spilling in-memory map of 127
 MB to disk (671 times so far)
 14/09/16 15:00:43 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
 MB to disk (672 times so far)
 ...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396p14399.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-16 Thread Aris
Hello Spark Community -

I am using the support vector machine / SVM implementation in MLlib with
the standard linear kernel; however, I noticed that the Spark documentation
for StandardScaler *specifically* mentions that SVMs which use the RBF
kernel work really well when you have standardized data...

which raises the question: is there some kind of support for RBF kernels
rather than linear kernels? In small data tests using R, the RBF kernel
worked really well, and the linear kernel never converged... so I would really
like to use RBF.

Thank you folks for any help!

Aris


Re: org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist)

2014-09-16 Thread Aris
This should be a really simple problem, but you haven't shared enough code
to determine what's going on here.
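
One common cause, offered here as an assumption since the full code isn't shown:
the job that fails runs its tasks on the executors (the earlier "first" job was
computed locally on the driver and succeeded), and a file:/ path is opened lazily
on each executor, so it must exist at that exact path on every worker node, not
just on the driver. Reading the data from HDFS sidesteps that; a one-line sketch
with a hypothetical HDFS URI:

  val data = sc.textFile("hdfs://namenode:8020/user/root/sample_svm_data.txt")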

On Tue, Sep 16, 2014 at 8:08 AM, Hui Li littleleave...@gmail.com wrote:

 Hi,

 I am new to Spark. I just set up a small cluster and wanted to run some
 simple MLlib examples. By following the instructions of
 https://spark.apache.org/docs/0.9.0/mllib-guide.html#binary-classification-1,
 I could successfully run everything until the step of SVMWithSGD, where I got
 the following error message. I don't know why
 file:/root/test/sample_svm_data.txt does not exist, since I already read it
 out, printed it, converted it into labeled data and passed the parsed
 data to the function SVMWithSGD.

 Any one have the same issue with me?

 Thanks,

 Emily

  val model = SVMWithSGD.train(parsedData, numIterations)
 14/09/16 10:55:21 INFO SparkContext: Starting job: first at
 GeneralizedLinearAlgorithm.scala:121
 14/09/16 10:55:21 INFO DAGScheduler: Got job 11 (first at
 GeneralizedLinearAlgorithm.scala:121) with 1 output partitions
 (allowLocal=true)
 14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 11 (first at
 GeneralizedLinearAlgorithm.scala:121)
 14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
 14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
 14/09/16 10:55:21 INFO DAGScheduler: Computing the requested partition
 locally
 14/09/16 10:55:21 INFO HadoopRDD: Input split:
 file:/root/test/sample_svm_data.txt:0+19737
 14/09/16 10:55:21 INFO SparkContext: Job finished: first at
 GeneralizedLinearAlgorithm.scala:121, took 0.002697478 s
 14/09/16 10:55:21 INFO SparkContext: Starting job: count at
 DataValidators.scala:37
 14/09/16 10:55:21 INFO DAGScheduler: Got job 12 (count at
 DataValidators.scala:37) with 2 output partitions (allowLocal=false)
 14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 12 (count at
 DataValidators.scala:37)
 14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
 14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
 14/09/16 10:55:21 INFO DAGScheduler: Submitting Stage 12 (FilteredRDD[26]
 at filter at DataValidators.scala:37), which has no missing parents
 14/09/16 10:55:21 INFO DAGScheduler: Submitting 2 missing tasks from Stage
 12 (FilteredRDD[26] at filter at DataValidators.scala:37)
 14/09/16 10:55:21 INFO TaskSchedulerImpl: Adding task set 12.0 with 2 tasks
 14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:0 as TID 24 on
 executor 2: eecvm0206.demo.sas.com (PROCESS_LOCAL)
 14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:0 as 1733
 bytes in 0 ms
 14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:1 as TID 25 on
 executor 5: eecvm0203.demo.sas.com (PROCESS_LOCAL)
 14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:1 as 1733
 bytes in 0 ms
 14/09/16 10:55:21 WARN TaskSetManager: Lost TID 24 (task 12.0:0)
 14/09/16 10:55:21 WARN TaskSetManager: Loss was due to
 java.io.FileNotFoundException
 java.io.FileNotFoundException: File file:/root/test/sample_svm_data.txt
 does not exist
 at
 org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
 at
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
 at
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:137)
 at
 org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
 at
 org.apache.hadoop.mapred.LineRecordReader.init(LineRecordReader.java:108)
 at
 org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
 at
 org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
 at 

Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.



Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
distribution. Everything went fine, and everything seems to work, but for
the following.

Following are two invocations of the 'pyspark' script, one with enclosing
quotes around the options passed to '--driver-java-options', and one without
them. I added the following one-liner to the 'pyspark' script to show my
problem...

ADDED: echo xxx${PYSPARK_SUBMIT_ARGS}xxx # Added after the line that
exports this variable.


=

FIRST:
[ without enclosing quotes ]:

user@linux$ pyspark --master yarn-client --driver-java-options 
-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1Gxxx --- the echo statement shows the option truncation.


While this succeeds in getting to a pyspark shell prompt (sc), the
context isn't set up properly because, as seen above and below, only the
first option took effect. (Note spark.executor.memory is correct, but that's
only because my spark defaults coincide with it.)

14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java 
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp 
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89' 
'-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G' 
'-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars=' 
'-Dspark.submit.pyFiles=' 
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' 
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress=' 
'-Dspark.app.name=PySparkShell' 
'-Dspark.driver.appUIAddress=dstorm:4040' 
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G' 
'-Dspark.fileserver.uri=http://192.168.0.16:60305' 
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client' 
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  
null  --arg  'dstorm:44616' --executor-memory 1024 --executor-cores 1 
--num-executors  2 1 LOG_DIR/stdout 2 LOG_DIR/stderr


(Note: I happen to notice that 'spark.driver.memory' is missing as well).

===

NEXT:

[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options 
'-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options 
-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jarxxx


While this does have all the options (shown in the red echo output above 
and the command executed below), pyspark invocation fails, indicating

that the application ended before I got to a shell prompt.
See below snippet.

14/09/16 17:44:12 INFO yarn.Client:   command: $JAVA_HOME/bin/java 
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp 
'-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada' 
'-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M' 
'-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' 
'-Dspark.serializer.objectStreamReset=100' 
'-Dspark.executor.instances=3' '-Dspark.rdd.compress=True' 
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles=' 
'-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm' 
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' 
'-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell' 
'-Dspark.driver.appUIAddress=dstorm:8468' 
'-Dspark.yarn.executor.memoryOverhead=512M' 
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G 
-Dspark.ui.port=8468 -Dspark.driver.memory=512M 
-Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' 
'-Dspark.fileserver.uri=http://192.168.0.16:54171' 
'-Dspark.master=yarn-client' '-Dspark.driver.port=58542' 
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  
null  --arg  'dstorm:58542' --executor-memory 1024 --executor-cores 1 
--num-executors  3 1 LOG_DIR/stdout 2 LOG_DIR/stderr



[ ... SNIP ... ]
14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:

 appMasterRpcPort: -1
 appStartTime: 1410903852044
 yarnAppState: ACCEPTED

14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:

 appMasterRpcPort: -1
 appStartTime: 

partitioned groupBy

2014-09-16 Thread Akshat Aranya
I have a use case where my RDD is set up such:

Partition 0:
K1 -> [V1, V2]
K2 -> [V2]

Partition 1:
K3 -> [V1]
K4 -> [V3]

I want to invert this RDD, but only within a partition, so that the
operation does not require a shuffle.  It doesn't matter if the partitions
of the inverted RDD have non unique keys across the partitions, for example:

Partition 0:
V1 -> [K1]
V2 -> [K1, K2]

Partition 1:
V1 -> [K3]
V3 -> [K4]

Is there a way to do only a per-partition groupBy, instead of shuffling the
entire data?


Re: partitioned groupBy

2014-09-16 Thread Patrick Wendell
If each partition can fit in memory, you can do this using
mapPartitions and then building an inverse mapping within each
partition. You'd need to construct a hash map within each partition
yourself.
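
A minimal sketch of that approach, assuming each partition's inverse map fits in
memory (key/value types are placeholders):

  import scala.collection.mutable
  import org.apache.spark.rdd.RDD

  // rdd: RDD[(String, Seq[String])], e.g. K1 -> [V1, V2] within a partition.
  def invertWithinPartitions(rdd: RDD[(String, Seq[String])]): RDD[(String, Seq[String])] =
    rdd.mapPartitions { iter =>
      val inverse = mutable.Map.empty[String, mutable.ArrayBuffer[String]]
      for ((k, vs) <- iter; v <- vs)
        inverse.getOrElseUpdate(v, mutable.ArrayBuffer.empty[String]) += k
      inverse.iterator.map { case (v, ks) => (v, ks.toSeq) }
    }

No shuffle is involved, so duplicate keys across partitions (V1 above) simply stay
in their own partitions.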

On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote:
 I have a use case where my RDD is set up such:

 Partition 0:
 K1 - [V1, V2]
 K2 - [V2]

 Partition 1:
 K3 - [V1]
 K4 - [V3]

 I want to invert this RDD, but only within a partition, so that the
 operation does not require a shuffle.  It doesn't matter if the partitions
 of the inverted RDD have non unique keys across the partitions, for example:

 Partition 0:
 V1 - [K1]
 V2 - [K1, K2]

 Partition 1:
 V1 - [K3]
 V3 - [K4]

 Is there a way to do only a per-partition groupBy, instead of shuffling the
 entire data?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Memory under-utilization

2014-09-16 Thread francisco
Thanks for the tip.

http://localhost:4040/executors/ is showing 
Executors(1)
Memory: 0.0 B used (294.9 MB Total)
Disk: 0.0 B Used

However, running as standalone cluster does resolve the problem.
I can see a worker process running w/ the allocated memory.

My conclusion (I may be wrong) is that in 'local' mode the 'executor-memory'
parameter is not honored.

Thanks again for the help!
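
That conclusion matches how local mode works: driver and executor share a single
JVM, so the heap is governed by the driver memory setting rather than
--executor-memory. A hedged example of the invocation that should give the process
a 64g heap, reusing the jar and arguments from the earlier post:

  /bin/spark-submit --class SimpleApp --master local[16] --driver-memory 64G \
    /var/tmp/simple-project_2.10-1.0.jar /data/lucene/ns.gz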






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-under-utilization-tp14396p14409.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



how to report documentation bug?

2014-09-16 Thread Andy Davidson

http://spark.apache.org/docs/latest/quick-start.html#standalone-applications

Click on the Java tab. There is a bug in the Maven section:

  <version>1.1.0-SNAPSHOT</version>

Should be:

  <version>1.1.0</version>

Hope this helps

Andy




RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Cheng, Hao
Thank you Yin Huai. This is probably true.

I saw in hive-site.xml that Liu has changed the entry, whose default should
be false.

  <property>
    <name>hive.support.concurrency</name>
    <description>Enable Hive's Table Lock Manager Service</description>
    <value>true</value>
  </property>

Someone is working on upgrading Hive to 0.13 for SparkSQL
(https://github.com/apache/spark/pull/2241); not sure if you can wait for this.
☺

From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Wednesday, September 17, 2014 1:50 AM
To: Cheng, Hao
Cc: linkpatrickliu; u...@spark.incubator.apache.org
Subject: Re: SparkSQL 1.1 hang when DROP or LOAD

I meant it may be a Hive bug since we also call Hive's drop table internally.

On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai 
huaiyin@gmail.commailto:huaiyin@gmail.com wrote:
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?

On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao 
hao.ch...@intel.commailto:hao.ch...@intel.com wrote:
Thank you for pasting the steps, I will look at this, hopefully come out with a 
solution soon.

-Original Message-
From: linkpatrickliu 
[mailto:linkpatrick...@live.commailto:linkpatrick...@live.com]
Sent: Tuesday, September 16, 2014 3:17 PM
To: u...@spark.incubator.apache.orgmailto:u...@spark.incubator.apache.org
Subject: RE: SparkSQL 1.1 hang when DROP or LOAD
Hi, Hao Cheng.

I have done other tests. And the result shows the thriftServer can connect to 
Zookeeper.

However, I found some more interesting things. And I think I have found a bug!

Test procedure:
Test1:
(0) Use beeline to connect to thriftServer.
(1) Switch database use dw_op1; (OK)
The logs show that the thriftServer connected with Zookeeper and acquired locks.

(2) Drop table: drop table src; (Blocked) The logs show that the thriftServer
is in acquireReadWriteLocks.

Doubt:
The reason why I cannot drop table src is that the first SQL, use dw_op1,
left locks in ZooKeeper that were not successfully released.
So when the second SQL tries to acquire locks in ZooKeeper, it blocks.

Test2:
Restart thriftServer.
Instead of switching to another database, I just drop the table in the default 
database;
(0) Restart thriftServer  use beeline to connect to thriftServer.
(1) Drop table drop table src; (OK)
Amazing! Succeed!
(2) Drop again!  drop table src2; (Blocked) Same error: the thriftServer is
blocked in the acquireReadWriteLocks
phase.

As you can see, only the first SQL requiring locks can succeed.
So I think the reason is that the thriftServer cannot release locks correctly
in ZooKeeper.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-1-1-hang-when-DROP-or-LOAD-tp14222p14339.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org For 
additional commands, e-mail: 
user-h...@spark.apache.orgmailto:user-h...@spark.apache.org


-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.orgmailto:user-h...@spark.apache.org




Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread daijia
Is there some way to ship textfile just like ship python libraries?

Thanks in advance
Daijia



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p14412.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to report documentation bug?

2014-09-16 Thread Nicholas Chammas
You can send an email like you just did or open an issue in the Spark issue
tracker http://issues.apache.org/jira/. This looks like a problem with
how the version is generated in this file
https://github.com/apache/spark/blob/branch-1.1/docs/quick-start.md.

On Tue, Sep 16, 2014 at 8:55 PM, Andy Davidson 
a...@santacruzintegration.com wrote:



 http://spark.apache.org/docs/latest/quick-start.html#standalone-applications

 Click on java tab There is a bug in the maven section

   version1.1.0-SNAPSHOT/version



 Should be
 version1.1.0/version

 Hope this helps

 Andy



Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.

Hi Sandy:

Thank you. I have not tried that mechanism (I wasn't aware of it). I will
try that instead.


Is it possible to also represent '--driver-memory' and '--executor-memory'
(and basically all properties) using the '--conf' directive?

The reason: I actually discovered the below issue while writing a custom
PYTHONSTARTUP script that I use to launch *bpython* or *python* or my *WING
python IDE* with. That script reads a python *dict* (from a file) containing
key/value pairs, from which it constructs the --driver-java-options ...; I
will now switch to generating '--conf key1=val1 --conf key2=val2 --conf
key3=val3' (and so on) instead.

If all of the properties could be represented in this way, it makes the code
cleaner (everything in the dict file, and no one-offs).

Either way, thank you. =:)

Noel,
team didata


On 09/16/2014 08:03 PM, Sandy Ryza wrote:

Hi team didata,

This doesn't directly answer your question, but with Spark 1.1,
instead of using the driver options, it's better to pass your Spark
properties using the --conf option.


E.g.
pyspark --master yarn-client --conf spark.shuffle.spill=true --conf 
spark.yarn.executor.memoryOverhead=512M


Additionally, executor and driver memory have dedicated options:

pyspark --master yarn-client --conf spark.shuffle.spill=true --conf 
spark.yarn.executor.memoryOverhead=512M --driver-memory 3G 
--executor-memory 5G


-Sandy


On Tue, Sep 16, 2014 at 6:22 PM, Dimension Data, LLC. 
subscripti...@didata.us mailto:subscripti...@didata.us wrote:




Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
distribution. Everything went fine, and everything seems
to work, but for the following.

Following are two invocations of the 'pyspark' script, one with
enclosing quotes around the options passed to
'--driver-java-options', and one without them. I added the
following one-line in the 'pyspark' script to
show my problem...

ADDED: echo xxx${PYSPARK_SUBMIT_ARGS}xxx # Added after the line
that exports this variable.

=

FIRST:
[ without enclosing quotes ]:

user@linux$ pyspark --master yarn-client --driver-java-options
-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3

-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1Gxxx  --- echo statement show option
truncation.

While this succeeds in getting to a pyspark shell prompt (sc), the
context isn't setup properly because, as seen
in red above and below, all but the first option took effect.
(Note spark.executor.memory is correct but that's only because
my spark defaults coincide with it.)

14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89'
'-Dspark.serializer.objectStreamReset=100'
'-Dspark.executor.memory=1G' '-Dspark.rdd.compress=True'
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress='
'-Dspark.app.name=PySparkShell'
'-Dspark.driver.appUIAddress=dstorm:4040'
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G'
'-Dspark.fileserver.uri=http://192.168.0.16:60305'
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client'
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused'
--jar  null  --arg 'dstorm:44616' --executor-memory 1024
--executor-cores 1 --num-executors 2 1 LOG_DIR/stdout 2
LOG_DIR/stderr

(Note: I happen to notice that 'spark.driver.memory' is missing as
well).

===

NEXT:

[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options
'-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3

-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3

-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jarxxx

While this does have all the options (shown in the red echo output
above and the command executed below), pyspark invocation fails,
indicating
that the application ended before I got 

CPU RAM

2014-09-16 Thread VJ Shalish
Hi

I need to get the CPU utilisation, RAM usage, network IO and other metrics
using a Java program. Can anyone help me with this?

Thanks
Shalish.


Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread Davies Liu
Yes, sc.addFile() is what you want:

 |  addFile(self, path)
 |  Add a file to be downloaded with this Spark job on every node.
 |  The C{path} passed can be either a local file, a file in HDFS
 |  (or other Hadoop-supported filesystems), or an HTTP, HTTPS or
 |  FTP URI.
 |
 |  To access the file in Spark jobs, use
 |  L{SparkFiles.get(fileName)pyspark.files.SparkFiles.get} with the
 |  filename to find its download location.
 |
 |  >>> from pyspark import SparkFiles
 |  >>> path = os.path.join(tempdir, "test.txt")
 |  >>> with open(path, "w") as testFile:
 |  ...    testFile.write("100")
 |  >>> sc.addFile(path)
 |  >>> def func(iterator):
 |  ...    with open(SparkFiles.get("test.txt")) as testFile:
 |  ...        fileVal = int(testFile.readline())
 |  ...    return [x * fileVal for x in iterator]
 |  >>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
 |  [100, 200, 300, 400]

On Tue, Sep 16, 2014 at 7:02 PM, daijia jia_...@intsig.com wrote:
 Is there some way to ship textfile just like ship python libraries?

 Thanks in advance
 Daijia



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p14412.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: CPU RAM

2014-09-16 Thread Amit
Not particularly related to Spark, but you can check out the SIGAR API. It lets
you get CPU, memory, network, filesystem and process-based metrics.

Amit
On Sep 16, 2014, at 20:14, VJ Shalish vjshal...@gmail.com wrote:

 Hi
  
 I need to get the CPU utilisation, RAM usage, Network IO and other metrics 
 using Java program. Can anyone help me on this?
  
 Thanks
 Shalish.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: CPU RAM

2014-09-16 Thread VJ Shalish
Thank you for the response, Amit.
So is it that we cannot measure the CPU consumption and RAM usage of a Spark
job through a Java program?

On Tue, Sep 16, 2014 at 11:23 PM, Amit kumarami...@gmail.com wrote:

 Not particularly related to Spark, but you can check out SIGAR API. It
 let's you get CPU, Memory, Network, Filesystem and process based metrics.

 Amit
 On Sep 16, 2014, at 20:14, VJ Shalish vjshal...@gmail.com wrote:

  Hi
 
  I need to get the CPU utilisation, RAM usage, Network IO and other
 metrics using Java program. Can anyone help me on this?
 
  Thanks
  Shalish.



Re: CPU RAM

2014-09-16 Thread VJ Shalish
Sorry for the confusion, team.
My requirement is to measure the CPU utilisation, RAM usage, network IO and
other metrics of a SPARK JOB using a Java program.
Please help with the same.

On Tue, Sep 16, 2014 at 11:23 PM, Amit kumarami...@gmail.com wrote:

 Not particularly related to Spark, but you can check out SIGAR API. It
 let's you get CPU, Memory, Network, Filesystem and process based metrics.

 Amit
 On Sep 16, 2014, at 20:14, VJ Shalish vjshal...@gmail.com wrote:

  Hi
 
  I need to get the CPU utilisation, RAM usage, Network IO and other
 metrics using Java program. Can anyone help me on this?
 
  Thanks
  Shalish.



YARN mode not available error

2014-09-16 Thread Barrington
Hi,

I am running Spark in cluster mode with Hadoop YARN as the underlying
cluster manager. I get this error when trying to initialize the
SparkContext. 


Exception in thread main org.apache.spark.SparkException: YARN mode not
available ?
at
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1586)
at org.apache.spark.SparkContext.init(SparkContext.scala:310)
at org.apache.spark.SparkContext.init(SparkContext.scala:86)
at LascoScript$.main(LascoScript.scala:24)
at LascoScript.main(LascoScript.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.scheduler.cluster.YarnClientClusterScheduler
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1580)




My build.sbt file  looks like this:



name := "LascoScript"

version := "1.0"

scalaVersion := "2.10.4"

val excludeJBossNetty = ExclusionRule(organization = "org.jboss.netty")
val excludeMortbayJetty = ExclusionRule(organization = "org.eclipse.jetty",
  artifact = "jetty-server")
val excludeAsm = ExclusionRule(organization = "org.ow2.asm")
val excludeCommonsLogging = ExclusionRule(organization = "commons-logging")
val excludeSLF4J = ExclusionRule(organization = "org.slf4j")
val excludeOldAsm = ExclusionRule(organization = "asm")
val excludeServletApi = ExclusionRule(organization = "javax.servlet",
  artifact = "servlet-api")

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" excludeAll(
  excludeServletApi, excludeMortbayJetty
)

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.5.1" excludeAll(
  excludeJBossNetty, excludeMortbayJetty, excludeAsm, excludeCommonsLogging,
  excludeSLF4J, excludeOldAsm, excludeServletApi
)

libraryDependencies += "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"

libraryDependencies += "org.eclipse.jetty" % "jetty-server" % "8.1.16.v20140903"

unmanagedJars in Compile ++= {
  val base = baseDirectory.value
  val baseDirectories = (base / "lib") +++ (base)
  val customJars = (baseDirectories ** "*.jar")
  customJars.classpath
}

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"



How can I fix this issue?

- Barrington
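
The ClassNotFoundException for YarnClientClusterScheduler suggests the YARN support
classes are not on the classpath when the SparkContext is created directly (here,
from IntelliJ's AppMain rather than spark-submit). One hedged fix, assuming the
spark-yarn artifact is published for your Spark/Scala version, is to add it to
build.sbt; the usual alternative is to launch through spark-submit against a
YARN-enabled Spark distribution.

  libraryDependencies += "org.apache.spark" %% "spark-yarn" % "1.1.0"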



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/YARN-mode-not-available-error-tp14420.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-16 Thread edmond_huo
Hi,

I am new to Spark. I tried to run a job like the wordcount example in
Python. But when I tried to get the top 10 popular words in the file, I got
the message: AttributeError: 'PipelinedRDD' object has no attribute
'sortByKey'.

So my question is what is the difference between PipelinedRDD and RDD? and
if I want to sort the data in PipelinedRDD, how can I do it?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-difference-between-pyspark-rdd-PipelinedRDD-and-pyspark-rdd-RDD-tp14421.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



permission denied on local dir

2014-09-16 Thread style95
I am running spark on shared yarn cluster.
My user ID is 'online', but I found that when I run my Spark application,
local directories are created by the 'yarn' user ID.
So I am unable to delete the local directories, and finally the application failed.

Please refer to my log below:

14/09/16 21:59:02 ERROR DiskBlockManager: Exception while deleting local
spark dir:
/hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7
java.io.IOException: Failed to list files for dir:
/hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7/3a
at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:580)
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:592)
at
org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:593)
at
org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:592)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:592)
at
org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:163)
at
org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:160)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:160)
at
org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:153)
at
org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:151)
at
org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:151)
at
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at
org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:151)


I am unable to access
/hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7
e.g.) ls
/hadoop02/hadoop/yarn/local/usercache/online/appcache/application_1410795082830_3994/spark-local-20140916215842-6fe7
does not work; permission denied occurs.

I am using spark-1.0.0 and yarn 2.4.0.

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/permission-denied-on-local-dir-tp14422.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org