Using TestHiveContext/HiveContext in unit tests

2015-12-11 Thread Sahil Sareen
I'm trying to do this in unit tests:

val sConf = new SparkConf()
  .setAppName("RandomAppName")
  .setMaster("local")
val sc = new SparkContext(sConf)
val sqlContext = new TestHiveContext(sc)  // tried new HiveContext(sc) as well


But I get this:

[scalatest] Exception encountered when invoking run on a nested suite -
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient *** ABORTED ***

[scalatest]   java.lang.RuntimeException: java.lang.RuntimeException:
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
[scalatest]   at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
[scalatest]   at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:120)
[scalatest]   at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163)
[scalatest]   at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161)
[scalatest]   at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:168)
[scalatest]   at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:72)
[scalatest]   at mypackage.NewHiveTest.beforeAll(NewHiveTest.scala:48)
[scalatest]   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
[scalatest]   at mypackage.NewHiveTest.beforeAll(NewHiveTest.scala:35)
[scalatest]   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
[scalatest]   at mypackage.NewHiveTest.run(NewHiveTest.scala:35)
[scalatest]   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1491)

The code works perfectly when I run it with spark-submit, but not in unit
tests. Any inputs?
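
(For reference, a minimal ScalaTest harness of the kind in play here would look
roughly like the sketch below; the suite and test names are illustrative and
simply mirror the beforeAll/afterAll frames in the trace above.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.test.TestHiveContext
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class NewHiveTest extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _
  @transient private var hiveContext: TestHiveContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val sConf = new SparkConf().setAppName("RandomAppName").setMaster("local")
    sc = new SparkContext(sConf)
    hiveContext = new TestHiveContext(sc)   // this is where the metastore error surfaces
  }

  override def afterAll(): Unit = {
    try {
      if (sc != null) sc.stop()
    } finally {
      super.afterAll()
    }
  }

  test("test hive context starts") {
    // trivial smoke test against the context
    assert(hiveContext.sql("SHOW TABLES").collect() != null)
  }
}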

-Sahil


Re: Spark Java.lang.NullPointerException

2015-12-11 Thread Steve Loughran

> On 11 Dec 2015, at 05:14, michael_han  wrote:
> 
> Hi Sarala,
> I found the reason: it's because when Spark runs it still needs Hadoop
> support. I think it's a bug in Spark that still isn't fixed ;)
> 

It's related to how the Hadoop filesystem APIs are used to access pretty much 
every filesystem out there, and, for local "file://" filesystems, to give you 
unix-like fs permissions as well as linking. It's a pain, but not something 
that can be removed.

For Windows, that Hadoop native stuff is critical; on Linux you can get away 
without it when playing with Spark, but in production having the Hadoop native 
libraries is critical for performance, especially when working with compressed 
data. And with encryption and erasure coding, that problem isn't going to go 
away.

View it less as a bug and more as "evidence that you can't get away from low-level 
native code, not if you want performance or OS integration".

> After I downloaded winutils.exe and followed the steps in the workaround
> below, it works fine:
> http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7
> 
> 

see also: https://wiki.apache.org/hadoop/WindowsProblems
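
For completeness, once winutils.exe is in place, tests usually just need to be
pointed at it before the first filesystem call; a Scala sketch, where the install
path is an assumption, not a Spark or Hadoop default:

// Assumes winutils.exe has been unpacked to C:\hadoop\bin.
System.setProperty("hadoop.home.dir", "C:\\hadoop")
// Alternatively, set the HADOOP_HOME environment variable before launching the JVM.
val conf = new org.apache.spark.SparkConf().setAppName("winutils-check").setMaster("local[2]")
val sc = new org.apache.spark.SparkContext(conf)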

Now, one thing that could help would be for someone to extend the Spark build 
with a standalone-windows package, including the native libs for the same 
version of Hadoop that Spark was built with. That could be a nice little project 
for someone to work on: something your fellow Windows users will appreciate...





Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
cc'ing dev list

Ok, looks like when the KCL version was updated in
https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
probably leading to a dependency conflict, though as Burak mentions it's hard
to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
in driver or worker logs, so any exception is getting swallowed somewhere.

Run starting. Expected test count is: 4
KinesisStreamSuite:
Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
Kinesis streams for tests.
- KinesisUtils API
- RDD generation
- basic operation *** FAILED ***
  The code passed to eventually never returned normally. Attempted 13 times
over 2.04 minutes. Last failure message: Set() did not equal Set(5, 10,
1, 6, 9, 2, 7, 3, 8, 4)
  Data received does not match data sent. (KinesisStreamSuite.scala:188)
- failure recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 63 times
over 2.02863831 minutes. Last failure message: isCheckpointPresent
was true, but 0 was not greater than 10. (KinesisStreamSuite.scala:228)
Run completed in 5 minutes, 0 seconds.
Total number of tests run: 4
Suites: completed 1, aborted 0
Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
*** 2 TESTS FAILED ***
[INFO]

[INFO] BUILD FAILURE
[INFO]



KCL 1.3.0 depends on *1.9.37* SDK (
https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
while the Spark Kinesis dependency was kept at *1.9.16.*
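
(For application builds that hit the same mismatch, a build-level override is
one way to pin the SDK to the version the KCL expects; an sbt sketch, with the
coordinates assumed from the versions quoted above:)

// build.sbt sketch: force the AWS SDK to the version KCL 1.3.0 was built against.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.5.2",
  "com.amazonaws"    %  "amazon-kinesis-client"       % "1.3.0"
)
dependencyOverrides += "com.amazonaws" % "aws-java-sdk" % "1.9.37"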

I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
1.9.37 and everything works.

Run starting. Expected test count is: 28
KinesisBackedBlockRDDSuite:
Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
Kinesis streams for tests.
- Basic reading from Kinesis
- Read data available in both block manager and Kinesis
- Read data available only in block manager, not in Kinesis
- Read data available only in Kinesis, not in block manager
- Read data available partially in block manager, rest in Kinesis
- Test isBlockValid skips block fetching from block manager
- Test whether RDD is valid after removing blocks from block anager
KinesisStreamSuite:
- KinesisUtils API
- RDD generation
- basic operation
- failure recovery
KinesisReceiverSuite:
- check serializability of SerializableAWSCredentials
- process records including store and checkpoint
- shouldn't store and checkpoint when receiver is stopped
- shouldn't checkpoint when exception occurs during store
- should set checkpoint time to currentTime + checkpoint interval upon
instantiation
- should checkpoint if we have exceeded the checkpoint interval
- shouldn't checkpoint if we have not exceeded the checkpoint interval
- should add to time when advancing checkpoint
- shutdown should checkpoint if the reason is TERMINATE
- shutdown should not checkpoint if the reason is something other than
TERMINATE
- retry success on first attempt
- retry success on second attempt after a Kinesis throttling exception
- retry success on second attempt after a Kinesis dependency exception
- retry failed after a shutdown exception
- retry failed after an invalid state exception
- retry failed after unexpected exception
- retry failed after exhausing all retries
Run completed in 3 minutes, 28 seconds.
Total number of tests run: 28
Suites: completed 4, aborted 0
Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

So this is a regression in Spark Streaming Kinesis 1.5.2 - @Brian can you
file a JIRA for this?

@dev-list, since KCL brings in AWS SDK dependencies itself, is it necessary
to declare an explicit dependency on aws-java-sdk in the Kinesis POM? Also,
from KCL 1.5.0+, only the relevant components used from the AWS SDKs are
brought in, making things a bit leaner (this can be upgraded in Spark
1.7/2.0 perhaps). All local tests (and integration tests) pass with
removing the explicit dependency and only depending on KCL. Is aws-java-sdk
used anywhere else (AFAIK it is not, but in case I missed something let me
know any good reason to keep the explicit dependency)?

N



On Fri, Dec 11, 2015 at 6:55 AM, Nick Pentreath 
wrote:

> Yeah also the integration tests need to be specifically run - I would have
> thought the contributor would have run those tests and also tested the
> change themselves using live Kinesis :(
>
> —
> Sent from Mailbox 
>
>
> On Fri, Dec 11, 2015 at 6:18 AM, Burak Yavuz  wrote:
>
>> I don't think the Kinesis tests specifically ran when that was merged
>> into 1.5.2 :(
>> https://github.com/apache/spark/pull/8957
>>
>> https://github.com/apache/spark/commit/883bd8fccf83aae7a2a847c9a6ca129fac86e6a3
>>
>> AFAIK pom changes don't trigger 

Re:Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Bonsen
Thank you. I found the problem: my package is test, but I had written package 
org.apache.spark.examples, and IDEA had imported 
spark-examples-1.5.2-hadoop2.6.0.jar, so it could run, but it caused lots of 
problems.
__
Now I have changed the package like this:


package test

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object test {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("mytest").setMaster("spark://Master:7077")
    val sc = new SparkContext(conf)

    sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar")  // It doesn't work!?

    val rawData = sc.textFile("/home/hadoop/123.csv")
    val secondData = rawData.flatMap(_.split(",").toString)
    println(secondData.first)   // line 32
    sc.stop()
  }
}

It causes this:
15/12/11 18:41:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
219.216.65.129): java.lang.ClassNotFoundException: test.test$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)




//  219.216.65.129 is my worker computer.
//  I can connect to my worker computer.
//  Spark can start successfully.
//  addFile doesn't work either; the temp file also disappears.
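
(For what it's worth, a ClassNotFoundException on an anonymous function class
like test.test$$anonfun$1 usually means the application's own jar never reached
the executors. A sketch of shipping it via SparkConf, reusing the imports from
the snippet above; the jar path is hypothetical:)

// Package the application (the `test` package) into its own jar and ship
// that to the executors, rather than the Spark assembly jar.
val conf = new SparkConf()
  .setAppName("mytest")
  .setMaster("spark://Master:7077")
  .setJars(Seq("/home/hadoop/mytest_2.10-1.0.jar"))  // hypothetical path
val sc = new SparkContext(conf)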








At 2015-12-10 22:32:21, "Himanshu Mehra [via Apache Spark User List]" 
 wrote:
You are trying to print an array, but it will print the objectID of the 
array if the input is the same as you have shown here. Try flatMap() instead of map() 
and check if the problem is the same.

   --Himanshu






coGroup problem /spark streaming

2015-12-11 Thread Vieru, Mihail
Hi,

we have the following use case for stream processing. We have incoming
events of the following form: (itemID, orderID, orderStatus, timestamp).
There are several itemIDs for each orderID, each with its own timestamp.
The orderStatus can be "created" or "sent". The first incoming itemID
represents "order created" and the last "order sent". We need to issue an
alarm if those two events are spread over too long a time period, i.e. if the
difference between their timestamps exceeds a threshold.

I have implemented this as follows.

source
|
mapToPair
|
|
|--
|(1)|
updateStateByKey |(2)
|   |
coGroup--|
|
|(3)
|
map
|
sink


In *mapToPair* I put the orderID in the pair's first position and the
remaining fields as a Tuple in the second.
In the state I set a flag when both itemIDs for "created" and "sent" are
received and their time difference is computed. If it's bigger than the
threshold, the flag is false. (*updateStateByKey*)
I use *coGroup* and a *map* in order to be able to filter the initial
events and forward them to the sink. The grouping is done by orderID.
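
(For concreteness, a rough Scala DStream sketch of the mapToPair and
updateStateByKey steps described above; `source`, the field types, `thresholdMs`
and the flag semantics are all illustrative, and the coGroup/filter part is
omitted.)

// Assumes source: DStream[(itemID, orderID, orderStatus, timestamp)],
// e.g. DStream[(String, String, String, Long)].
case class OrderState(createdTs: Option[Long] = None,
                      sentTs: Option[Long] = None,
                      alarm: Boolean = false)

// Key by orderID (the "mapToPair" step).
val byOrder = source.map { case (itemId, orderId, status, ts) =>
  (orderId, (itemId, status, ts))
}

// Track created/sent timestamps per order and set the flag
// (the "updateStateByKey" step; requires a checkpoint directory).
val state = byOrder.updateStateByKey[OrderState] { (newEvents, old) =>
  var s = old.getOrElse(OrderState())
  newEvents.foreach { case (_, status, ts) =>
    if (status == "created") s = s.copy(createdTs = Some(ts))
    if (status == "sent")    s = s.copy(sentTs = Some(ts))
  }
  val exceeded = for (c <- s.createdTs; t <- s.sentTs) yield (t - c) > thresholdMs
  Some(s.copy(alarm = exceeded.getOrElse(false)))
}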

The problem is that the coGroup's output doesn't contain all initial
itemIDs for each orderID key. Thus the result is incorrect.

Do you guys have any idea what could cause this issue? Am I
missing/overlooking something?

Best,
Mihail


spark metrics in graphite missing for some executors

2015-12-11 Thread rok
I'm using graphite/grafana to collect and visualize metrics from my spark
jobs.

It appears that not all executors report all the metrics -- for example,
even jvm heap data is missing from some. Is there an obvious reason why
this happens? Are metrics somehow held back? Often, an executor's metrics
will show up with a delay, but since they are aggregate metrics (e.g.
number of completed tasks), it is clear that they are being collected from
the beginning (the level once it appears matches other executors) but for
some reason just don't show up initially.
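
For comparison, the Graphite wiring in conf/metrics.properties typically looks
like the sketch below (host and port are placeholders, not a known fix); one
thing worth double-checking is that the sink and the JVM source apply to all
instances, executors included, and that the executors actually see the file
(its path can be set with spark.metrics.conf).

# conf/metrics.properties sketch; placeholder host/port
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
# enable JVM (heap etc.) metrics on driver and executors
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource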

Any experience with this? How can it be fixed? Right now it's rendering
many metrics useless since I want to have a complete view into the
application and I'm only seeing a few executors at a time.

Thanks,

rok





What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Tom Seddon
I have a job that is running into intermittent errors with [SparkDriver]
java.lang.OutOfMemoryError: Java heap space. Before I was getting this
error, I was getting errors saying the result size exceeded
spark.driver.maxResultSize.
This does not make any sense to me, as there are no actions in my job that
send data to the driver - just a pull of data from S3, a map and
reduceByKey and then conversion to dataframe and saveAsTable action that
puts the results back on S3.

I've found a few references to reduceByKey and spark.driver.maxResultSize
having some importance, but cannot fathom how this setting could be related.

Would greatly appreciate any advice.

Thanks in advance,

Tom


Re: Mesos scheduler obeying limit of tasks / executor

2015-12-11 Thread Iulian Dragoș
Hi Charles,

I am not sure I totally understand your issues, but the spark.task.cpus
limit is imposed at a higher level, for all cluster managers. The code is
in TaskSchedulerImpl.

There is a pending PR to implement spark.executor.cores (and launching
multiple executors on a single worker), but it wasn’t yet merged:
https://github.com/apache/spark/pull/4027
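
In the meantime, raising spark.task.cpus bounds how many concurrent tasks share
one executor's heap, which may partially address the memory/cpu ratio concern;
a sketch with illustrative values:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cpu-mem-ratio-app")
  .set("spark.mesos.coarse", "true")     // coarse-grained Mesos mode
  .set("spark.cores.max", "16")          // total cores the app may acquire
  .set("spark.executor.memory", "8g")    // heap per executor
  .set("spark.task.cpus", "4")           // TaskSchedulerImpl then runs at most
                                         // (executor cores / 4) tasks concurrently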

iulian
​

On Wed, Dec 9, 2015 at 7:23 PM, Charles Allen  wrote:

> I have a spark app in development which has relatively strict cpu/mem
> ratios that are required. As such, I cannot arbitrarily add CPUs to a
> limited memory size.
>
> The general spark cluster behaves as expected, where tasks are launched
> with a specified memory/cpu ratio, but the mesos scheduler seems to ignore
> this.
>
> Specifically, I cannot find where in the code the limit of number of tasks
> per executor of "spark.executor.cores" / "spark.task.cpus" is enforced in
> the MesosBackendScheduler.
>
> The Spark App in question has some JVM heap heavy activities inside a
> RDD.mapPartitionsWithIndex, so having more tasks per limited JVM memory
> resource is bad. The workaround planned handling of this is to limit the
> number of tasks per JVM, which does not seem possible in mesos mode, where
> it seems to just keep stacking on CPUs as tasks come in without adjusting
> any memory constraints, or looking for limits of tasks per executor.
>
> How can I limit the tasks per executor (or per memory pool) in the Mesos
> backend scheduler?
>
> Thanks,
> Charles Allen
>



-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Creation of RDD in foreachAsync is failing

2015-12-11 Thread Madabhattula Rajesh Kumar
Hi,

I have a query below. Please help me solve this.

I have 20,000 ids. I want to join these ids to a table. This table contains
some blob data, so I cannot join all 20,000 ids to this table in one step.

I'm planning to join this table in chunks. For example, in each step I will
join 500 ids. The total number of batches is: 20000/500 = 40

I want to run these 40 batches in parallel. For that, I'm using the
*foreachAsync* method. Now I'm getting the exception below:

*An error has occured: Job aborted due to stage failure: Task 0 in stage
1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1,
localhost): java.lang.IllegalStateException: Cannot call methods on a
stopped SparkContext.*

Code:

var listOfEqs =  // ListBuffer; number of elements is 20,000
var listOfEqsSeq = listOfEqs.grouped(500).toList
var listR = sc.parallelize(listOfEqsSeq)
var asyncRd = new AsyncRDDActions(listR)

val f = asyncRd.foreachAsync { x =>
  val r = sc.parallelize(x).toDF()  // ==> on this line I get the above-mentioned exception
  r.registerTempTable("r")
  val acc = sc.accumulator(0, "My Accumulator")
  var result = sqlContext.sql("SELECT r.id, t.data from r, t where r.id = t.id")
  result.foreach { y =>
    acc += y
  }
  acc.value.foreach(f => // saving values to other db)
}

f.onComplete {
  case scala.util.Success(res) => println(res)
  case scala.util.Failure(e)   => println("An error has occured: " + e.getMessage)
}
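
(A driver-side alternative that avoids calling sc inside the foreachAsync
closure, sketched with Scala Futures; the names come from the snippet above,
`t` is assumed to already be available as a DataFrame, and
import sqlContext.implicits._ is assumed to be in scope.)

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Submit one Spark job per batch of 500 ids from the driver; SparkContext is
// only touched on the driver, never inside another RDD's closure.
val batches = listOfEqs.grouped(500).toList

val jobs = batches.map { batch =>
  Future {
    val ids = sc.parallelize(batch).toDF("id")
    val result = ids.join(t, ids("id") === t("id")).select(t("id"), t("data"))
    result.foreach { row =>
      // save row to the other db
    }
  }
}

Await.result(Future.sequence(jobs), Duration.Inf)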

Please help me to solve this issue

Regards,
Rajesh


Compiling ERROR for Spark MetricsSystem

2015-12-11 Thread Haijia Zhou
Hi,



I'm trying to register a custom Spark metrics source with the Spark metrics system 
using the following code:

val source = new CustomMetricSource() 
SparkEnv.get.metricsSystem.registerSource(source)

Then the code fails to compile with the following error:

Error:scalac: Error: bad symbolic reference. A signature in MetricsSystem.class 
refers to term servlet in value org.jetty which is not available. It may be 
completely missing from the current classpath, or the version on the classpath 
might be incompatible with the version used when compiling MetricsSystem.class. 
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in 
MetricsSystem.class refers to term servlet in value org.jetty which is not 
available. It may be completely missing from the current classpath, or the 
version on the classpath might be incompatible with the version used when 
compiling MetricsSystem.class. at 
scala.reflect.internal.pickling.UnPickler$Scan.toTypeError(UnPickler.scala:847) 
at 
scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(UnPickler.scala:854)
 at 
scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.load(UnPickler.scala:863)
 at scala.reflect.internal.Symbols$Symbol.typeParams(Symbols.scala:1489) at 
scala.tools.nsc.transform.SpecializeTypes$$anonfun$scala$tools$nsc$transform$SpecializeTypes$$normalizeMember$1.apply(SpecializeTypes.scala:798)
 at 
scala.tools.nsc.transform.SpecializeTypes$$anonfun$scala$tools$nsc$transform$SpecializeTypes$$normalizeMember$1.apply(SpecializeTypes.scala:798)
 at scala.reflect.internal.SymbolTable.atPhase(SymbolTable.scala:207) at 
scala.reflect.internal.SymbolTable.beforePhase(SymbolTable.scala:215) at 
scala.tools.nsc.transform.SpecializeTypes.scala$tools$nsc$transform$SpecializeTypes$$normalizeMember(SpecializeTypes.scala:797)
 at 
scala.tools.nsc.transform.SpecializeTypes$$anonfun$22.apply(SpecializeTypes.scala:751)
 at 
scala.tools.nsc.transform.SpecializeTypes$$anonfun$22.apply(SpecializeTypes.scala:749)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at scala.collection.immutable.List.foreach(List.scala:318) at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at 
scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at 
scala.tools.nsc.transform.SpecializeTypes.specializeClass(SpecializeTypes.scala:749)
 at 
scala.tools.nsc.transform.SpecializeTypes.transformInfo(SpecializeTypes.scala:1172)
 at 
scala.tools.nsc.transform.InfoTransform$Phase$$anon$1.transform(InfoTransform.scala:38)
 at scala.reflect.internal.Symbols$Symbol.rawInfo(Symbols.scala:1321) at 
scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1241) at 
scala.reflect.internal.Symbols$Symbol.isDerivedValueClass(Symbols.scala:658) at 
scala.reflect.internal.Symbols$Symbol.isMethodWithExtension(Symbols.scala:661) 
at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.preEraseNormalApply(Erasure.scala:1100)
 at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.preEraseApply(Erasure.scala:1195)
 at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.preErase(Erasure.scala:1205)
 at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.transform(Erasure.scala:1280)
 at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.transform(Erasure.scala:1030)
 at scala.reflect.internal.Trees$$anonfun$itransform$2.apply(Trees.scala:1235) 
at scala.reflect.internal.Trees$$anonfun$itransform$2.apply(Trees.scala:1233) 
at scala.reflect.api.Trees$Transformer.atOwner(Trees.scala:2936) at 
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.atOwner(TypingTransformers.scala:34)
 at 
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.atOwner(TypingTransformers.scala:28)
 at 
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.atOwner(TypingTransformers.scala:19)
 at scala.reflect.internal.Trees$class.itransform(Trees.scala:1232) at 
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:13) at 
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:13) at 
scala.reflect.api.Trees$Transformer.transform(Trees.scala:2897) at 
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:48)
 at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.transform(Erasure.scala:1288)
 at 
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.transform(Erasure.scala:1030)
 at 
scala.reflect.api.Trees$Transformer$$anonfun$transformStats$1.apply(Trees.scala:2927)
 at 
scala.reflect.api.Trees$Transformer$$anonfun$transformStats$1.apply(Trees.scala:2925)
 at scala.collection.immutable.List.loop$1(List.scala:170) at 

Re: default parallelism and mesos executors

2015-12-11 Thread Iulian Dragoș
On Wed, Dec 9, 2015 at 4:29 PM, Adrian Bridgett 
wrote:

> (resending, text only as first post on 2nd never seemed to make it)
>
> Using parallelize() on a dataset I'm only seeing two tasks rather than the
> number of cores in the Mesos cluster.  This is with spark 1.5.1 and using
> the mesos coarse grained scheduler.
>
> Running pyspark in a console seems to show that it's taking a while before
> the mesos executors come online (at which point the default parallelism is
> changing).  If I add "sleep 30" after initialising the SparkContext I get
> the "right" number (42 by coincidence!)
>
> I've just tried increasing minRegisteredResourcesRatio to 0.5 but this
> doesn't affect either the test case below nor my code.
>

This limit seems to be implemented only in the coarse-grained Mesos
scheduler, but the fix will be available starting with Spark 1.6.0 (1.5.2
doesn't have it).
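
Until then, a small driver-side wait can replace the "sleep 30" hack; a rough
Scala sketch (the executor count and timeout are illustrative):

// Block until at least `minExecutors` executors have registered or a timeout
// expires; getExecutorMemoryStatus includes the driver, hence the "+ 1".
def waitForExecutors(sc: org.apache.spark.SparkContext, minExecutors: Int,
                     timeoutMs: Long = 60000): Unit = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (sc.getExecutorMemoryStatus.size < minExecutors + 1 &&
         System.currentTimeMillis() < deadline) {
    Thread.sleep(500)
  }
}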

iulian


>
> Is there something else I can do instead?  Perhaps it should be seeing how
> many tasks _should_ be available rather than how many are (I'm also using
> dynamicAllocation).
>
> 15/12/02 14:34:09 INFO mesos.CoarseMesosSchedulerBackend: SchedulerBackend
> is ready for scheduling beginning after reached
> minRegisteredResourcesRatio: 0.0
> >>>
> >>>
> >>> print (sc.defaultParallelism)
> 2
> >>> 15/12/02 14:34:12 INFO mesos.CoarseMesosSchedulerBackend: Mesos task 5
> is now TASK_RUNNING
> 15/12/02 14:34:13 INFO mesos.MesosExternalShuffleClient: Successfully
> registered app 20151117-115458-164233482-5050-24333-0126 with external
> shuffle service.
> 
> 15/12/02 14:34:15 INFO mesos.CoarseMesosSchedulerBackend: Registered
> executor:
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@ip-10-1-200-147.ec2.internal:41194/user/Executor#-1021429650])
> with ID 20151117-115458-164233482-5050-24333-S22/5
> 15/12/02 14:34:15 INFO spark.ExecutorAllocationManager: New executor
> 20151117-115458-164233482-5050-24333-S22/5 has registered (new total is 1)
> 
> >>> print (sc.defaultParallelism)
> 42
>
> Thanks
>
> Adrian Bridgett
>
>
>


-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Spark Submit - java.lang.IllegalArgumentException: requirement failed

2015-12-11 Thread Afshartous, Nick

Hi,


I'm trying to run a streaming job on a single-node EMR 4.1/Spark 1.5 cluster.
It's throwing an IllegalArgumentException right away on submit.

Attaching full output from console.


Thanks for any insights.

--

Nick



15/12/11 16:44:43 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/12/11 16:44:43 INFO client.RMProxy: Connecting to ResourceManager at 
ip-10-247-129-50.ec2.internal/10.247.129.50:8032
15/12/11 16:44:43 INFO yarn.Client: Requesting a new application from cluster 
with 1 NodeManagers
15/12/11 16:44:43 INFO yarn.Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (54272 MB per container)
15/12/11 16:44:43 INFO yarn.Client: Will allocate AM container, with 11264 MB 
memory including 1024 MB overhead
15/12/11 16:44:43 INFO yarn.Client: Setting up container launch context for our 
AM
15/12/11 16:44:43 INFO yarn.Client: Setting up the launch environment for our 
AM container
15/12/11 16:44:43 INFO yarn.Client: Preparing resources for our AM container
15/12/11 16:44:44 INFO yarn.Client: Uploading resource 
file:/usr/lib/spark/lib/spark-assembly-1.5.0-hadoop2.6.0-amzn-1.jar -> 
hdfs://ip-10-247-129-50.ec2.internal:8020/user/hadoop/.sparkStaging/application_1447\
442727308_0126/spark-assembly-1.5.0-hadoop2.6.0-amzn-1.jar
15/12/11 16:44:44 INFO metrics.MetricsSaver: MetricsConfigRecord 
disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 
disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500\
 lastModified: 1447442734295
15/12/11 16:44:44 INFO metrics.MetricsSaver: Created MetricsSaver 
j-2H3BTA60FGUYO:i-f7812947:SparkSubmit:15603 period:60 
/mnt/var/em/raw/i-f7812947_20151211_SparkSubmit_15603_raw.bin
15/12/11 16:44:45 INFO metrics.MetricsSaver: 1 aggregated HDFSWriteDelay 1276 
raw values into 1 aggregated values, total 1
15/12/11 16:44:45 INFO yarn.Client: Uploading resource 
file:/home/hadoop/spark-pipeline-framework-1.1.6-SNAPSHOT/workflow/lib/spark-kafka-services-1.0.jar
 -> hdfs://ip-10-247-129-50.ec2.internal:8020/user/hadoo\
p/.sparkStaging/application_1447442727308_0126/spark-kafka-services-1.0.jar
15/12/11 16:44:45 INFO yarn.Client: Uploading resource 
file:/home/hadoop/spark-pipeline-framework-1.1.6-SNAPSHOT/conf/AwsCredentials.properties
 -> hdfs://ip-10-247-129-50.ec2.internal:8020/user/hadoop/.sparkSta\
ging/application_1447442727308_0126/AwsCredentials.properties
15/12/11 16:44:45 WARN yarn.Client: Resource 
file:/home/hadoop/spark-pipeline-framework-1.1.6-SNAPSHOT/conf/AwsCredentials.properties
 added multiple times to distributed cache.
15/12/11 16:44:45 INFO yarn.Client: Deleting staging directory 
.sparkStaging/application_1447442727308_0126
Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed
at scala.Predef$.require(Predef.scala:221)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:390)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:390)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:388)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:388)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)


adjust ~/spark-pipeline-framework-1.1.6-SNAPSHOT > 
adjust ~/spark-pipeline-framework-1.1.6-SNAPSHOT > ./bin/run-event-streaming.sh 
conf/dev/nick-malcolm-events.properties  > console.txt
Using properties file: /usr/lib/spark/conf/spark-defaults.conf
Adding default property: spark.executor.extraJavaOptions=-verbose:gc 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
-XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
Adding default property: 
spark.history.fs.logDirectory=hdfs:///var/log/spark/apps
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.shuffle.service.enabled=true
Adding default property: 
spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
Adding default property: 
spark.yarn.historyServer.address=ip-10-247-129-50.ec2.internal:18080
Adding default 

spark-submit problems with --packages and --deploy-mode cluster

2015-12-11 Thread Greg Hill
I'm using Spark 1.5.0 with the standalone scheduler, and for the life of me I 
can't figure out why this isn't working.  I have an application that works fine 
with --deploy-mode client that I'm trying to get to run in cluster mode so I 
can use --supervise.  I ran into a few issues with my configuration that I had 
to sort out (classpath stuff mostly), but now I'm stumped.  We rely on the 
databricks spark csv plugin.  We're loading that using --packages 
"com.databricks:spark-csv_2.11:1.2.0".  This works without issue in client 
mode, but when run in cluster mode, it tries to load the spark-csv jar from 
/root/.ivy2 and fails because that folder doesn't exist on the slave node that 
ends up running the driver.  Does --packages not work when the driver is loaded 
on the cluster?  Does it download the jars in the client before loading the 
driver on the cluster and doesn't pass along the downloaded JARs?

Here's my stderr output:

https://gist.github.com/jimbobhickville/1f10b3508ef946eccb92

Thanks in advance for any suggestions.

Greg



Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Brian London
Yes, it's against master: https://github.com/apache/spark/pull/10256

I'll push the KCL version bump after my local tests finish.

On Fri, Dec 11, 2015 at 10:42 AM Nick Pentreath 
wrote:

> Is that PR against master branch?
>
> S3 read comes from Hadoop / jet3t afaik
>
> —
> Sent from Mailbox 
>
>
> On Fri, Dec 11, 2015 at 5:38 PM, Brian London 
> wrote:
>
>> That's good news  I've got a PR in to up the SDK version to 1.10.40 and
>> the KCL to 1.6.1 which I'm running tests on locally now.
>>
>> Is the AWS SDK not used for reading/writing from S3 or do we get that for
>> free from the Hadoop dependencies?
>>
>> On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath 
>> wrote:
>>
>>> cc'ing dev list
>>>
>>> Ok, looks like when the KCL version was updated in
>>> https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
>>> probably leading to dependency conflict, though as Burak mentions its hard
>>> to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
>>> and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
>>> in driver or worker logs, so any exception is getting swallowed somewhere.
>>>
>>> Run starting. Expected test count is: 4
>>> KinesisStreamSuite:
>>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>>> Kinesis streams for tests.
>>> - KinesisUtils API
>>> - RDD generation
>>> - basic operation *** FAILED ***
>>>   The code passed to eventually never returned normally. Attempted 13
>>> times over 2.04 minutes. Last failure message: Set() did not equal
>>> Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
>>>   Data received does not match data sent. (KinesisStreamSuite.scala:188)
>>> - failure recovery *** FAILED ***
>>>   The code passed to eventually never returned normally. Attempted 63
>>> times over 2.02863831 minutes. Last failure message:
>>> isCheckpointPresent was true, but 0 was not greater than 10.
>>> (KinesisStreamSuite.scala:228)
>>> Run completed in 5 minutes, 0 seconds.
>>> Total number of tests run: 4
>>> Suites: completed 1, aborted 0
>>> Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
>>> *** 2 TESTS FAILED ***
>>> [INFO]
>>> 
>>> [INFO] BUILD FAILURE
>>> [INFO]
>>> 
>>>
>>>
>>> KCL 1.3.0 depends on *1.9.37* SDK (
>>> https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
>>> while the Spark Kinesis dependency was kept at *1.9.16.*
>>>
>>> I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS
>>> SDK 1.9.37 and everything works.
>>>
>>> Run starting. Expected test count is: 28
>>> KinesisBackedBlockRDDSuite:
>>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>>> Kinesis streams for tests.
>>> - Basic reading from Kinesis
>>> - Read data available in both block manager and Kinesis
>>> - Read data available only in block manager, not in Kinesis
>>> - Read data available only in Kinesis, not in block manager
>>> - Read data available partially in block manager, rest in Kinesis
>>> - Test isBlockValid skips block fetching from block manager
>>> - Test whether RDD is valid after removing blocks from block anager
>>> KinesisStreamSuite:
>>> - KinesisUtils API
>>> - RDD generation
>>> - basic operation
>>> - failure recovery
>>> KinesisReceiverSuite:
>>> - check serializability of SerializableAWSCredentials
>>> - process records including store and checkpoint
>>> - shouldn't store and checkpoint when receiver is stopped
>>> - shouldn't checkpoint when exception occurs during store
>>> - should set checkpoint time to currentTime + checkpoint interval upon
>>> instantiation
>>> - should checkpoint if we have exceeded the checkpoint interval
>>> - shouldn't checkpoint if we have not exceeded the checkpoint interval
>>> - should add to time when advancing checkpoint
>>> - shutdown should checkpoint if the reason is TERMINATE
>>> - shutdown should not checkpoint if the reason is something other than
>>> TERMINATE
>>> - retry success on first attempt
>>> - retry success on second attempt after a Kinesis throttling exception
>>> - retry success on second attempt after a Kinesis dependency exception
>>> - retry failed after a shutdown exception
>>> - retry failed after an invalid state exception
>>> - retry failed after unexpected exception
>>> - retry failed after exhausing all retries
>>> Run completed in 3 minutes, 28 seconds.
>>> Total number of tests run: 28
>>> Suites: completed 4, aborted 0
>>> Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
>>> All tests passed.
>>>
>>> So this is a regression in Spark Streaming Kinesis 1.5.2 - @Brian can
>>> you file a JIRA for this?
>>>
>>> @dev-list, since KCL brings in AWS SDK dependencies itself, is it
>>> necessary to declare an explicit 

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Brian London
That's good news  I've got a PR in to up the SDK version to 1.10.40 and the
KCL to 1.6.1 which I'm running tests on locally now.

Is the AWS SDK not used for reading/writing from S3 or do we get that for
free from the Hadoop dependencies?

On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath 
wrote:

> cc'ing dev list
>
> Ok, looks like when the KCL version was updated in
> https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
> probably leading to dependency conflict, though as Burak mentions its hard
> to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
> and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
> in driver or worker logs, so any exception is getting swallowed somewhere.
>
> Run starting. Expected test count is: 4
> KinesisStreamSuite:
> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
> Kinesis streams for tests.
> - KinesisUtils API
> - RDD generation
> - basic operation *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 13
> times over 2.04 minutes. Last failure message: Set() did not equal
> Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
>   Data received does not match data sent. (KinesisStreamSuite.scala:188)
> - failure recovery *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 63
> times over 2.02863831 minutes. Last failure message:
> isCheckpointPresent was true, but 0 was not greater than 10.
> (KinesisStreamSuite.scala:228)
> Run completed in 5 minutes, 0 seconds.
> Total number of tests run: 4
> Suites: completed 1, aborted 0
> Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
> *** 2 TESTS FAILED ***
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
>
>
> KCL 1.3.0 depends on *1.9.37* SDK (
> https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
> while the Spark Kinesis dependency was kept at *1.9.16.*
>
> I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
> 1.9.37 and everything works.
>
> Run starting. Expected test count is: 28
> KinesisBackedBlockRDDSuite:
> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
> Kinesis streams for tests.
> - Basic reading from Kinesis
> - Read data available in both block manager and Kinesis
> - Read data available only in block manager, not in Kinesis
> - Read data available only in Kinesis, not in block manager
> - Read data available partially in block manager, rest in Kinesis
> - Test isBlockValid skips block fetching from block manager
> - Test whether RDD is valid after removing blocks from block anager
> KinesisStreamSuite:
> - KinesisUtils API
> - RDD generation
> - basic operation
> - failure recovery
> KinesisReceiverSuite:
> - check serializability of SerializableAWSCredentials
> - process records including store and checkpoint
> - shouldn't store and checkpoint when receiver is stopped
> - shouldn't checkpoint when exception occurs during store
> - should set checkpoint time to currentTime + checkpoint interval upon
> instantiation
> - should checkpoint if we have exceeded the checkpoint interval
> - shouldn't checkpoint if we have not exceeded the checkpoint interval
> - should add to time when advancing checkpoint
> - shutdown should checkpoint if the reason is TERMINATE
> - shutdown should not checkpoint if the reason is something other than
> TERMINATE
> - retry success on first attempt
> - retry success on second attempt after a Kinesis throttling exception
> - retry success on second attempt after a Kinesis dependency exception
> - retry failed after a shutdown exception
> - retry failed after an invalid state exception
> - retry failed after unexpected exception
> - retry failed after exhausing all retries
> Run completed in 3 minutes, 28 seconds.
> Total number of tests run: 28
> Suites: completed 4, aborted 0
> Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
> All tests passed.
>
> So this is a regression in Spark Streaming Kinesis 1.5.2 - @Brian can you
> file a JIRA for this?
>
> @dev-list, since KCL brings in AWS SDK dependencies itself, is it
> necessary to declare an explicit dependency on aws-java-sdk in the Kinesis
> POM? Also, from KCL 1.5.0+, only the relevant components used from the AWS
> SDKs are brought in, making things a bit leaner (this can be upgraded in
> Spark 1.7/2.0 perhaps). All local tests (and integration tests) pass with
> removing the explicit dependency and only depending on KCL. Is aws-java-sdk
> used anywhere else (AFAIK it is not, but in case I missed something let me
> know any good reason to keep the explicit dependency)?
>
> N
>
>
>
> On Fri, Dec 11, 2015 at 6:55 AM, Nick Pentreath 
> wrote:
>
>> Yeah also the integration tests need to be specifically run - I 

Re: Spark on EMR: out-of-the-box solution for real-time application logs monitoring?

2015-12-11 Thread Roberto Coluccio
Thanks for your advice, Steve.

I'm mainly talking about application logs. To be more clear, think for
instance of
"//hadoop/userlogs/application_blablabla/container_blablabla/stderr_or_stdout",
i.e. YARN application container logs, stored (at least on EMR's Hadoop
2.4) on the DataNodes and aggregated/pushed only once the application completes.

"yarn logs" issued from the cluster Master doesn't allow you to on-demand
aggregate logs for applications the are in running/active state.

For now I managed to install the awslogs agent (
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/CWL_GettingStarted.html)
on the DataNodes so as to push container logs in real time to CloudWatch Logs, but
that's kind of a workaround too. This is why I was wondering what the
community (in general, not only on EMR) uses to monitor application logs in
real time (in an automated fashion) for long-running processes like a
streaming driver, and whether there are out-of-the-box solutions.

Thanks,

Roberto





On Thu, Dec 10, 2015 at 3:06 PM, Steve Loughran 
wrote:

>
> > On 10 Dec 2015, at 14:52, Roberto Coluccio 
> wrote:
> >
> > Hello,
> >
> > I'm investigating on a solution to real-time monitor Spark logs produced
> by my EMR cluster in order to collect statistics and trigger alarms. Being
> on EMR, I found the CloudWatch Logs + Lambda pretty straightforward and,
> since I'm on AWS, those service are pretty well integrated together..but I
> could just find examples about it using on standalone EC2 instances.
> >
> > In my use case, EMR 3.9 and Spark 1.4.1 drivers running on YARN (cluster
> mode), I would like to be able to real-time monitor Spark logs, so not just
> about when the processing ends and they are copied to S3. Is there any
> out-of-the-box solution or best-practice for accomplish this goal when
> running on EMR that I'm not aware of?
> >
> > Spark logs are written on the Data Nodes (Core Instances) local file
> systems as YARN containers logs, so probably installing the awslogs agent
> on them and pointing to those logfiles would help pushing such logs on
> CloudWatch, but I was wondering how the community real-time monitors
> application logs when running Spark on YARN on EMR.
> >
> > Or maybe I'm looking at a wrong solution. Maybe the correct way would be
> using something like a CloudwatchSink so to make Spark (log4j) pushing logs
> directly to the sink and the sink pushing them to CloudWatch (I do like the
> out-of-the-box EMR logging experience and I want to keep the usual eventual
> logs archiving on S3 when the EMR cluster is terminated).
> >
> > Any ideas or experience about this problem?
> >
> > Thank you.
> >
> > Roberto
>
>
> are you talking about event logs as used by the history server, or
> application logs?
>
> the current spark log server writes events to a file, but as the hadoop s3
> fs client doesn't write except in close(), they won't be pushed out while
> things are running. Someone (you?) could have a go at implementing a new
> event listener; some stuff that will come out in Spark 2.0 will make it
> easier to wire this up (SPARK-11314), which is coming as part of some work
> on spark-YARN timelineserver integration.
>
> In Hadoop 2.7.1 the log4j logs can be regularly captured by the YARN
> Nodemanagers and automatically copied out; look at
> yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds. For
> that to work you need to set up your log wildcard patterns for the NM to
> locate (i.e. have rolling logs with the right extensions)...the details
> escape me right now
>
> In earlier versions, you can use "yarn logs' to grab them and pull them
> down.
>
> I don't know anything about cloudwatch integration, sorry
>
>
>


Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
Is that PR against master branch?




S3 read comes from Hadoop / jet3t afaik



—
Sent from Mailbox

On Fri, Dec 11, 2015 at 5:38 PM, Brian London 
wrote:

> That's good news  I've got a PR in to up the SDK version to 1.10.40 and the
> KCL to 1.6.1 which I'm running tests on locally now.
> Is the AWS SDK not used for reading/writing from S3 or do we get that for
> free from the Hadoop dependencies?
> On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath 
> wrote:
>> cc'ing dev list
>>
>> Ok, looks like when the KCL version was updated in
>> https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
>> probably leading to dependency conflict, though as Burak mentions its hard
>> to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
>> and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
>> in driver or worker logs, so any exception is getting swallowed somewhere.
>>
>> Run starting. Expected test count is: 4
>> KinesisStreamSuite:
>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>> Kinesis streams for tests.
>> - KinesisUtils API
>> - RDD generation
>> - basic operation *** FAILED ***
>>   The code passed to eventually never returned normally. Attempted 13
>> times over 2.04 minutes. Last failure message: Set() did not equal
>> Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
>>   Data received does not match data sent. (KinesisStreamSuite.scala:188)
>> - failure recovery *** FAILED ***
>>   The code passed to eventually never returned normally. Attempted 63
>> times over 2.02863831 minutes. Last failure message:
>> isCheckpointPresent was true, but 0 was not greater than 10.
>> (KinesisStreamSuite.scala:228)
>> Run completed in 5 minutes, 0 seconds.
>> Total number of tests run: 4
>> Suites: completed 1, aborted 0
>> Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
>> *** 2 TESTS FAILED ***
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>>
>>
>> KCL 1.3.0 depends on *1.9.37* SDK (
>> https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
>> while the Spark Kinesis dependency was kept at *1.9.16.*
>>
>> I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
>> 1.9.37 and everything works.
>>
>> Run starting. Expected test count is: 28
>> KinesisBackedBlockRDDSuite:
>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>> Kinesis streams for tests.
>> - Basic reading from Kinesis
>> - Read data available in both block manager and Kinesis
>> - Read data available only in block manager, not in Kinesis
>> - Read data available only in Kinesis, not in block manager
>> - Read data available partially in block manager, rest in Kinesis
>> - Test isBlockValid skips block fetching from block manager
>> - Test whether RDD is valid after removing blocks from block anager
>> KinesisStreamSuite:
>> - KinesisUtils API
>> - RDD generation
>> - basic operation
>> - failure recovery
>> KinesisReceiverSuite:
>> - check serializability of SerializableAWSCredentials
>> - process records including store and checkpoint
>> - shouldn't store and checkpoint when receiver is stopped
>> - shouldn't checkpoint when exception occurs during store
>> - should set checkpoint time to currentTime + checkpoint interval upon
>> instantiation
>> - should checkpoint if we have exceeded the checkpoint interval
>> - shouldn't checkpoint if we have not exceeded the checkpoint interval
>> - should add to time when advancing checkpoint
>> - shutdown should checkpoint if the reason is TERMINATE
>> - shutdown should not checkpoint if the reason is something other than
>> TERMINATE
>> - retry success on first attempt
>> - retry success on second attempt after a Kinesis throttling exception
>> - retry success on second attempt after a Kinesis dependency exception
>> - retry failed after a shutdown exception
>> - retry failed after an invalid state exception
>> - retry failed after unexpected exception
>> - retry failed after exhausing all retries
>> Run completed in 3 minutes, 28 seconds.
>> Total number of tests run: 28
>> Suites: completed 4, aborted 0
>> Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
>> All tests passed.
>>
>> So this is a regression in Spark Streaming Kinesis 1.5.2 - @Brian can you
>> file a JIRA for this?
>>
>> @dev-list, since KCL brings in AWS SDK dependencies itself, is it
>> necessary to declare an explicit dependency on aws-java-sdk in the Kinesis
>> POM? Also, from KCL 1.5.0+, only the relevant components used from the AWS
>> SDKs are brought in, making things a bit leaner (this can be upgraded in
>> Spark 1.7/2.0 perhaps). All local tests (and integration tests) pass with
>> removing the explicit dependency and only depending on KCL. Is aws-java-sdk
>> used anywhere 

UNSUBSCRIBE

2015-12-11 Thread williamtellme123



Re: Window function in Spark SQL

2015-12-11 Thread Michael Armbrust
Can you change permissions on that directory so that hive can write to it?
We start up a mini version of hive so that we can use some of its
functionality.

On Fri, Dec 11, 2015 at 12:47 PM, Sourav Mazumder <
sourav.mazumde...@gmail.com> wrote:

> In 1.5.x whenever I try to create a HiveContext from SparkContext I get
> following error. Please note that I'm not running any Hadoop/Hive server in
> my cluster. I'm only running Spark.
>
> I never faced HiveContext creation problem like this previously in 1.4.x.
>
> Is it now a requirement in 1.5.x that to create HIveContext Hive Server
> should be running ?
>
> Regards,
> Sourav
>
>
> -- Forwarded message --
> From: 
> Date: Fri, Dec 11, 2015 at 11:39 AM
> Subject: Re: Window function in Spark SQL
> To: sourav.mazumde...@gmail.com
>
>
> I’m not familiar with that issue, I wasn’t able to reproduce in my
> environment - might want to copy that to the Spark user list. Sorry!
>
> On Dec 11, 2015, at 1:37 PM, Sourav Mazumder 
> wrote:
>
> Hi Ross,
>
> Thanks for your answer.
>
> In 1.5.x whenever I try to create a HiveContext from SparkContext I get
> following error. Please note that I'm not running any Hadoop/Hive server in
> my cluster. I'm only running Spark.
>
> I never faced HiveContext creation problem like this previously
>
> Regards,
> Sourav
>
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch
> dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
> at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
> at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
> at org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:179)
> at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226)
> at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392)
> at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174)
> at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:177)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
> at $iwC$$iwC$$iwC.<init>(<console>:24)
> at $iwC$$iwC.<init>(<console>:26)
> at $iwC.<init>(<console>:28)
> at <init>(<console>:30)
> at .<init>(<console>:34)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> On Fri, Dec 11, 2015 at 10:10 AM, 
> wrote:
>
>> Hey Sourav,
>> Window functions require using a HiveContext rather than the default
>> SQLContext. See here:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext
>> 
>>
>> HiveContext provides all the same functionality of SQLContext, as well as
>> extra features like Window functions.
>>
>> - Ross
>>
>> On Dec 11, 2015, at 12:59 PM, Sourav Mazumder <
>> sourav.mazumde...@gmail.com> wrote:
>>
>> Hi,
>>
>> Spark SQL documentation says that it complies with Hive 1.2.1 APIs and
>> supports Window functions. I'm using Spark 1.5.0.
>>
>> However, when I try to execute something like below I get an error
>>
>> val lol5 = sqlContext.sql("select ky, lead(ky, 5, 0) over (order by ky
>> rows 5 following) from lolt")
>>
>> java.lang.RuntimeException: [1.32] failure: ``union'' expected but `('
>> found select ky, lead(ky, 5, 0) over (order by ky rows 5 following) from
>> lolt ^ at scala.sys.package$.error(package.scala:27) at
>> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
>> at
>> org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
>> at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
>> at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
>> at
>> 

Re: Questions on Kerberos usage with YARN and JDBC

2015-12-11 Thread Todd Simmer
hey Mike,

Are these part of an Active Directory domain? If so, are they pointed at the
AD domain controllers that host the Kerberos server? Windows AD creates SRV
records in DNS to help Windows clients find the Kerberos server for their
domain. If you look, you can see whether you have a KDC record in Windows DNS and
what it's pointing at. Can you do a

kinit username

on that host? It should tell you if it can find the KDC.

Let me know if that's helpful at all.

Todd

On Fri, Dec 11, 2015 at 1:50 PM, Mike Wright  wrote:

> As part of our implementation, we are utilizing a full "Kerberized"
> cluster built on the Hortonworks suite. We're using Job Server as the front
> end to initiate short-run jobs directly from our client-facing product
> suite.
>
> 1) We believe we have configured the job server to start with the
> appropriate credentials, specifying a principal and keytab. We switch to
> YARN-CLIENT mode and can see Job Server attempt to connect to the resource
> manager, and the result is that whatever the principal name is, it "cannot
> impersonate root."  We have been unable to solve this.
>
> 2) We are primarily a Windows shop, hence our cluelessness here. That
> said, we're using the JDBC driver version 4.2 and want to use JavaKerberos
> authentication to connect to SQL Server. The queries performed by the job
> are done in the driver, and hence would be running on the Job Server, which
> we confirmed is running as the principal we have designated. However, when
> attempting to connect with this option enabled I receive a "Unable to
> obtain Principal Name for authentication" exception.
>
> Reading this:
>
> https://msdn.microsoft.com/en-us/library/ms378428.aspx
>
> We have Kerberos working on the machine and thus have krb5.conf setup
> correctly. However the section "Enabling the Domain Configuration File and
> the Login Module Configuration File" seems to indicate we've missed a step
> somewhere.
>
> Forgive my ignorance here ... I've been on Windows for 20 years and this
> is all new to me.
>
> Thanks for any guidance you can provide.
>


UNSUBSCRIBE

2015-12-11 Thread williamtellme123



imposed dynamic resource allocation

2015-12-11 Thread Antony Mayi
Hi,

Using Spark 1.5.2 on YARN (client mode), I was trying to use dynamic resource
allocation, but it seems that once it is enabled by the first app, any
following application is managed that way, even if it explicitly disables it.

Example:
1) YARN is configured with org.apache.spark.network.yarn.YarnShuffleService as
   the spark_shuffle aux class.
2) Running a first app that doesn't specify dynamic allocation / shuffle
   service - it runs as expected with static executors.
3) Running a second application that enables spark.dynamicAllocation.enabled
   and spark.shuffle.service.enabled - it is dynamic as expected.
4) Running another app that doesn't enable (and even explicitly disables)
   dynamic allocation / shuffle service - still the executors are being
   added/removed dynamically throughout the runtime.
5) Restarting the nodemanagers resets this.

Is this a known issue or have I missed something? Can dynamic resource
allocation be enabled per application?

Thanks,
Antony.

Spark REST API shows Error 503 Service Unavailable

2015-12-11 Thread prateek arora


Hi
 
I am trying to access Spark using the REST API but got the error below:

Command :
 
curl http://:18088/api/v1/applications
 
Response:

Error 503 Service Unavailable

HTTP ERROR 503
Problem accessing /api/v1/applications. Reason:
    Service Unavailable
Caused by:
org.spark-project.jetty.servlet.ServletHolder$1:
java.lang.reflect.InvocationTargetException
at
org.spark-project.jetty.servlet.ServletHolder.makeUnavailable(ServletHolder.java:496)
at
org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:543)
at
org.spark-project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:415)
at
org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:657)
at
org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at
org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at
org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at
org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at
org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:301)
at
org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.spark-project.jetty.server.Server.handle(Server.java:370)
at
org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at
org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at
org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at
org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at
org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:728)
at
com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:678)
at
com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203)
at
com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:373)
at
com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:556)
at javax.servlet.GenericServlet.init(GenericServlet.java:244)
at
org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:532)
... 22 more
Caused by: java.lang.NoSuchMethodError:
com.sun.jersey.core.reflection.ReflectionHelper.getOsgiRegistryInstance()Lcom/sun/jersey/core/osgi/OsgiRegistry;
at
com.sun.jersey.spi.scanning.AnnotationScannerListener$AnnotatedClassVisitor.getClassForName(AnnotationScannerListener.java:217)
at
com.sun.jersey.spi.scanning.AnnotationScannerListener$AnnotatedClassVisitor.visitEnd(AnnotationScannerListener.java:186)
at org.objectweb.asm.ClassReader.accept(Unknown Source)
at org.objectweb.asm.ClassReader.accept(Unknown Source)
at
com.sun.jersey.spi.scanning.AnnotationScannerListener.onProcess(AnnotationScannerListener.java:136)
at
com.sun.jersey.core.spi.scanning.JarFileScanner.scan(JarFileScanner.java:97)
at
com.sun.jersey.core.spi.scanning.uri.JarZipSchemeScanner$1.f(JarZipSchemeScanner.java:78)
at com.sun.jersey.core.util.Closing.f(Closing.java:71)
at
com.sun.jersey.core.spi.scanning.uri.JarZipSchemeScanner.scan(JarZipSchemeScanner.java:75)
at
com.sun.jersey.core.spi.scanning.PackageNamesScanner.scan(PackageNamesScanner.java:223)
at
com.sun.jersey.core.spi.scanning.PackageNamesScanner.scan(PackageNamesScanner.java:139)
at

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Harsh J
General note: /root is a protected local directory, meaning that if
your program runs as a non-root user, it will never be able to access the
file.

On Sat, Dec 12, 2015 at 12:21 AM Zhan Zhang  wrote:

> As Sean mentioned, you cannot referring to the local file in your remote
> machine (executors). One walk around is to copy the file to all machines
> within same directory.
>
> Thanks.
>
> Zhan Zhang
>
> On Dec 11, 2015, at 10:26 AM, Lin, Hao  wrote:
>
>  of the master node
>
>
>
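
For illustration, a minimal sketch of the two usual options from spark-shell
(sc is the shell's SparkContext; the paths are illustrative):

import scala.io.Source
import org.apache.spark.SparkFiles

// Option 1 (what the thread suggests): the file has been copied to the same
// local path on every worker node, so each executor can open its own copy.
val count1 = sc.textFile("file:///data/2008.csv").count()

// Option 2: ship the file with the job via addFile; each task then resolves
// its executor-local copy with SparkFiles.get inside the closure.
sc.addFile("file:///root/2008.csv")
val count2 = sc.parallelize(Seq(1))
  .flatMap(_ => Source.fromFile(SparkFiles.get("2008.csv")).getLines())
  .count()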


How to display column names in spark-sql output

2015-12-11 Thread Ashwin Shankar
Hi,
When we run spark-sql, is there a way to get column names/headers with the
result?

-- 
Thanks,
Ashwin


Re: How to display column names in spark-sql output

2015-12-11 Thread Ashwin Sai Shankar
Never mind, it's *set hive.cli.print.header=true*
Thanks !

On Fri, Dec 11, 2015 at 5:16 PM, Ashwin Shankar 
wrote:

> Hi,
> When we run spark-sql, is there a way to get column names/headers with the
> result?
>
> --
> Thanks,
> Ashwin
>
>
>
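
For completeness, a small sketch of the spark-shell (Scala) side of the same
question: DataFrame.show() prints a header row with the column names (the table
name below is illustrative).

// Prints the column names followed by the rows.
sqlContext.sql("SELECT * FROM my_table LIMIT 10").show()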


Re: SparkML. RandomForest predict performance for small dataset.

2015-12-11 Thread Yanbo Liang
I think you are looking for the ability to predict on a single instance. It's
a feature under development; please refer to SPARK-10413.

2015-12-10 4:37 GMT+08:00 Eugene Morozov :

> Hello,
>
> I'm using RandomForest pipeline (ml package). Everything is working fine
> (learning models, prediction, etc), but I'd like to tune it for the case,
> when I predict with small dataset.
> My issue is that when I apply
>
> (PipelineModel)model.transform(dataset)
>
> The model consists of the following stages:
>
> StringIndexerModel labelIndexer = new StringIndexer()...
> RandomForestClassifier classifier = new RandomForestClassifier()...
> IndexToString labelConverter = new IndexToString()...
> Pipeline pipeline = new Pipeline().setStages(new 
> PipelineStage[]{labelIndexer, classifier, labelConverter});
>
> it obviously takes some time to predict, but when my dataset consists of
> just 1 (record) I'd expect it to be really fast.
>
> My observations are even though I use small dataset Spark broadcasts
> something over and over again. That's fine, when I load my (serialized)
> model from disk and use it just once for prediction, but when I use the
> same model in a loop for the same! dataset, I'd say that everything should
> already be on a worker nodes, thus I'd expect prediction to be fast.
> It takes 20 seconds to predict dataset once (with one input row) and all
> subsequent predictions over the same dataset with the same model takes
> roughly 10 seconds.
> My goal is to have 0.5 - 1 second response.
>
> My intention was to keep learned model on a driver (that's stay online
> with created SparkContext) to use it for any subsequent predictions, but
> these 10 seconds predictions basically kill the whole idea.
>
> Is it possible somehow to distribute the model over the cluster upfront so
> that the prediction is really fast?
> Are there any specific params to apply to the PipelineModel to stay
> resident on a worker nodes? Anything to keep and reuse broadcasted data?
>
> Thanks in advance.
> --
> Be well!
> Jean Morozov
>
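
For context, a heavily hedged sketch of one workaround from that era: train
with the RDD-based MLlib API instead, whose RandomForestModel exposes a local
predict(Vector) that runs on the driver without launching a Spark job. This
sidesteps the ml Pipeline from the question rather than speeding it up, and the
hyper-parameters and the `training` RDD below are illustrative.

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.linalg.Vectors

// training: RDD[LabeledPoint], assumed to be prepared elsewhere.
val model = RandomForest.trainClassifier(
  training,
  2,                    // numClasses
  Map.empty[Int, Int],  // categoricalFeaturesInfo
  50,                   // numTrees
  "auto",               // featureSubsetStrategy
  "gini",               // impurity
  5,                    // maxDepth
  32)                   // maxBins

// Runs locally on the driver - no job is scheduled, so it returns in
// milliseconds rather than seconds.
val predictedLabel = model.predict(Vectors.dense(0.1, 2.3, 4.5))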


RE: Re: Spark assembly in Maven repo?

2015-12-11 Thread Xiaoyong Zhu
Yes, so our scenario is to treat the spark assembly as an “SDK” so users can 
develop Spark applications easily without downloading them. In this case which 
way do you guys think might be good?

Xiaoyong
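
For what it's worth, a hedged sketch of the usual alternative to shipping the
assembly itself: depend on the individual Spark artifacts that are published to
Maven Central for every release, marked "provided" so that the cluster's own
assembly supplies them at runtime (the versions below are illustrative).

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.5.2" % "provided"
)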

From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Friday, December 11, 2015 12:08 AM
To: Mark Hamstra 
Cc: Xiaoyong Zhu ; Jeff Zhang ; user 
; Zhaomin Xu ; Joe Zhang (SDE) 

Subject: Re: Re: Spark assembly in Maven repo?

Agree with you that the assembly jar is not good to publish. However, what he
really needs is to fetch
an updatable Maven jar file.


fightf...@163.com

From: Mark Hamstra
Date: 2015-12-11 15:34
To: fightf...@163.com
CC: Xiaoyong Zhu; Jeff 
Zhang; user; Zhaomin 
Xu; Joe Zhang (SDE)
Subject: Re: RE: Spark assembly in Maven repo?
No, publishing a spark assembly jar is not fine.  See the doc attached to 
https://issues.apache.org/jira/browse/SPARK-11157
 and be aware that a likely goal of Spark 2.0 will be the elimination of 
assemblies.

On Thu, Dec 10, 2015 at 11:19 PM, fightf...@163.com 
> wrote:
Using Maven to download the assembly jar is fine. I would recommend deploying
this
assembly jar to your local Maven repo, i.e. a Nexus repo, or more likely a
snapshot repository.


fightf...@163.com

From: Xiaoyong Zhu
Date: 2015-12-11 15:10
To: Jeff Zhang
CC: user@spark.apache.org; Zhaomin 
Xu; Joe Zhang (SDE)
Subject: RE: Spark assembly in Maven repo?
Sorry – I didn’t make it clear. It’s actually not a “dependency” – it’s 
actually that we are building a certain plugin for IntelliJ where we want to 
distribute this jar. But since the jar is updated frequently we don't want to 
distribute it together with our plugin but we would like to download it via 
Maven.

In this case what’s the recommended way?

Xiaoyong

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Thursday, December 10, 2015 11:03 PM
To: Xiaoyong Zhu >
Cc: user@spark.apache.org
Subject: Re: Spark assembly in Maven repo?

I don't think making the assembly jar a dependency is good practice. You may run
into jar-hell issues in that case.

On Fri, Dec 11, 2015 at 2:46 PM, Xiaoyong Zhu 
> wrote:
Hi Experts,

We have a project which has a dependency for the following jar

spark-assembly--hadoop.jar
for example:
spark-assembly-1.4.1.2.3.3.0-2983-hadoop2.7.1.2.3.3.0-2983.jar

since this assembly might be updated in the future, I am not sure if there is a 
Maven repo that has the above spark assembly jar? Or should we create & upload 
it to Maven central?

Thanks!

Xiaoyong




--
Best Regards

Jeff Zhang



Re: Inverse of the matrix

2015-12-11 Thread Zhiliang Zhu
Use matrix SVD decomposition; Spark has the library for it:
http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html#singular-value-decomposition-svd
 


 


On Thursday, December 10, 2015 7:33 PM, Arunkumar Pillai 
 wrote:
 

 Hi
I need to find the inverse of the (X^T * X) matrix. I have found X transpose and
matrix multiplication.
Is there any way to find the inverse of the matrix?


-- 
Thanks and Regards
        Arun
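
As a concrete alternative to the SVD route above: when X has only a modest
number of columns, X^T * X is a small local matrix, so it can be computed with
RowMatrix.computeGramianMatrix(), pulled to the driver, and inverted with Breeze
(which ships with Spark MLlib). A sketch under that assumption; `rows` is an
illustrative RDD[Vector] holding the rows of X, and inv() will fail if X^T * X
is singular.

import breeze.linalg.{inv, DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.DenseMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val gram = new RowMatrix(rows).computeGramianMatrix()   // X^T * X, numCols x numCols

// Matrix.toArray is column-major, which is also Breeze's layout.
val gramBreeze = new BDM[Double](gram.numRows, gram.numCols, gram.toArray)
val gramInv    = inv(gramBreeze)

// Back to an MLlib local matrix if needed (also column-major).
val inverse = new DenseMatrix(gramInv.rows, gramInv.cols, gramInv.data)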

  

Re: Classpath problem trying to use DataFrames

2015-12-11 Thread Harsh J
Do you have all your hive jars listed in the classpath.txt /
SPARK_DIST_CLASSPATH env., specifically the hive-exec jar? Is the location
of that jar also the same on all the distributed hosts?

Passing an explicit executor classpath string may also help overcome this
(replace HIVE_BASE_DIR to the root of your hive installation):

--conf "spark.executor.extraClassPath=$HIVE_BASE_DIR/hive/lib/*"
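
Since the ClassNotFoundException here is raised on the driver, the driver-side
classpath may matter as much as the executors'; a hedged sketch of setting both
(the Hive lib path is illustrative). Note that in yarn-client mode the driver
JVM is already running, so its extra classpath usually has to be supplied before
launch (e.g. --driver-class-path or spark-defaults.conf) rather than via
SparkConf.

import org.apache.spark.SparkConf

val hiveLib = "/opt/hive/lib/*"   // illustrative location of the Hive jars
val conf = new SparkConf()
  .setAppName("dataframe-hive-test")
  .set("spark.executor.extraClassPath", hiveLib)
  .set("spark.driver.extraClassPath", hiveLib)   // effective for cluster-mode submits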

On Sat, Dec 12, 2015 at 6:32 AM Christopher Brady <
christopher.br...@oracle.com> wrote:

> I'm trying to run a basic "Hello world" type example using DataFrames
> with Hive in yarn-client mode. My code is:
>
> JavaSparkContext sc = new JavaSparkContext("yarn-client", "Test app"))
> HiveContext sqlContext = new HiveContext(sc.sc());
> sqlContext.sql("SELECT * FROM my_table").count();
>
> The exception I get on the driver is:
> java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.plan.TableDesc
>
> There are no exceptions on the executors.
>
> That class is definitely on the classpath of the driver, and it runs
> without errors in local mode. I haven't been able to find any similar
> errors on google. Does anyone know what I'm doing wrong?
>
> The full stack trace is included below:
>
> java.lang.NoClassDefFoundError: Lorg/apache/hadoop/hive/ql/plan/TableDesc;
>  at java.lang.Class.getDeclaredFields0(Native Method)
>  at java.lang.Class.privateGetDeclaredFields(Class.java:2436)
>  at java.lang.Class.getDeclaredField(Class.java:1946)
>  at
> java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
>  at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
>  at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
>  at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at java.io.ObjectStreamClass.(ObjectStreamClass.java:468)
>  at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
>  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
>  at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
>  at
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>  at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>  at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>  at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>  at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>  at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>  at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>  at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>  at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>  at 

Classpath problem trying to use DataFrames

2015-12-11 Thread Christopher Brady
I'm trying to run a basic "Hello world" type example using DataFrames 
with Hive in yarn-client mode. My code is:


JavaSparkContext sc = new JavaSparkContext("yarn-client", "Test app"))
HiveContext sqlContext = new HiveContext(sc.sc());
sqlContext.sql("SELECT * FROM my_table").count();

The exception I get on the driver is:
java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.plan.TableDesc

There are no exceptions on the executors.

That class is definitely on the classpath of the driver, and it runs 
without errors in local mode. I haven't been able to find any similar 
errors on google. Does anyone know what I'm doing wrong?


The full stack trace is included below:

java.lang.NoClassDefFoundError: Lorg/apache/hadoop/hive/ql/plan/TableDesc;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2436)
at java.lang.Class.getDeclaredField(Class.java:1946)
at 
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)

at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)

at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-11 Thread Michael Armbrust
The way that we do this is to have a single context with a server in front
that multiplexes jobs that use that shared context.  Even if you aren't
sharing data this is going to give you the best fine grained sharing of the
resources that the context is managing.
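
A minimal sketch of what "one shared context, many jobs" can look like inside
such a server process: concurrent jobs submitted from separate request threads
against a single SparkContext, optionally isolated with FAIR scheduler pools.
The pool names, paths and app name are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shared-context-server")
  .set("spark.scheduler.mode", "FAIR")   // concurrent jobs share executors fairly
val sc = new SparkContext(conf)

def runUserJob(user: String, path: String): Long = {
  // Each request thread tags its jobs with a pool; jobs from different
  // threads run concurrently on the same SparkContext.
  sc.setLocalProperty("spark.scheduler.pool", s"pool-$user")
  try sc.textFile(path).count()
  finally sc.setLocalProperty("spark.scheduler.pool", null)
}

// e.g. two request-handler threads at once:
new Thread(new Runnable { def run(): Unit = println(runUserJob("alice", "hdfs:///data/a")) }).start()
new Thread(new Runnable { def run(): Unit = println(runUserJob("bob", "hdfs:///data/b")) }).start()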

On Fri, Dec 11, 2015 at 10:55 AM, Mike Wright  wrote:

> Somewhat related - What's the correct implementation when you have a
> single cluster to support multiple jobs that are unrelated and NOT sharing
> data? I was directed to figure out, via job server, to support "multiple
> contexts" and explained that multiple contexts per JVM is not really
> supported. So, via job server, how does one support multiple contexts in
> DIFFERENT JVM's? I specify multiple contexts in the conf file and the
> initialization of the subsequent contexts fail.
>
>
>
> On Fri, Dec 4, 2015 at 3:37 PM, Michael Armbrust 
> wrote:
>
>> On Fri, Dec 4, 2015 at 11:24 AM, Anfernee Xu 
>> wrote:
>>
>>> If multiple users are looking at the same data set, then it's good
>>> choice to share the SparkContext.
>>>
>>> But my usercases are different, users are looking at different data(I
>>> use custom Hadoop InputFormat to load data from my data source based on the
>>> user input), the data might not have any overlap. For now I'm taking below
>>> approach
>>>
>>
>> Still if you want fine grained sharing of compute resources as well, you
>> want to using single SparkContext.
>>
>
>


Re: Multi-core support per task in Spark

2015-12-11 Thread Zhan Zhang
I noticed that it is configurable at the job level via spark.task.cpus. Is there
any way to support it at the task level?

Thanks.

Zhan Zhang
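
For reference, the job-level knob mentioned above is just a SparkConf setting; a
minimal sketch (the value 2 and the app name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Every task of this application reserves 2 CPU cores, so an executor with
// spark.executor.cores=8 runs at most 4 such tasks concurrently.
val conf = new SparkConf()
  .setAppName("heavy-per-record-processing")
  .set("spark.task.cpus", "2")
val sc = new SparkContext(conf)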


On Dec 11, 2015, at 10:46 AM, Zhan Zhang  wrote:

> Hi Folks,
> 
> Is it possible to assign multiple core per task and how? Suppose we have some 
> scenario, in which some tasks are really heavy processing each record and 
> require multi-threading, and we want to avoid similar tasks assigned to the 
> same executors/hosts. 
> 
> If it is not supported, does it make sense to add this feature. It may seems 
> make user worry about more configuration, but by default we can still do 1 
> core per task and only advanced users need to be aware of this feature.
> 
> Thanks.
> 
> Zhan Zhang
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using TestHiveContext/HiveContext in unit tests

2015-12-11 Thread Michael Armbrust
Just use TestHive.  It's a global singleton that you can share across test
cases.  It has a reset function if you want to clear out the state at the
beginning of a test.
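
A minimal sketch of that pattern in a ScalaTest suite, assuming the spark-hive
test artifact is on the test classpath so that
org.apache.spark.sql.hive.test.TestHive resolves (the table and suite names are
illustrative):

import org.apache.spark.sql.hive.test.TestHive
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class MyHiveQuerySuite extends FunSuite with BeforeAndAfterAll {

  // TestHive is a shared singleton HiveContext backed by a temporary,
  // embedded metastore - no external Hive installation is required.
  private val sqlContext = TestHive

  override def beforeAll(): Unit = {
    TestHive.reset()   // clear any state left behind by other suites
  }

  test("simple query") {
    sqlContext.sql("CREATE TABLE IF NOT EXISTS t (key INT, value STRING)")
    assert(sqlContext.sql("SELECT * FROM t").count() === 0L)
  }
}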

On Fri, Dec 11, 2015 at 2:06 AM, Sahil Sareen  wrote:

> I'm trying to do this in unit tests:
>
> val sConf = new SparkConf()
>   .setAppName("RandomAppName")
>   .setMaster("local")
> val sc = new SparkContext(sConf)
> val sqlContext = new TestHiveContext(sc)  // tried new
> HiveContext(sc) as well
>
>
> But I get this:
>
> *[scalatest] **Exception encountered when invoking run on a nested suite
> - java.lang.RuntimeException: Unable to instantiate
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient *** ABORTED 
>
> *[scalatest]   java.lang.RuntimeException: java.lang.RuntimeException:
> Unable to instantiate 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient*[scalatest]
>   at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
> [scalatest]   at
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:120)
> [scalatest]   at
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163)
> [scalatest]   at
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161)
> [scalatest]   at
> org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:168)
> [scalatest]   at
> org.apache.spark.sql.hive.test.TestHiveContext.(TestHive.scala:72)
> [scalatest]   at mypackage.NewHiveTest.beforeAll(NewHiveTest.scala:48)
> [scalatest]   at
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
> [scalatest]   at mypackage.NewHiveTest.beforeAll(NewHiveTest.scala:35)
> [scalatest]   at
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
> [scalatest]   at mypackage.NewHiveTest.run(NewHiveTest.scala:35)
> [scalatest]   at
> org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1491)
>
> The code works perfectly when I run using spark-submit, but not in unit
> tests. Any inputs??
>
> -Sahil
>


Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-11 Thread Mike Wright
Thanks for the insight!


___

*Mike Wright*
Principal Architect, Software Engineering
S Capital IQ and SNL

434-951-7816 *p*
434-244-4466 *f*
540-470-0119 *m*

mwri...@snl.com



On Fri, Dec 11, 2015 at 2:38 PM, Michael Armbrust 
wrote:

> The way that we do this is to have a single context with a server in front
> that multiplexes jobs that use that shared context.  Even if you aren't
> sharing data this is going to give you the best fine grained sharing of the
> resources that the context is managing.
>
> On Fri, Dec 11, 2015 at 10:55 AM, Mike Wright  wrote:
>
>> Somewhat related - What's the correct implementation when you have a
>> single cluster to support multiple jobs that are unrelated and NOT sharing
>> data? I was directed to figure out, via job server, to support "multiple
>> contexts" and explained that multiple contexts per JVM is not really
>> supported. So, via job server, how does one support multiple contexts in
>> DIFFERENT JVM's? I specify multiple contexts in the conf file and the
>> initialization of the subsequent contexts fail.
>>
>>
>>
>> On Fri, Dec 4, 2015 at 3:37 PM, Michael Armbrust 
>> wrote:
>>
>>> On Fri, Dec 4, 2015 at 11:24 AM, Anfernee Xu 
>>> wrote:
>>>
 If multiple users are looking at the same data set, then it's good
 choice to share the SparkContext.

 But my usercases are different, users are looking at different data(I
 use custom Hadoop InputFormat to load data from my data source based on the
 user input), the data might not have any overlap. For now I'm taking below
 approach

>>>
>>> Still if you want fine grained sharing of compute resources as well, you
>>> want to using single SparkContext.
>>>
>>
>>
>


how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Hi,

I have problem accessing local file, with such example:

sc.textFile("file:///root/2008.csv").count()

with error: File file:/root/2008.csv does not exist.
The file clearly exists, since if I mistype the file name as a
non-existing one, it will show:

Error: Input path does not exist

Please help!

The following is the error message:

scala> sc.textFile("file:///root/2008.csv").count()
15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498, 
10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does 
not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in 
stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0 (TID 
547, 10.162.167.23): java.io.FileNotFoundException: File file:/root/2008.csv 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at 

Re: Compiling ERROR for Spark MetricsSystem

2015-12-11 Thread Jean-Baptiste Onofré

Hi,

can you check the scala version you use ? 2.10 or 2.11 ?

Thanks,
Regards
JB

On 12/11/2015 05:47 PM, Haijia Zhou wrote:

Hi,


I'm trying to register a custom Spark metrics source to Spark metrics
system with following code:

val source = new CustomMetricSource()
SparkEnv.get.metricsSystem.registerSource(source)

Then the code failed compiling with following error:

Error:scalac: Error: bad symbolic reference. A signature in
MetricsSystem.class refers to term servlet in value org.jetty which is
not available. It may be completely missing from the current classpath,
or the version on the classpath might be incompatible with the version
used when compiling MetricsSystem.class.
scala.reflect.internal.Types$TypeError: bad symbolic reference. A
signature in MetricsSystem.class refers to term servlet in value
org.jetty which is not available. It may be completely missing from the
current classpath, or the version on the classpath might be incompatible
with the version used when compiling MetricsSystem.class. at
scala.reflect.internal.pickling.UnPickler$Scan.toTypeError(UnPickler.scala:847)
at
scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(UnPickler.scala:854)
at
scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.load(UnPickler.scala:863)
at scala.reflect.internal.Symbols$Symbol.typeParams(Symbols.scala:1489)
at
scala.tools.nsc.transform.SpecializeTypes$$anonfun$scala$tools$nsc$transform$SpecializeTypes$$normalizeMember$1.apply(SpecializeTypes.scala:798)
at
scala.tools.nsc.transform.SpecializeTypes$$anonfun$scala$tools$nsc$transform$SpecializeTypes$$normalizeMember$1.apply(SpecializeTypes.scala:798)
at scala.reflect.internal.SymbolTable.atPhase(SymbolTable.scala:207) at
scala.reflect.internal.SymbolTable.beforePhase(SymbolTable.scala:215) at
scala.tools.nsc.transform.SpecializeTypes.scala$tools$nsc$transform$SpecializeTypes$$normalizeMember(SpecializeTypes.scala:797)
at
scala.tools.nsc.transform.SpecializeTypes$$anonfun$22.apply(SpecializeTypes.scala:751)
at
scala.tools.nsc.transform.SpecializeTypes$$anonfun$22.apply(SpecializeTypes.scala:749)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318) at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at
scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at
scala.tools.nsc.transform.SpecializeTypes.specializeClass(SpecializeTypes.scala:749)
at
scala.tools.nsc.transform.SpecializeTypes.transformInfo(SpecializeTypes.scala:1172)
at
scala.tools.nsc.transform.InfoTransform$Phase$$anon$1.transform(InfoTransform.scala:38)
at scala.reflect.internal.Symbols$Symbol.rawInfo(Symbols.scala:1321) at
scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1241) at
scala.reflect.internal.Symbols$Symbol.isDerivedValueClass(Symbols.scala:658)
at
scala.reflect.internal.Symbols$Symbol.isMethodWithExtension(Symbols.scala:661)
at
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.preEraseNormalApply(Erasure.scala:1100)
at
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.preEraseApply(Erasure.scala:1195)
at
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.preErase(Erasure.scala:1205)
at
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.transform(Erasure.scala:1280)
at
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.transform(Erasure.scala:1030)
at
scala.reflect.internal.Trees$$anonfun$itransform$2.apply(Trees.scala:1235)
at
scala.reflect.internal.Trees$$anonfun$itransform$2.apply(Trees.scala:1233)
at scala.reflect.api.Trees$Transformer.atOwner(Trees.scala:2936) at
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.atOwner(TypingTransformers.scala:34)
at
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.atOwner(TypingTransformers.scala:28)
at
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.atOwner(TypingTransformers.scala:19)
at scala.reflect.internal.Trees$class.itransform(Trees.scala:1232) at
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:13) at
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:13) at
scala.reflect.api.Trees$Transformer.transform(Trees.scala:2897) at
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:48)
at
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.transform(Erasure.scala:1288)
at
scala.tools.nsc.transform.Erasure$ErasureTransformer$$anon$1.transform(Erasure.scala:1030)
at
scala.reflect.api.Trees$Transformer$$anonfun$transformStats$1.apply(Trees.scala:2927)
at
scala.reflect.api.Trees$Transformer$$anonfun$transformStats$1.apply(Trees.scala:2925)
at 

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
This issue is due to file permission issue. You need to execute spark
operations using root command only.



Regards,
Vijay Gharge



On Fri, Dec 11, 2015 at 11:20 PM, Vijay Gharge 
wrote:

> One more question. Are you also running spark commands using root user ?
> Meanwhile am trying to simulate this locally.
>
>
> On Friday 11 December 2015, Lin, Hao  wrote:
>
>> Here you go, thanks.
>>
>>
>>
>> -rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv
>>
>>
>>
>> *From:* Vijay Gharge [mailto:vijay.gha...@gmail.com]
>> *Sent:* Friday, December 11, 2015 12:31 PM
>> *To:* Lin, Hao
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: how to access local file from Spark
>> sc.textFile("file:///path to/myfile")
>>
>>
>>
>> Can you provide output of "ls -lh /root/2008.csv" ?
>>
>> On Friday 11 December 2015, Lin, Hao  wrote:
>>
>> Hi,
>>
>>
>>
>> I have problem accessing local file, with such example:
>>
>>
>>
>> sc.textFile("file:///root/2008.csv").count()
>>
>>
>>
>> with error: File file:/root/2008.csv does not exist.
>>
>> The file clearly exists since, since if I missed type the file name to an
>> non-existing one, it will show:
>>
>>
>>
>> Error: Input path does not exist
>>
>>
>>
>> Please help!
>>
>>
>>
>> The following is the error message:
>>
>>
>>
>> scala> sc.textFile("file:///root/2008.csv").count()
>>
>> 15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID
>> 498, 10.162.167.24): java.io.FileNotFoundException: File
>> file:/root/2008.csv does not exist
>>
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>>
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>>
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>>
>> at
>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>>
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>>
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>>
>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>>
>> at
>> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
>>
>> at
>> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>>
>> at
>> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>>
>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>>
>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>> 15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4
>> times; aborting job
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 9
>> in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage
>> 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: File
>> file:/root/2008.csv does not exist
>>
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>>
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>>
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>>
>> at
>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>>
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>>
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>>
>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>>
>> at
>> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
>>
>> at
>> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>>
>> at
>> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>>
>> at 

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Sean Owen
Hm, are you referencing a local file from your remote workers? That
won't work as the file only exists in one machine (I presume).

On Fri, Dec 11, 2015 at 5:19 PM, Lin, Hao  wrote:
> Hi,
>
>
>
> I have problem accessing local file, with such example:
>
>
>
> sc.textFile("file:///root/2008.csv").count()
>
>
>
> with error: File file:/root/2008.csv does not exist.
>
> The file clearly exists since, since if I missed type the file name to an
> non-existing one, it will show:
>
>
>
> Error: Input path does not exist
>
>
>
> Please help!
>
>
>
> The following is the error message:
>
>
>
> scala> sc.textFile("file:///root/2008.csv").count()
>
> 15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498,
> 10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does
> not exist
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
>
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> 15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times;
> aborting job
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in
> stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0
> (TID 547, 10.162.167.23): java.io.FileNotFoundException: File
> file:/root/2008.csv does not exist
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
>
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> 

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Here you go, thanks.

-rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv

From: Vijay Gharge [mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:31 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

Can you provide output of "ls -lh /root/2008.csv" ?

On Friday 11 December 2015, Lin, Hao 
> wrote:
Hi,

I have problem accessing local file, with such example:

sc.textFile("file:///root/2008.csv").count()

with error: File file:/root/2008.csv does not exist.
The file clearly exists since, since if I missed type the file name to an 
non-existing one, it will show:

Error: Input path does not exist

Please help!

The following is the error message:

scala> sc.textFile("file:///root/2008.csv").count()
15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498, 
10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does 
not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in 
stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0 (TID 
547, 10.162.167.23): java.io.FileNotFoundException: File file:/root/2008.csv 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
Please ignore typo. I meant root "permissions"

Regards,
Vijay Gharge



On Fri, Dec 11, 2015 at 11:30 PM, Vijay Gharge 
wrote:

> This issue is due to file permission issue. You need to execute spark
> operations using root command only.
>
>
>
> Regards,
> Vijay Gharge
>
>
>
> On Fri, Dec 11, 2015 at 11:20 PM, Vijay Gharge 
> wrote:
>
>> One more question. Are you also running spark commands using root user ?
>> Meanwhile am trying to simulate this locally.
>>
>>
>> On Friday 11 December 2015, Lin, Hao  wrote:
>>
>>> Here you go, thanks.
>>>
>>>
>>>
>>> -rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv
>>>
>>>
>>>
>>> *From:* Vijay Gharge [mailto:vijay.gha...@gmail.com]
>>> *Sent:* Friday, December 11, 2015 12:31 PM
>>> *To:* Lin, Hao
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: how to access local file from Spark
>>> sc.textFile("file:///path to/myfile")
>>>
>>>
>>>
>>> Can you provide output of "ls -lh /root/2008.csv" ?
>>>
>>> On Friday 11 December 2015, Lin, Hao  wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> I have problem accessing local file, with such example:
>>>
>>>
>>>
>>> sc.textFile("file:///root/2008.csv").count()
>>>
>>>
>>>
>>> with error: File file:/root/2008.csv does not exist.
>>>
>>> The file clearly exists since, since if I missed type the file name to
>>> an non-existing one, it will show:
>>>
>>>
>>>
>>> Error: Input path does not exist
>>>
>>>
>>>
>>> Please help!
>>>
>>>
>>>
>>> The following is the error message:
>>>
>>>
>>>
>>> scala> sc.textFile("file:///root/2008.csv").count()
>>>
>>> 15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID
>>> 498, 10.162.167.24): java.io.FileNotFoundException: File
>>> file:/root/2008.csv does not exist
>>>
>>> at
>>> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>>>
>>> at
>>> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>>>
>>> at
>>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>>>
>>> at
>>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>>>
>>> at
>>> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>>>
>>> at
>>> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>>>
>>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>>>
>>> at
>>> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
>>>
>>> at
>>> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>>>
>>> at
>>> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>>>
>>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>>>
>>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>>>
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>>
>>> at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>
>>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>>
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>>
>>> 15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4
>>> times; aborting job
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 9 in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage
>>> 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: File
>>> file:/root/2008.csv does not exist
>>>
>>> at
>>> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>>>
>>> at
>>> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>>>
>>> at
>>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>>>
>>> at
>>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>>>
>>> at
>>> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>>>
>>> at
>>> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>>>
>>> at 

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
Can you provide output of "ls -lh /root/2008.csv" ?

On Friday 11 December 2015, Lin, Hao  wrote:

> Hi,
>
>
>
> I have problem accessing local file, with such example:
>
>
>
> sc.textFile("file:///root/2008.csv").count()
>
>
>
> with error: File file:/root/2008.csv does not exist.
>
> The file clearly exists since, since if I missed type the file name to an
> non-existing one, it will show:
>
>
>
> Error: Input path does not exist
>
>
>
> Please help!
>
>
>
> The following is the error message:
>
>
>
> scala> sc.textFile("file:///root/2008.csv").count()
>
> 15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID
> 498, 10.162.167.24): java.io.FileNotFoundException: File
> file:/root/2008.csv does not exist
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
>
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> 15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4
> times; aborting job
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 9
> in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage
> 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: File
> file:/root/2008.csv does not exist
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
>
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> 

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
One more question: are you also running Spark commands as the root user?
Meanwhile I am trying to simulate this locally.

On Friday 11 December 2015, Lin, Hao  wrote:

> Here you go, thanks.
>
>
>
> -rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv
>
>
>
> *From:* Vijay Gharge [mailto:vijay.gha...@gmail.com
> ]
> *Sent:* Friday, December 11, 2015 12:31 PM
> *To:* Lin, Hao
> *Cc:* user@spark.apache.org
> 
> *Subject:* Re: how to access local file from Spark
> sc.textFile("file:///path to/myfile")
>
>
>
> Can you provide output of "ls -lh /root/2008.csv" ?
>
> On Friday 11 December 2015, Lin, Hao  > wrote:
>
> Hi,
>
>
>
> I have problem accessing local file, with such example:
>
>
>
> sc.textFile("file:///root/2008.csv").count()
>
>
>
> with error: File file:/root/2008.csv does not exist.
>
> The file clearly exists, since if I mistype the file name to a
> non-existing one, it will show:
>
>
>
> Error: Input path does not exist
>
>
>
> Please help!
>
>
>
> The following is the error message:
>
>
>
> scala> sc.textFile("file:///root/2008.csv").count()
>
> 15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID
> 498, 10.162.167.24): java.io.FileNotFoundException: File
> file:/root/2008.csv does not exist
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
>
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> 15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4
> times; aborting job
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 9
> in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage
> 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: File
> file:/root/2008.csv does not exist
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
>
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at 

Window function in Spark SQL

2015-12-11 Thread Sourav Mazumder
Hi,

Spark SQL documentation says that it complies with Hive 1.2.1 APIs and
supports Window functions. I'm using Spark 1.5.0.

However, when I try to execute something like below I get an error

val lol5 = sqlContext.sql("select ky, lead(ky, 5, 0) over (order by ky rows
5 following) from lolt")

java.lang.RuntimeException: [1.32] failure: ``union'' expected but `('
found select ky, lead(ky, 5, 0) over (order by ky rows 5 following) from
lolt ^ at scala.sys.package$.error(package.scala:27) at
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
at
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:166)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:166)
at
org.apache.spark.sql.execution.datasources.DDLParser.parse(DDLParser.scala:42)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:189) at
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:63)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:68)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:70)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:72)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:74)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:76)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:78) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:80) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:82) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:84) at
$iwC$$iwC$$iwC$$iwC$$iwC.(:86) at
$iwC$$iwC$$iwC$$iwC.(:88) at
$iwC$$iwC$$iwC.(:90) at $iwC$$iwC.(:92) at
$iwC.(:94) at (:96) at .(:100)
at .() at .(:7) at .() at
$print()

Regards,
Sourav


Re: Spark Submit - java.lang.IllegalArgumentException: requirement failed

2015-12-11 Thread Jean-Baptiste Onofré

Hi Nick,

the localizedPath has to be not null, that's why the requirement fails.

In the SparkConf used by the spark-submit (default in 
conf/spark-default.conf), do you have all properties defined, especially 
spark.yarn.keytab ?


Thanks,
Regards
JB

On 12/11/2015 05:49 PM, Afshartous, Nick wrote:


Hi,


I'm trying to run a streaming job on a single node EMR 4.1/Spark 1.5
cluster.  It's throwing an IllegalArgumentException right away on the submit.

Attaching full output from console.


Thanks for any insights.

--

 Nick



15/12/11 16:44:43 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
15/12/11 16:44:43 INFO client.RMProxy: Connecting to ResourceManager at
ip-10-247-129-50.ec2.internal/10.247.129.50:8032
15/12/11 16:44:43 INFO yarn.Client: Requesting a new application from
cluster with 1 NodeManagers
15/12/11 16:44:43 INFO yarn.Client: Verifying our application has not
requested more than the maximum memory capability of the cluster (54272
MB per container)
15/12/11 16:44:43 INFO yarn.Client: Will allocate AM container, with
11264 MB memory including 1024 MB overhead
15/12/11 16:44:43 INFO yarn.Client: Setting up container launch context
for our AM
15/12/11 16:44:43 INFO yarn.Client: Setting up the launch environment
for our AM container
15/12/11 16:44:43 INFO yarn.Client: Preparing resources for our AM container
15/12/11 16:44:44 INFO yarn.Client: Uploading resource
file:/usr/lib/spark/lib/spark-assembly-1.5.0-hadoop2.6.0-amzn-1.jar ->
hdfs://ip-10-247-129-50.ec2.internal:8020/user/hadoop/.sparkStaging/application_1447\
442727308_0126/spark-assembly-1.5.0-hadoop2.6.0-amzn-1.jar
15/12/11 16:44:44 INFO metrics.MetricsSaver: MetricsConfigRecord
disabledInCluster: false instanceEngineCycleSec: 60
clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072
maxInstanceCount: 500\
  lastModified: 1447442734295
15/12/11 16:44:44 INFO metrics.MetricsSaver: Created MetricsSaver
j-2H3BTA60FGUYO:i-f7812947:SparkSubmit:15603 period:60
/mnt/var/em/raw/i-f7812947_20151211_SparkSubmit_15603_raw.bin
15/12/11 16:44:45 INFO metrics.MetricsSaver: 1 aggregated HDFSWriteDelay
1276 raw values into 1 aggregated values, total 1
15/12/11 16:44:45 INFO yarn.Client: Uploading resource
file:/home/hadoop/spark-pipeline-framework-1.1.6-SNAPSHOT/workflow/lib/spark-kafka-services-1.0.jar
-> hdfs://ip-10-247-129-50.ec2.internal:8020/user/hadoo\
p/.sparkStaging/application_1447442727308_0126/spark-kafka-services-1.0.jar
15/12/11 16:44:45 INFO yarn.Client: Uploading resource
file:/home/hadoop/spark-pipeline-framework-1.1.6-SNAPSHOT/conf/AwsCredentials.properties
-> hdfs://ip-10-247-129-50.ec2.internal:8020/user/hadoop/.sparkSta\
ging/application_1447442727308_0126/AwsCredentials.properties
15/12/11 16:44:45 WARN yarn.Client: Resource
file:/home/hadoop/spark-pipeline-framework-1.1.6-SNAPSHOT/conf/AwsCredentials.properties
added multiple times to distributed cache.
15/12/11 16:44:45 INFO yarn.Client: Deleting staging directory
.sparkStaging/application_1447442727308_0126
Exception in thread "main" java.lang.IllegalArgumentException:
requirement failed
 at scala.Predef$.require(Predef.scala:221)
 at
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
 at
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:390)
 at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:390)
 at
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:388)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:388)
 at
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
 at
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
 at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
 at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
 at org.apache.spark.deploy.yarn.Client.main(Client.scala)




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Window function in Spark SQL

2015-12-11 Thread Ross.Cramblit
Hey Sourav,
Window functions require using a HiveContext rather than the default 
SQLContext. See here: 
http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext

HiveContext provides all the same functionality of SQLContext, as well as extra 
features like Window functions.

- Ross
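
A minimal sketch of the difference, assuming Spark 1.5.x built with Hive
support and that a table named lolt with a column ky is already visible to the
context (both names come from the question below; the master URL is only an
example):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val conf = new SparkConf().setAppName("WindowFnSketch").setMaster("local[*]")
  val sc   = new SparkContext(conf)
  // Window functions (lead, lag, row_number, ...) need HiveContext, not SQLContext
  val hiveContext = new HiveContext(sc)
  // assumes the table "lolt" is already registered with this context
  val withLead = hiveContext.sql(
    "select ky, lead(ky, 1) over (order by ky) as next_ky from lolt")
  withLead.show()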

On Dec 11, 2015, at 12:59 PM, Sourav Mazumder 
> wrote:

Hi,

Spark SQL documentation says that it complies with Hive 1.2.1 APIs and supports 
Window functions. I'm using Spark 1.5.0.

However, when I try to execute something like below I get an error

val lol5 = sqlContext.sql("select ky, lead(ky, 5, 0) over (order by ky rows 5 
following) from lolt")

java.lang.RuntimeException: [1.32] failure: ``union'' expected but `(' found 
select ky, lead(ky, 5, 0) over (order by ky rows 5 following) from lolt ^ at 
scala.sys.package$.error(package.scala:27) at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
 at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
 at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169) at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169) at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
 at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at 
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
 at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
 at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:166) at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:166) at 
org.apache.spark.sql.execution.datasources.DDLParser.parse(DDLParser.scala:42) 
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:189) at 
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:63)
 at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:68)
 at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:70)
 at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:72)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:74) 
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:76) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:78) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:80) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:82) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:84) at 
$iwC$$iwC$$iwC$$iwC$$iwC.(:86) at 
$iwC$$iwC$$iwC$$iwC.(:88) at $iwC$$iwC$$iwC.(:90) 
at $iwC$$iwC.(:92) at $iwC.(:94) at 
(:96) at .(:100) at .() at 
.(:7) at .() at $print()

Regards,
Sourav



Re: DataFrame creation delay?

2015-12-11 Thread Isabelle Phan
Hi Harsh,

Thanks a lot for your reply.

I added a predicate to my query to select a single partition in the table,
and tested with both "spark.sql.hive.metastorePartitionPruning" setting on
and off, and there is no difference in DataFrame creation time.

Yes, Michael's proposed workaround works. But I was under the impression
that this workaround was only for Spark version < 1.5. With the Hive
metastore partition pruning feature from Spark 1.5, I thought there would
be no more delay, so I could create DataFrames left and right.

I noticed that regardless of the setting, when I create a DataFrame with or
without a predicate, I get a log message from HadoopFsRelation class which
lists hdfs filepaths to ALL partitions in my table (see the logInfo call in the
code).
Is this expected? I am not sure who is creating this Array of filepaths,
but I am guessing this is the source of my delay.


Thanks,

Isabelle
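
For reference, a minimal sketch of the two workarounds discussed in this
thread, assuming the sqlContext from the shell session is a HiveContext and
that temp.log is partitioned by a date column; the column name log_date and
the literal value are hypothetical examples, not taken from the real schema:

  // 1) Pay the expensive partition lookup once, then reuse the temp table
  sqlContext.table("temp.log").registerTempTable("log_cached")
  val logs = sqlContext.sql("select * from log_cached")

  // 2) Let the metastore prune partitions, and always filter on the
  //    partition column so only the needed partitions are listed
  sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
  val oneDay = sqlContext.sql("select * from temp.log where log_date = '2015-12-11'")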



On Thu, Dec 10, 2015 at 7:38 PM, Harsh J  wrote:

> The option of "spark.sql.hive.metastorePartitionPruning=true" will not
> work unless you have a partition column predicate in your query. Your query
> of "select * from temp.log" does not do this. The slowdown appears to be
> due to the need of loading all partition metadata.
>
> Have you also tried to see if Michael's temp-table suggestion helps you
> cache the expensive partition lookup? (re-quoted below)
>
> """
> If you run sqlContext.table("...").registerTempTable("...") that temptable
> will cache the lookup of partitions [the first time is slow, but subsequent
> lookups will be faster].
> """ - X-Ref: Permalink
> 
>
> Also, do you absolutely need to use "select * from temp.log"? Adding a
> where clause to the query with a partition condition will help Spark prune
> the request to just the required partitions (vs. all, which is proving
> expensive).
>
> On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan  wrote:
>
>> Hi Michael,
>>
>> We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5 since we are
>> on cloudera), and Parquet formatted tables. I turned on  spark
>> .sql.hive.metastorePartitionPruning=true, but DataFrame creation still
>> takes a long time.
>> Is there any other configuration to consider?
>>
>>
>> Thanks a lot for your help,
>>
>> Isabelle
>>
>> On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust 
>> wrote:
>>
>>> If you run sqlContext.table("...").registerTempTable("...") that
>>> temptable will cache the lookup of partitions.
>>>
>>> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan  wrote:
>>>
 Hi Michael,

 Thanks a lot for your reply.

 This table is stored as text file with tab delimited columns.

 You are correct, the problem is because my table has too many
 partitions (1825 in total). Since I am on Spark 1.4, I think I am hitting
 bug 6984 .

 Not sure when my company can move to 1.5. Would you know some
 workaround for this bug?
 If I cannot find workaround for this, will have to change our schema
 design to reduce number of partitions.


 Thanks,

 Isabelle



 On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

> Also, do you mean two partitions or two partition columns?  If there
> are many partitions it can be much slower.  In Spark 1.5 I'd consider
> setting spark.sql.hive.metastorePartitionPruning=true if you have
> predicates over the partition columns.
>
> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> What format is this table.  For parquet and other optimized formats
>> we cache a bunch of file metadata on first access to make interactive
>> queries faster.
>>
>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan 
>> wrote:
>>
>>> Hello,
>>>
>>> I am using SparkSQL to query some Hive tables. Most of the time,
>>> when I create a DataFrame using sqlContext.sql("select * from
>>> table") command, DataFrame creation is less than 0.5 second.
>>> But I have this one table with which it takes almost 12 seconds!
>>>
>>> scala>  val start = scala.compat.Platform.currentTime; val logs =
>>> sqlContext.sql("select * from temp.log"); val execution =
>>> scala.compat.Platform.currentTime - start
>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>>> temp.log
>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>> start: Long = 1441336022731
>>> logs: org.apache.spark.sql.DataFrame = [user_id: 

Re: Re: Spark assembly in Maven repo?

2015-12-11 Thread fightf...@163.com
Agreed that an assembly jar is not good to publish. However, what he really
needs is to fetch an updatable Maven jar file.
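
For illustration, a minimal build.sbt sketch of consuming such a locally
deployed assembly jar, assuming an sbt-based consumer; the resolver URL and
the artifact coordinates below are hypothetical placeholders to be replaced by
whatever you actually publish to your own repository:

  // build.sbt sketch: depend on an assembly jar deployed to an internal
  // snapshot repository (all coordinates below are made up for illustration)
  resolvers += "internal-snapshots" at "https://nexus.example.com/repository/snapshots"
  libraryDependencies += "com.example" % "spark-assembly-hadoop2.7.1" % "1.4.1-SNAPSHOT"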



fightf...@163.com
 
From: Mark Hamstra
Date: 2015-12-11 15:34
To: fightf...@163.com
CC: Xiaoyong Zhu; Jeff Zhang; user; Zhaomin Xu; Joe Zhang (SDE)
Subject: Re: RE: Spark assembly in Maven repo?
No, publishing a spark assembly jar is not fine.  See the doc attached to 
https://issues.apache.org/jira/browse/SPARK-11157 and be aware that a likely 
goal of Spark 2.0 will be the elimination of assemblies.

On Thu, Dec 10, 2015 at 11:19 PM, fightf...@163.com  wrote:
Using maven to download the assembly jar is fine. I would recommend to deploy 
this 
assembly jar to your local maven repo, i.e. nexus repo, Or more likey a 
snapshot repository



fightf...@163.com
 
From: Xiaoyong Zhu
Date: 2015-12-11 15:10
To: Jeff Zhang
CC: user@spark.apache.org; Zhaomin Xu; Joe Zhang (SDE)
Subject: RE: Spark assembly in Maven repo?
Sorry – I didn’t make it clear. It’s not actually a “dependency”: we are
building a plugin for IntelliJ with which we want to distribute this jar. But
since the jar is updated frequently, we don’t want to distribute it together
with our plugin; we would like to download it via Maven instead.
 
In this case what’s the recommended way? 
 
Xiaoyong
 
From: Jeff Zhang [mailto:zjf...@gmail.com] 
Sent: Thursday, December 10, 2015 11:03 PM
To: Xiaoyong Zhu 
Cc: user@spark.apache.org
Subject: Re: Spark assembly in Maven repo?
 
I don't think make the assembly jar as dependency a good practice. You may meet 
jar hell issue in that case. 
 
On Fri, Dec 11, 2015 at 2:46 PM, Xiaoyong Zhu  wrote:
Hi Experts,
 
We have a project which has a dependency for the following jar
 
spark-assembly--hadoop.jar
for example:
spark-assembly-1.4.1.2.3.3.0-2983-hadoop2.7.1.2.3.3.0-2983.jar
 
since this assembly might be updated in the future, I am not sure if there is a 
Maven repo that has the above spark assembly jar? Or should we create & upload 
it to Maven central?
 
Thanks!
 
Xiaoyong
 


 
-- 
Best Regards

Jeff Zhang



Re: Spark Submit - java.lang.IllegalArgumentException: requirement failed

2015-12-11 Thread Afshartous, Nick

Thanks JB.

I'm submitting from the AWS Spark master node, the spark-default.conf is 
pre-deployed by Amazon (attached) and there is no setting
for spark.yarn.keytab.  Is there any doc for setting this up if required in 
this scenario ?

Also, if deploy-mode is switched from cluster to client on spark-submit then
the error no longer appears.  Just wondering if there's any difference between
using client and cluster mode if the submit is being done on the master
node. 

Thanks for any suggestions,
--
Nick


From: Jean-Baptiste Onofré 
Sent: Friday, December 11, 2015 1:01 PM
To: user@spark.apache.org
Subject: Re: Spark Submit - java.lang.IllegalArgumentException: requirement 
failed

Hi Nick,

the localizedPath has to be not null, that's why the requirement fails.

In the SparkConf used by the spark-submit (default in
conf/spark-default.conf), do you have all properties defined, especially
spark.yarn.keytab ?

Thanks,
Regards
JB

On 12/11/2015 05:49 PM, Afshartous, Nick wrote:
>
> Hi,
>
>
> I'm trying to run a streaming job on a single node EMR 4.1/Spark 1.5
> cluster.  It's throwing an IllegalArgumentException right away on the submit.
>
> Attaching full output from console.
>
>
> Thanks for any insights.
>
> --
>
>  Nick
>
>
>
> 15/12/11 16:44:43 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes
> where applicable
> 15/12/11 16:44:43 INFO client.RMProxy: Connecting to ResourceManager at
> ip-10-247-129-50.ec2.internal/10.247.129.50:8032
> 15/12/11 16:44:43 INFO yarn.Client: Requesting a new application from
> cluster with 1 NodeManagers
> 15/12/11 16:44:43 INFO yarn.Client: Verifying our application has not
> requested more than the maximum memory capability of the cluster (54272
> MB per container)
> 15/12/11 16:44:43 INFO yarn.Client: Will allocate AM container, with
> 11264 MB memory including 1024 MB overhead
> 15/12/11 16:44:43 INFO yarn.Client: Setting up container launch context
> for our AM
> 15/12/11 16:44:43 INFO yarn.Client: Setting up the launch environment
> for our AM container
> 15/12/11 16:44:43 INFO yarn.Client: Preparing resources for our AM container
> 15/12/11 16:44:44 INFO yarn.Client: Uploading resource
> file:/usr/lib/spark/lib/spark-assembly-1.5.0-hadoop2.6.0-amzn-1.jar ->
> hdfs://ip-10-247-129-50.ec2.internal:8020/user/hadoop/.sparkStaging/application_1447\
> 442727308_0126/spark-assembly-1.5.0-hadoop2.6.0-amzn-1.jar
> 15/12/11 16:44:44 INFO metrics.MetricsSaver: MetricsConfigRecord
> disabledInCluster: false instanceEngineCycleSec: 60
> clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072
> maxInstanceCount: 500\
>   lastModified: 1447442734295
> 15/12/11 16:44:44 INFO metrics.MetricsSaver: Created MetricsSaver
> j-2H3BTA60FGUYO:i-f7812947:SparkSubmit:15603 period:60
> /mnt/var/em/raw/i-f7812947_20151211_SparkSubmit_15603_raw.bin
> 15/12/11 16:44:45 INFO metrics.MetricsSaver: 1 aggregated HDFSWriteDelay
> 1276 raw values into 1 aggregated values, total 1
> 15/12/11 16:44:45 INFO yarn.Client: Uploading resource
> file:/home/hadoop/spark-pipeline-framework-1.1.6-SNAPSHOT/workflow/lib/spark-kafka-services-1.0.jar
> -> hdfs://ip-10-247-129-50.ec2.internal:8020/user/hadoo\
> p/.sparkStaging/application_1447442727308_0126/spark-kafka-services-1.0.jar
> 15/12/11 16:44:45 INFO yarn.Client: Uploading resource
> file:/home/hadoop/spark-pipeline-framework-1.1.6-SNAPSHOT/conf/AwsCredentials.properties
> -> hdfs://ip-10-247-129-50.ec2.internal:8020/user/hadoop/.sparkSta\
> ging/application_1447442727308_0126/AwsCredentials.properties
> 15/12/11 16:44:45 WARN yarn.Client: Resource
> file:/home/hadoop/spark-pipeline-framework-1.1.6-SNAPSHOT/conf/AwsCredentials.properties
> added multiple times to distributed cache.
> 15/12/11 16:44:45 INFO yarn.Client: Deleting staging directory
> .sparkStaging/application_1447442727308_0126
> Exception in thread "main" java.lang.IllegalArgumentException:
> requirement failed
>  at scala.Predef$.require(Predef.scala:221)
>  at
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
>  at
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:390)
>  at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:390)
>  at
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:388)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:388)
>  at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>  at
> 

Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Zhan Zhang
I think you are fetching too many results to the driver. Typically, it is not 
recommended to collect much data to driver. But if you have to, you can 
increase the driver memory, when submitting jobs.

Thanks.

Zhan Zhang
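
For reference, a hedged sketch of where these limits are usually raised; the
values are arbitrary examples, and spark.driver.memory itself normally has to
be set through spark-submit (e.g. --driver-memory) because the driver JVM is
already running by the time application code builds the SparkConf:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("ReduceByKeyJob")
    // cap on the total size of serialized results sent back to the driver
    .set("spark.driver.maxResultSize", "2g")
  val sc = new SparkContext(conf)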

On Dec 11, 2015, at 6:14 AM, Tom Seddon 
> wrote:

I have a job that is running into intermittent errors with  [SparkDriver] 
java.lang.OutOfMemoryError: Java heap space.  Before I was getting this error I 
was getting errors saying the result size exceed the 
spark.driver.maxResultSize.  This does not make any sense to me, as there are 
no actions in my job that send data to the driver - just a pull of data from 
S3, a map and reduceByKey and then conversion to dataframe and saveAsTable 
action that puts the results back on S3.

I've found a few references to reduceByKey and spark.driver.maxResultSize 
having some importance, but cannot fathom how this setting could be related.

Would greatly appreciated any advice.

Thanks in advance,

Tom



Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Eugen Cepoi
Do you have a large number of tasks? This can happen if you have a large
number of tasks and a small driver, or if you use accumulators of list-like
data structures.
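
A hedged illustration of that second case, assuming an existing SparkContext
sc and a hypothetical input path: every element added by every task is merged
back on the driver, so the driver heap grows with the total data volume rather
than the per-task volume.

  import scala.collection.mutable.ArrayBuffer

  // each task appends its lines; all of them end up in driver memory
  val badAcc = sc.accumulableCollection(ArrayBuffer[String]())
  sc.textFile("hdfs:///some/input").foreach(line => badAcc += line)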

2015-12-11 11:17 GMT-08:00 Zhan Zhang :

> I think you are fetching too many results to the driver. Typically, it is
> not recommended to collect much data to driver. But if you have to, you can
> increase the driver memory, when submitting jobs.
>
> Thanks.
>
> Zhan Zhang
>
> On Dec 11, 2015, at 6:14 AM, Tom Seddon  wrote:
>
> I have a job that is running into intermittent errors with  [SparkDriver]
> java.lang.OutOfMemoryError: Java heap space.  Before I was getting this
> error I was getting errors saying the result size exceed the 
> spark.driver.maxResultSize.
> This does not make any sense to me, as there are no actions in my job that
> send data to the driver - just a pull of data from S3, a map and
> reduceByKey and then conversion to dataframe and saveAsTable action that
> puts the results back on S3.
>
> I've found a few references to reduceByKey and spark.driver.maxResultSize
> having some importance, but cannot fathom how this setting could be related.
>
> Would greatly appreciated any advice.
>
> Thanks in advance,
>
> Tom
>
>
>


Re: Mesos scheduler obeying limit of tasks / executor

2015-12-11 Thread Charles Allen
That answers it thanks!

On Fri, Dec 11, 2015 at 6:37 AM Iulian Dragoș 
wrote:

> Hi Charles,
>
> I am not sure I totally understand your issues, but the spark.task.cpus
> limit is imposed at a higher level, for all cluster managers. The code is
> in TaskSchedulerImpl.
>
> There is a pending PR to implement spark.executor.cores (and launching
> multiple executors on a single worker), but it wasn’t yet merged:
> https://github.com/apache/spark/pull/4027
>
> iulian
>
>
> On Wed, Dec 9, 2015 at 7:23 PM, Charles Allen <
> charles.al...@metamarkets.com> wrote:
>
>> I have a spark app in development which has relatively strict cpu/mem
>> ratios that are required. As such, I cannot arbitrarily add CPUs to a
>> limited memory size.
>>
>> The general spark cluster behaves as expected, where tasks are launched
>> with a specified memory/cpu ratio, but the mesos scheduler seems to ignore
>> this.
>>
>> Specifically, I cannot find where in the code the limit of number of
>> tasks per executor of "spark.executor.cores" / "spark.task.cpus" is
>> enforced in the MesosBackendScheduler.
>>
>> The Spark App in question has some JVM heap heavy activities inside a
>> RDD.mapPartitionsWithIndex, so having more tasks per limited JVM memory
>> resource is bad. The workaround planned handling of this is to limit the
>> number of tasks per JVM, which does not seem possible in mesos mode, where
>> it seems to just keep stacking on CPUs as tasks come in without adjusting
>> any memory constraints, or looking for limits of tasks per executor.
>>
>> How can I limit the tasks per executor (or per memory pool) in the Mesos
>> backend scheduler?
>>
>> Thanks,
>> Charles Allen
>>
>
>
>
> --
>
> --
> Iulian Dragos
>
> --
> Reactive Apps on the JVM
> www.typesafe.com
>
>


Fwd: Window function in Spark SQL

2015-12-11 Thread Sourav Mazumder
In 1.5.x, whenever I try to create a HiveContext from a SparkContext I get the
following error. Please note that I'm not running any Hadoop/Hive server in
my cluster. I'm only running Spark.

I never faced a HiveContext creation problem like this previously in 1.4.x.

Is it now a requirement in 1.5.x that a Hive server must be running in order
to create a HiveContext?

Regards,
Sourav


-- Forwarded message --
From: 
Date: Fri, Dec 11, 2015 at 11:39 AM
Subject: Re: Window function in Spark SQL
To: sourav.mazumde...@gmail.com


I’m not familiar with that issue, I wasn’t able to reproduce in my
environment - might want to copy that to the Spark user list. Sorry!

On Dec 11, 2015, at 1:37 PM, Sourav Mazumder 
wrote:

Hi Ross,

Thanks for your answer.

In 1.5.x, whenever I try to create a HiveContext from a SparkContext I get the
following error. Please note that I'm not running any Hadoop/Hive server in
my cluster. I'm only running Spark.

I never faced a HiveContext creation problem like this previously

Regards,
Sourav

java.lang.RuntimeException: java.lang.RuntimeException: The root scratch
dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx--
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
        at org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
        at org.apache.spark.sql.hive.client.IsolatedClientLoader.(IsolatedClientLoader.scala:179)
        at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226)
        at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
        at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392)
        at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174)
        at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:177)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:15)
        at $iwC$$iwC$$iwC$$iwC$$iwC.(:20)
        at $iwC$$iwC$$iwC$$iwC.(:22)
        at $iwC$$iwC$$iwC.(:24)
        at $iwC$$iwC.(:26)
        at $iwC.(:28)
        at (:30)
        at .(:34)
        at .()
        at .(:7)
        at .()
        at $print()
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

On Fri, Dec 11, 2015 at 10:10 AM,  wrote:

> Hey Sourav,
> Window functions require using a HiveContext rather than the default
> SQLContext. See here:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext
> 
>
> HiveContext provides all the same functionality of SQLContext, as well as
> extra features like Window functions.
>
> - Ross
>
> On Dec 11, 2015, at 12:59 PM, Sourav Mazumder 
> wrote:
>
> Hi,
>
> Spark SQL documentation says that it complies with Hive 1.2.1 APIs and
> supports Window functions. I'm using Spark 1.5.0.
>
> However, when I try to execute something like below I get an error
>
> val lol5 = sqlContext.sql("select ky, lead(ky, 5, 0) over (order by ky
> rows 5 following) from lolt")
>
> java.lang.RuntimeException: [1.32] failure: ``union'' expected but `('
> found select ky, lead(ky, 5, 0) over (order by ky rows 5 following) from
> lolt ^ at scala.sys.package$.error(package.scala:27) at
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
> at
> org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
> at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
> at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
> at
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
> at
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
> at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
> 

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Yes to your question. I spun up a cluster, logged into the master as the root
user, ran spark-shell, and referenced a local file on the master machine.

From: Vijay Gharge [mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:50 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

One more question: are you also running Spark commands as the root user?
Meanwhile I am trying to simulate this locally.

On Friday 11 December 2015, Lin, Hao 
> wrote:
Here you go, thanks.

-rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv

From: Vijay Gharge 
[mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:31 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

Can you provide output of "ls -lh /root/2008.csv" ?

On Friday 11 December 2015, Lin, Hao 
> wrote:
Hi,

I have problem accessing local file, with such example:

sc.textFile("file:///root/2008.csv").count()

with error: File file:/root/2008.csv does not exist.
The file clearly exists, since if I mistype the file name to a
non-existing one, it will show:

Error: Input path does not exist

Please help!

The following is the error message:

scala> sc.textFile("file:///root/2008.csv").count()
15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498, 
10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does 
not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in 
stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0 (TID 
547, 10.162.167.23): java.io.FileNotFoundException: File file:/root/2008.csv 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
   at 

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
I logged into the master of my cluster and referenced the local file on the
master node machine.  And yes, that file only resides on the master node, not
on any of the remote workers.  

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Friday, December 11, 2015 1:00 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

Hm, are you referencing a local file from your remote workers? That won't work 
as the file only exists in one machine (I presume).

On Fri, Dec 11, 2015 at 5:19 PM, Lin, Hao  wrote:
> Hi,
>
>
>
> I have problem accessing local file, with such example:
>
>
>
> sc.textFile("file:///root/2008.csv").count()
>
>
>
> with error: File file:/root/2008.csv does not exist.
>
> The file clearly exists, since if I mistype the file name to a
> non-existing one, it will show:
>
>
>
> Error: Input path does not exist
>
>
>
> Please help!
>
>
>
> The following is the error message:
>
>
>
> scala> sc.textFile("file:///root/2008.csv").count()
>
> 15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 
> (TID 498,
> 10.162.167.24): java.io.FileNotFoundException: File 
> file:/root/2008.csv does not exist
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLoc
> alFileSystem.java:511)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawL
> ocalFileSystem.java:724)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSyst
> em.java:501)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.j
> ava:397)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(
> ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:3
> 39)
>
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java
> :108)
>
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputForm
> at.java:67)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>
> at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:3
> 8)
>
> at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
> ava:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> 15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 
> times; aborting job
>
> org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 9 in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 
> in stage 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: 
> File file:/root/2008.csv does not exist
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLoc
> alFileSystem.java:511)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawL
> ocalFileSystem.java:724)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSyst
> em.java:501)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.j
> ava:397)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(
> ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:3
> 39)
>
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java
> :108)
>
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputForm
> at.java:67)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>
>at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:3
> 8)
>
> 

Questions on Kerberos usage with YARN and JDBC

2015-12-11 Thread Mike Wright
As part of our implementation, we are utilizing a full "Kerberized" cluster
built on the Hortonworks suite. We're using Job Server as the front end to
initiate short-run jobs directly from our client-facing product suite.

1) We believe we have configured the job server to start with the
appropriate credentials, specifying a principal and keytab. We switch to
YARN-CLIENT mode and can see Job Server attempt to connect to the resource
manager, and the result is that whatever the principal name is, it "cannot
impersonate root."  We have been unable to solve this.

2) We are primarily a Windows shop, hence our cluelessness here. That said,
we're using the JDBC driver version 4.2 and want to use JavaKerberos
authentication to connect to SQL Server. The queries performed by the job
are done in the driver, and hence would be running on the Job Server, which
we confirmed is running as the principal we have designated. However, when
attempting to connect with this option enabled I receive an "Unable to
obtain Principal Name for authentication" exception.

Reading this:

https://msdn.microsoft.com/en-us/library/ms378428.aspx

We have Kerberos working on the machine and thus have krb5.conf set up
correctly. However, the section "Enabling the Domain Configuration File and
the Login Module Configuration File" seems to indicate we've missed a step
somewhere.

Forgive my ignorance here ... I've been on Windows for 20 years and this is
all new to me.

Thanks for any guidance you can provide.
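
For what it's worth, a minimal sketch of the Kerberos-authenticated JDBC
connection the MSDN page above describes; the host, port, and database name
are hypothetical, and it assumes the 4.2 driver jar is on the classpath and a
valid Kerberos ticket or keytab login is already in place:

  import java.sql.DriverManager

  // integratedSecurity plus authenticationScheme=JavaKerberos tells the
  // Microsoft JDBC driver to authenticate as the JVM's Kerberos principal
  val url = "jdbc:sqlserver://sqlhost.example.com:1433;databaseName=mydb;" +
    "integratedSecurity=true;authenticationScheme=JavaKerberos"
  val conn = DriverManager.getConnection(url)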


Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
It looks like you have an issue with your classpath; I think it is because
you add a jar containing Spark twice: first, you have a dependency on Spark
somewhere in your build tool (this allows you to compile and run your
application), and second, you re-add Spark here

>  sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar")//It
doesn't work.!?

I recommend you remove that line and see if everything works.
If you have that line because you need Hadoop 2.6, I recommend you build
Spark against that version and publish it locally with Maven.


Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-11 Thread Mike Wright
Somewhat related - what's the correct implementation when you have a single
cluster that must support multiple unrelated jobs that are NOT sharing data? I
was directed to use job server to support "multiple contexts", and it was
explained that multiple contexts per JVM are not really supported. So, via job
server, how does one support multiple contexts in DIFFERENT JVMs? I specify
multiple contexts in the conf file and the initialization of the subsequent
contexts fails.



On Fri, Dec 4, 2015 at 3:37 PM, Michael Armbrust 
wrote:

> On Fri, Dec 4, 2015 at 11:24 AM, Anfernee Xu 
> wrote:
>
>> If multiple users are looking at the same data set, then it's good choice
>> to share the SparkContext.
>>
>> But my usercases are different, users are looking at different data(I use
>> custom Hadoop InputFormat to load data from my data source based on the
>> user input), the data might not have any overlap. For now I'm taking below
>> approach
>>
>
> Still if you want fine grained sharing of compute resources as well, you
> want to using single SparkContext.
>


Re: Performance does not increase as the number of workers increasing in cluster mode

2015-12-11 Thread Zhan Zhang
Not sure of your data and model size, but intuitively there is a tradeoff
between parallelism and network overhead. With the same data set and model,
there is an optimum cluster size (performance may degrade at some point as the
cluster size increases). You may want to test a larger data set if you want to
do a performance benchmark.

Thanks.

Zhan Zhang



On Dec 11, 2015, at 9:34 AM, Wei Da > 
wrote:

Hi, all

I have done a test with different HW configurations of Spark 1.5.0. A KMeans
algorithm was run in four different Spark environments: the first ran in local
mode, the other three ran in cluster mode, and all the nodes have the same CPU
(6 cores) and memory (8G). The running times are recorded in the table below.
I expected performance to increase as the number of workers increases, but the
result shows no obvious improvement. Does anybody know the reason? Thanks a
lot in advance!

The test data has about 2.6 million rows; the input file is about 810M and is
stored in HDFS.
[image: table of running times for the four configurations]


Following is a snapshot of the Spark WebUI.
[image: Spark WebUI screenshot]

Wei Da

Wei Da
xwd0...@qq.com






cluster mode uses port 6066 Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-11 Thread Andy Davidson
Hi Andrew

You are correct I am using cluster mode.

Many thanks

Andy

From:  Andrew Or 
Date:  Thursday, December 10, 2015 at 6:31 PM
To:  Andrew Davidson 
Cc:  Jakob Odersky , "user @spark"

Subject:  Re: Warning: Master endpoint spark://ip:7077 was not a REST
server. Falling back to legacy submission gateway instead.

> Hi Andy,
> 
> You must be running in cluster mode. The Spark Master accepts client mode
> submissions on port 7077 and cluster mode submissions on port 6066. This is
> because standalone cluster mode uses a REST API to submit applications by
> default. If you submit to port 6066 instead the warning should go away.
> 
> -Andrew
> 
> 
> 2015-12-10 18:13 GMT-08:00 Andy Davidson :
>> Hi Jakob
>> 
>> The cluster was set up using the spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>> script
>> 
>> Given my limited knowledge I think this looks okay?
>> 
>> Thanks
>> 
>> Andy
>> 
>> $ sudo netstat -peant | grep 7077
>> 
>> tcp0  0 :::172-31-30-51:7077:::*
>> LISTEN  0  311641427355/java
>> 
>> tcp0  0 :::172-31-30-51:7077:::172-31-30-51:57311
>> ESTABLISHED 0  311591927355/java
>> 
>> tcp0  0 :::172-31-30-51:7077:::172-31-30-51:42333
>> ESTABLISHED 0  373666427355/java
>> 
>> tcp0  0 :::172-31-30-51:7077:::172-31-30-51:49796
>> ESTABLISHED 0  311592527355/java
>> 
>> tcp0  0 :::172-31-30-51:7077:::172-31-30-51:42290
>> ESTABLISHED 0  311592327355/java
>> 
>> 
>> 
>> $ ps -aux | grep 27355
>> 
>> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
>> 
>> ec2-user 23867  0.0  0.0 110404   872 pts/0S+   02:06   0:00 grep 27355
>> 
>> root 27355  0.5  6.7 3679096 515836 ?  Sl   Nov26 107:04
>> /usr/java/latest/bin/java -cp
>> /root/spark/sbin/../conf/:/root/spark/lib/spark-assembly-1.5.1-hadoop1.2.1.ja
>> r:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-r
>> dbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar:/root/ephemeral-hd
>> fs/conf/ -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip
>> ec2-54-215-217-122.us-west-1.compute.amazonaws.com --port 7077
>> --webui-port 8080
>> 
>> 
>> From:  Jakob Odersky 
>> Date:  Thursday, December 10, 2015 at 5:55 PM
>> To:  Andrew Davidson 
>> Cc:  "user @spark" 
>> Subject:  Re: Warning: Master endpoint spark://ip:7077 was not a REST server.
>> Falling back to legacy submission gateway instead.
>> 
>>> Is there any other process using port 7077?
>>> 
>>> On 10 December 2015 at 08:52, Andy Davidson 
>>> wrote:
 Hi
 
 I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning. My
 job seems to run with out any problem.
 
 Kind regards 
 
 Andy
 
 + /root/spark/bin/spark-submit --class com.pws.spark.streaming.IngestDriver
 --master spark://ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077
 --total-executor-cores 2 --deploy-mode cluster
 hdfs:///home/ec2-user/build/ingest-all.jar --clusterMode --dirPath week_3
 
 Running Spark using the REST application submission protocol.
 
 15/12/10 16:46:33 WARN RestSubmissionClient: Unable to connect to server
 spark://ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077.
 
 Warning: Master endpoint
 ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077 was not a
 REST server. Falling back to legacy submission gateway instead.
>>> 
> 



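In practice, the fix Andrew describes amounts to changing only the port in the master URL passed to spark-submit, e.g. --master spark://ec2-54-205-209-122.us-west-1.compute.amazonaws.com:6066, while leaving the other arguments unchanged; when the REST endpoint cannot be reached, submission falls back to the legacy 7077 gateway, which is exactly what the warning above reports.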

HDFS

2015-12-11 Thread shahid ashraf
Hi folks,

I am using a standalone cluster of 50 servers on AWS. I loaded the data onto
HDFS; why am I getting Locality Level ANY for data on HDFS? I have 900+
partitions.


-- 
with Regards
Shahid Ashraf


Multi-core support per task in Spark

2015-12-11 Thread Zhan Zhang
Hi Folks,

Is it possible to assign multiple cores per task, and how? Suppose we have a
scenario in which some tasks do really heavy processing on each record and
require multi-threading, and we want to avoid having similar tasks assigned to the
same executors/hosts.

If it is not supported, does it make sense to add this feature? It may seem to
make users worry about more configuration, but by default we can still use 1 core
per task, and only advanced users would need to be aware of this feature.
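
A coarse-grained form of this already exists through the spark.task.cpus setting, although it applies to every task of the application rather than per stage or per task; a minimal sketch, with placeholder core counts:

import org.apache.spark.{SparkConf, SparkContext}

object HeavyTaskExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("HeavyTaskExample")
      // Reserve 4 cores for every task of this application (application-wide, not per stage),
      // so a task that runs its own threads is not packed next to three other tasks.
      .set("spark.task.cpus", "4")
      // With 4 cores per executor, at most one such task runs on an executor at a time.
      .set("spark.executor.cores", "4")
    val sc = new SparkContext(conf)
    // ... heavy, multi-threaded per-record processing would go here ...
    sc.stop()
  }
}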

Thanks.

Zhan Zhang

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Zhan Zhang
As Sean mentioned, you cannot refer to a local file on your remote machines
(the executors). One workaround is to copy the file to all machines under the
same directory.
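
If copying the file around is inconvenient and the file is small, another common workaround is to read it on the driver and parallelize it; a minimal sketch, with a placeholder path:

import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

object LocalFileExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LocalFileExample"))
    // The file only needs to exist on the driver/master node; it is read there
    // and then distributed as an RDD, so the executors never touch the local path.
    val lines = Source.fromFile("/path/to/myfile").getLines().toList
    val rdd = sc.parallelize(lines)
    println(rdd.count())
    sc.stop()
  }
}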

Thanks.

Zhan Zhang

On Dec 11, 2015, at 10:26 AM, Lin, Hao wrote:

 of the master node



Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
Btw, Spark 1.5 comes with support for hadoop 2.2 by default

On 11 December 2015 at 03:08, Bonsen  wrote:

> Thank you. I found that the problem is my package is test, but I had written
> package org.apache.spark.examples, and IDEA had imported
> spark-examples-1.5.2-hadoop2.6.0.jar, so I could run it, and that caused lots of
> problems.
> __
> Now, I have changed the package like this:
>
> package test
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> object test {
>   def main(args: Array[String]) {
> val conf = new
> SparkConf().setAppName("mytest").setMaster("spark://Master:7077")
> val sc = new SparkContext(conf)
> sc.addJar("/home/hadoop/spark-assembly-1.5.2-hadoop2.6.0.jar") // It doesn't work!?
> val rawData = sc.textFile("/home/hadoop/123.csv")
> val secondData = rawData.flatMap(_.split(",").toString)
> println(secondData.first)  // line 32
> sc.stop()
>   }
> }
> It causes this:
> 15/12/11 18:41:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 219.216.65.129): java.lang.ClassNotFoundException: test.test$$anonfun$1
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> 
> 
> //  219.216.65.129 is my worker computer.
> //  I can connect to my worker computer.
> // Spark can start successfully.
> //  addFile also doesn't work; the temp file also disappears.
>
>
>
>
>
>
> At 2015-12-10 22:32:21, "Himanshu Mehra [via Apache Spark User List]" wrote:
>
> You are trying to print an array, but it will just print the object ID
> of the array if the input is the same as you have shown here. Try flatMap()
> instead of map() and check whether the problem is the same.
>
>--Himanshu
>
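
One note on the ClassNotFoundException above (test.test$$anonfun$1): that class lives in the user's own application jar, not in spark-assembly-1.5.2-hadoop2.6.0.jar, so adding the assembly via sc.addJar does not make it visible to the executors. A minimal sketch of one way to ship the application classes, with a hypothetical jar path; packaging the code and running it through spark-submit achieves the same thing, since spark-submit distributes the application jar automatically:

import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("mytest")
      .setMaster("spark://Master:7077")
      // Ship the jar that actually contains test.test and its anonymous functions;
      // "/home/hadoop/mytest.jar" is a hypothetical path to the packaged application jar.
      .setJars(Seq("/home/hadoop/mytest.jar"))
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).map(_ * 2).count())
    sc.stop()
  }
}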