Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-08 Thread Sasi
Thank you Pankaj. We were able to create the uber JAR (a good way to bundle all
dependency JARs together) and run it on spark-jobserver. That's one step further
than where we were.

However, we are now facing *SparkException: Job aborted due to stage failure: All
masters are unresponsive! Giving up*. We may need to raise another post, as this
is not related to the current one.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Set-EXTRA-JAR-environment-variable-for-spark-jobserver-tp20989p21053.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Getting Output From a Cluster

2015-01-08 Thread Su She
1) Thank you everyone for the help once again...the support here is really
amazing and I hope to contribute soon!

2) The solution I actually ended up using was from this thread:
http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3ccafnzj5ejxdgqju7nbdqy6xureq3d1pcxr+i2s99g5brcj5e...@mail.gmail.com%3E

In case the thread ever goes down, the solution provided by Matei was:

plans.saveAsHadoopFiles("hdfs://localhost:8020/user/hue/output/completed","csv",
String.class, String.class, (Class) TextOutputFormat.class);

I had browsed a lot of similar threads that did not have answers, but found
this one from quite some time ago, so I apologize for posting a question that
had already been answered.

3) Akhil, I was specifying the format as "txt", but it was not compatible.

Thanks for the help!


On Thu, Jan 8, 2015 at 11:23 PM, Akhil Das 
wrote:

> saveAsHadoopFiles requires you to specify the output format which i
> believe you are not specifying anywhere and hence the program crashes.
>
> You could try something like this:
>
> Class<? extends OutputFormat<?, ?>> outputFormatClass =
>     (Class<? extends OutputFormat<?, ?>>) (Class<?>) SequenceFileOutputFormat.class;
>
> yourStream.saveAsNewAPIHadoopFiles(hdfsUrl, "/output-location",Text.class,
> Text.class, outputFormatClass);
>
>
>
> Thanks
> Best Regards
>
> On Fri, Jan 9, 2015 at 10:22 AM, Su She  wrote:
>
>> Yes, I am calling the saveAsHadoopFiles on the Dstream. However, when I
>> call print on the Dstream it works? If I had to do foreachRDD to
>> saveAsHadoopFile, then why is it working for print?
>>
>> Also, if I am doing foreachRDD, do I need connections, or can I simply
>> put the saveAsHadoopFiles inside the foreachRDD function?
>>
>> Thanks Yana for the help! I will play around with foreachRDD and convey
>> my results.
>>
>>
>>
>> On Thu, Jan 8, 2015 at 6:06 PM, Yana Kadiyska 
>> wrote:
>>
>>> are you calling the saveAsText files on the DStream --looks like it?
>>> Look at the section called "Design Patterns for using foreachRDD" in the
>>> link you sent -- you want to do  dstream.foreachRDD(rdd =>
>>> rdd.saveAs)
>>>
>>> On Thu, Jan 8, 2015 at 5:20 PM, Su She  wrote:
>>>
 Hello Everyone,

 Thanks in advance for the help!

 I successfully got my Kafka/Spark WordCount app to print locally.
 However, I want to run it on a cluster, which means that I will have to
 save it to HDFS if I want to be able to read the output.

 I am running Spark 1.1.0, which means according to this document:
 https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html

 I should be able to use commands such as saveAsText/HadoopFiles.

 1) When I try saveAsTextFiles it says:
 cannot find symbol
 [ERROR] symbol  : method
 saveAsTextFiles(java.lang.String,java.lang.String)
 [ERROR] location: class
 org.apache.spark.streaming.api.java.JavaPairDStream

 This makes some sense as saveAsTextFiles is not included here:

 http://people.apache.org/~tdas/spark-1.1.0-temp-docs/api/java/org/apache/spark/streaming/api/java/JavaPairDStream.html

 2) When I try
 saveAsHadoopFiles("hdfs://ipus-west-1.compute.internal:8020/user/testwordcount",
 "txt") it builds, but when I try running it it throws this exception:

 Exception in thread "main" java.lang.RuntimeException:
 java.lang.RuntimeException: class scala.runtime.Nothing$ not
 org.apache.hadoop.mapred.OutputFormat
 at
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2079)
 at
 org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:712)
 at
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1021)
 at
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
 at
 org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:632)
 at
 org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:630)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)
 Caused by: java.lang.RuntimeException: class scala.runtime.Nothing$ not
>>

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-08 Thread Nathan McCarthy
Any ideas? :)

From: Nathan 
mailto:nathan.mccar...@quantium.com.au>>
Date: Wednesday, 7 January 2015 2:53 pm
To: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: SparkSQL schemaRDD & MapPartitions calls - performance issues - 
columnar formats?

Hi,

I’m trying to use a combination of SparkSQL and ‘normal' Spark/Scala via 
rdd.mapPartitions(…). Using the latest release 1.2.0.

Simple example; load up some sample data from parquet on HDFS (about 380m rows, 
10 columns) on a 7 node cluster.

  val t = sqlC.parquetFile("/user/n/sales-tran12m.parquet")
  t.registerTempTable("test1")
  sqlC.cacheTable("test1")

Now let's do some operations on it; I want the total sales and quantities sold for
each hour of the day, so I choose 3 of the 10 possible columns...

  sqlC.sql("select Hour, sum(ItemQty), sum(Sales) from test1 group by 
Hour").collect().foreach(println)

After the table has been 100% cached in memory, this takes around 11 seconds.

Let's do the same thing via a mapPartitions call (this isn't production-ready
code, but it gets the job done).

  val try2 = sqlC.sql("select Hour, ItemQty, Sales from test1")
  try2.mapPartitions { hrs =>
    val qtySum = new Array[Double](24)
    val salesSum = new Array[Double](24)

    for (r <- hrs) {
      val hr = r.getInt(0)
      qtySum(hr) += r.getDouble(1)
      salesSum(hr) += r.getDouble(2)
    }
    (salesSum zip qtySum).zipWithIndex.map(_.swap).iterator
  }.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)).collect().foreach(println)

Now this takes around ~49 seconds… Even though test1 table is 100% cached. The 
number of partitions remains the same…

Now if I create a simple RDD of a case class HourSum(hour: Int, qty: Double, 
sales: Double)

Convert the SchemaRDD;
val rdd = sqlC.sql("select * from test1").map{ r => HourSum(r.getInt(1), 
r.getDouble(7), r.getDouble(8)) }.cache()
//cache all the data
rdd.count()

Then run basically the same MapPartitions query;

rdd.mapPartitions { case hrs =>
  val qtySum = new Array[Double](24)
  val salesSum = new Array[Double](24)

  for(r <- hrs) {
val hr = r.hour
qtySum(hr) += r.qty
salesSum(hr) += r.sales
  }
  (salesSum zip qtySum).zipWithIndex.map(_.swap).iterator
}.reduceByKey((a,b) => (a._1 + b._1, a._2 + b._2)).collect().foreach(println)

This takes around 1.5 seconds! Albeit the memory footprint is much larger.

My thinking is that because SparkSQL stores things in a columnar format, there
is some unwrapping to be done out of the column array buffers, which takes time,
and for some reason this just takes longer when I switch to mapPartitions (maybe
it's unwrapping the entire row even though I'm using just a subset of columns,
or maybe there is some object creation/autoboxing going on when calling getInt
or getDouble)…

I’ve tried simpler cases too, like just summing sales. Running sum via SQL is 
fast (4.7 seconds), running a mapPartition sum on a double RDD is even faster 
(2.6 seconds). But MapPartitions on the SchemaRDD;

sqlC.sql("select SalesInclGST from test1").mapPartitions(iter => 
Iterator(iter.foldLeft(0.0)((t,r) => t+r.getDouble(0.sum

 takes a long time (33 seconds). In all these examples everything is fully 
cached in memory. And yes for these kinds of operations I can use SQL, but for 
more complex queries I’d much rather be using a combo of SparkSQL to select the 
data (so I get nice things like Parquet pushdowns etc.) & functional Scala!

I think I’m doing something dumb… Is there something I should be doing to get 
faster performance on MapPartitions on SchemaRDDs? Is there some unwrapping 
going on in the background that catalyst does in a smart way that I’m missing?

Cheers,
~N

Nathan McCarthy
QUANTIUM
Level 25, 8 Chifley, 8-12 Chifley Square
Sydney NSW 2000

T: +61 2 8224 8922
F: +61 2 9292 6444

W: quantium.com.au



linkedin.com/company/quantium

facebook.com/QuantiumAustralia

twitter.com/QuantiumAU




Re: Join RDDs with DStreams

2015-01-08 Thread Akhil Das
Here's how you do it:

val joined_stream = myStream.transform((x: RDD[(String, String)]) => {
  val prdd = new PairRDDFunctions[String, String](x)
  prdd.join(myRDD)
})
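
For context, a minimal self-contained sketch of the same pattern for the use case
in the question below (assumptions: ssc is the StreamingContext, myStream is a
DStream[(String, String)], and the HDFS path and tab-separated lookup layout are
made up):

import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// static lookup table, loaded once from HDFS
val lookup: RDD[(String, String)] = ssc.sparkContext
  .textFile("hdfs:///path/to/lookup.tsv")
  .map { line => val Array(k, v) = line.split("\t"); (k, v) }

// join every micro-batch of the stream against the static RDD
val joined = myStream.transform(rdd => rdd.join(lookup))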




Thanks
Best Regards

On Thu, Jan 8, 2015 at 10:20 PM, Asim Jalis  wrote:

> Is there a way to join non-DStream RDDs with DStream RDDs?
>
> Here is the use case. I have a lookup table stored in HDFS that I want to
> read as an RDD. Then I want to join it with the RDDs that are coming in
> through the DStream. How can I do this?
>
> Thanks.
>
> Asim
>


Re: Getting Output From a Cluster

2015-01-08 Thread Akhil Das
saveAsHadoopFiles requires you to specify the output format which i believe
you are not specifying anywhere and hence the program crashes.

You could try something like this:

Class<? extends OutputFormat<?, ?>> outputFormatClass =
    (Class<? extends OutputFormat<?, ?>>) (Class<?>) SequenceFileOutputFormat.class;

yourStream.saveAsNewAPIHadoopFiles(hdfsUrl, "/output-location",Text.class,
Text.class, outputFormatClass);
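
A Scala sketch of the same idea (the stream name, key/value types and HDFS path
are made up for illustration): passing the output format class explicitly is what
keeps it from being inferred as scala.runtime.Nothing$.

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits

// assume wordCounts: DStream[(Text, IntWritable)]
wordCounts.saveAsNewAPIHadoopFiles(
  "hdfs://namenode:8020/user/test/wordcount", "txt",
  classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]])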



Thanks
Best Regards

On Fri, Jan 9, 2015 at 10:22 AM, Su She  wrote:

> Yes, I am calling the saveAsHadoopFiles on the Dstream. However, when I
> call print on the Dstream it works? If I had to do foreachRDD to
> saveAsHadoopFile, then why is it working for print?
>
> Also, if I am doing foreachRDD, do I need connections, or can I simply put
> the saveAsHadoopFiles inside the foreachRDD function?
>
> Thanks Yana for the help! I will play around with foreachRDD and convey my
> results.
>
> Suhas Shekar
>
> University of California, Los Angeles
> B.A. Economics, Specialization in Computing 2014
>
> On Thu, Jan 8, 2015 at 6:06 PM, Yana Kadiyska 
> wrote:
>
>> are you calling the saveAsText files on the DStream --looks like it? Look
>> at the section called "Design Patterns for using foreachRDD" in the link
>> you sent -- you want to do  dstream.foreachRDD(rdd => rdd.saveAs)
>>
>> On Thu, Jan 8, 2015 at 5:20 PM, Su She  wrote:
>>
>>> Hello Everyone,
>>>
>>> Thanks in advance for the help!
>>>
>>> I successfully got my Kafka/Spark WordCount app to print locally.
>>> However, I want to run it on a cluster, which means that I will have to
>>> save it to HDFS if I want to be able to read the output.
>>>
>>> I am running Spark 1.1.0, which means according to this document:
>>> https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html
>>>
>>> I should be able to use commands such as saveAsText/HadoopFiles.
>>>
>>> 1) When I try saveAsTextFiles it says:
>>> cannot find symbol
>>> [ERROR] symbol  : method
>>> saveAsTextFiles(java.lang.String,java.lang.String)
>>> [ERROR] location: class
>>> org.apache.spark.streaming.api.java.JavaPairDStream
>>>
>>> This makes some sense as saveAsTextFiles is not included here:
>>>
>>> http://people.apache.org/~tdas/spark-1.1.0-temp-docs/api/java/org/apache/spark/streaming/api/java/JavaPairDStream.html
>>>
>>> 2) When I try
>>> saveAsHadoopFiles("hdfs://ipus-west-1.compute.internal:8020/user/testwordcount",
>>> "txt") it builds, but when I try running it it throws this exception:
>>>
>>> Exception in thread "main" java.lang.RuntimeException:
>>> java.lang.RuntimeException: class scala.runtime.Nothing$ not
>>> org.apache.hadoop.mapred.OutputFormat
>>> at
>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2079)
>>> at
>>> org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:712)
>>> at
>>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1021)
>>> at
>>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
>>> at
>>> org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:632)
>>> at
>>> org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:630)
>>> at
>>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
>>> at
>>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>>> at
>>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>>> at scala.util.Try$.apply(Try.scala:161)
>>> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
>>> at
>>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:724)
>>> Caused by: java.lang.RuntimeException: class scala.runtime.Nothing$ not
>>> org.apache.hadoop.mapred.OutputFormat
>>> at
>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2073)
>>> ... 14 more
>>>
>>>
>>> Any help is really appreciated! Thanks.
>>>
>>>
>>
>


Re: Cannot save RDD as text file to local file system

2015-01-08 Thread Akhil Das
Are you running the program in local mode or in standalone cluster mode?
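
If it is a standalone cluster, here is a sketch of one workaround (assumption: the
RDD is small enough to collect): with a "file://" destination each executor tries
to create /home/cloudera/tmp/out1 on its own machine, which is what the Mkdirs
error below suggests, so collecting and writing on the driver sidesteps that.

import java.io.PrintWriter

val r = sc.parallelize(Array("a", "b", "c"))
val out = new PrintWriter("/home/cloudera/tmp/out1.txt")
try r.collect().foreach(line => out.println(line)) finally out.close()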

Thanks
Best Regards

On Fri, Jan 9, 2015 at 10:12 AM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

>  I try to save RDD as text file to local file system (Linux) but it does
> not work
>
>
>
> Launch spark-shell and run the following
>
>
>
> val r = sc.parallelize(Array("a", "b", "c"))
>
> r.saveAsTextFile("file:///home/cloudera/tmp/out1")
>
>
>
>
>
> IOException: Mkdirs failed to create
>
>
> file:/home/cloudera/tmp/out1/_temporary/0/_temporary/attempt_201501082027_0003_m_00_47
>
> (exists=false, cwd=file:/var/run/spark/work/app-20150108201046-0021/0)
>
> at
>
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
>
> at
>
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
>
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
>
> at
>
>
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>
> at
> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>
> at
>
>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
>
> at
>
>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>
> at
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>
> at
>
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
>
>
> I also try with 4 slash but still get the same error
>
> r.saveAsTextFile("file:home/cloudera/tmp/out1")
>
>
>
> Please advise
>
>
>
> Ningjun
>
>
>


Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Raghavendra Pandey
I came across this: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.
You can take a look.

On Fri Jan 09 2015 at 12:08:49 PM Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:

> I have the similar kind of requirement where I want to push avro data into
> parquet. But it seems you have to do it on your own. There is parquet-mr
> project that uses hadoop to do so. I am trying to write a spark job to do
> similar kind of thing.
>
> On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam  wrote:
>
>> Hi spark users,
>>
>> I'm using spark SQL to create parquet files on HDFS. I would like to
>> store the avro schema into the parquet meta so that non spark sql
>> applications can marshall the data without avro schema using the avro
>> parquet reader. Currently, schemaRDD.saveAsParquetFile does not allow to do
>> that. Is there another API that allows me to do this?
>>
>> Best Regards,
>>
>> Jerry
>>
>
>


Re: Did anyone tried overcommit of CPU cores?

2015-01-08 Thread Jörn Franke
Hello,

Based on experience with other software in virtualized environments, I
cannot really recommend this. However, I am not sure how Spark reacts. You
may face unpredictable task failures depending on utilization; tasks
connecting to external systems (databases etc.) may fail unexpectedly, and
this might be a problem for them (transactions not finishing etc.).

Why not increase the number of tasks per core?
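
One way to read that suggestion, as a sketch (the numbers are only illustrative):
keep the worker at its 24 physical cores and raise the default parallelism
instead, so each core cycles through several smaller tasks and I/O waits overlap
naturally.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("more-tasks-per-core")
  .set("spark.default.parallelism", (24 * 3).toString)   // ~3 tasks per physical core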

Best regards
On 9 Jan 2015 at 06:46, "Xuelin Cao"  wrote:

>
> Hi,
>
>   I'm wondering whether it is a good idea to overcommit CPU cores on
> the spark cluster.
>
>   For example, in our testing cluster, each worker machine has 24
> physical CPU cores. However, we are allowed to set the CPU core number to
> 48 or more in the spark configuration file. As a result, we are allowed to
> launch more tasks than the number of physical CPU cores.
>
>   The motivation of overcommit CPU cores is, for many times, a task
> cannot consume 100% resource of a single CPU core (due to I/O, shuffle,
> etc.).
>
>   So, overcommit the CPU cores allows more tasks running at the same
> time, and makes the resource be used economically.
>
>   But, is there any reason that we should not doing like this? Anyone
> tried this?
>
>
>
>


Parallel execution on one node

2015-01-08 Thread mikens
Hello,
I am new to Spark. I have adapted an example code to do binary
classification using logistic regression. I tried it on rcv1_train.binary
dataset using LBFGS.runLBFGS solver, and obtained correct loss.

Now, I'd like to run the code in parallel across the 16 cores of my single CPU
socket. If I understand correctly, parallelism in Spark is achieved by
partitioning the dataset into some number of partitions, approximately 3-4 times
the number of cores in the system. To partition the data, I am calling
data.repartition(npart), where npart is the number of partitions (16*4=64 in my
case).

I run the code as follows: 
spark-submit --master local[16] --class "logreg"
target/scala-2.10/logistic-regression_2.10-1.0.2.jar  72

However, I do not observe any speedup compared to when I just use one
partition. I would much appreciate your help understanding what I am doing
wrong and why I am not seeing any speedup due to 16 cores. Please find my
code below.

Best,
Mike

*CODE*
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

object logreg {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("logreg")
    val sc = new SparkContext(conf)
    val npart = args(0).toInt
    val data_ = MLUtils.loadLibSVMFile(sc, "rcv1_train.binary.0label").cache()
    val data = data_.repartition(npart) // partition dataset into "npart" partitions
    val lambda = 1.0 / data.count()
    val splits = data.randomSplit(Array(1.0, 0.0), seed = 11L)
    val training = splits(0).map(x => (x.label,
      MLUtils.appendBias(x.features))).cache()
    val numFeatures = data.take(1)(0).features.size
    val start = System.currentTimeMillis
    val initialWeightsWithIntercept = Vectors.dense(new Array[Double](numFeatures + 1))
    val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
      training,
      new LogisticGradient(),
      new SquaredL2Updater(),
      10,     // numCorrections
      1e-14,  // convergence tolerance
      100,    // max iterations
      lambda,
      initialWeightsWithIntercept)
    val took = (System.currentTimeMillis - start) / 1000.0
    println("LBFGS.runLBFGS: " + took + "s")
    sc.stop()
  }
}
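
A small sketch of one thing to check (assumptions: same input path, and the
four-argument loadLibSVMFile overload that takes numFeatures and minPartitions):
asking for the partitions up front avoids the extra shuffle that repartition()
introduces, and printing the partition count confirms how many tasks local[16]
can actually run concurrently.

// -1 lets the number of features be inferred; 64 is the requested minimum
// number of partitions
val data = MLUtils.loadLibSVMFile(sc, "rcv1_train.binary.0label", -1, 64).cache()
println("partitions = " + data.partitions.length)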





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parallel-execution-on-one-node-tp21052.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD Moving Average

2015-01-08 Thread Tobias Pfeiffer
Hi,

On Wed, Jan 7, 2015 at 9:47 AM, Asim Jalis  wrote:

> One approach I was considering was to use mapPartitions. It is
> straightforward to compute the moving average over a partition, except for
> near the end point. Does anyone see how to fix that?
>

Well, I guess this is not a perfect use case for mapPartitions, in
particular since you would have to implement the behavior near the
beginning and end of a partition yourself. I would rather go with the
high-level RDD functions that are partition-independent.

By the way, I am now also trying to implement sliding windows based on
count and embedded timestamp... seems like I should have had a look at
rdd.sliding() before...
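
For reference, a sketch of that sliding-window route (assumptions: values is an
RDD[Double] already in the order the average should follow, and the DeveloperApi
sliding() in mllib's RDDFunctions is available):

import org.apache.spark.mllib.rdd.RDDFunctions._

val window = 3
// each element of movingAvg is the mean of `window` consecutive values
val movingAvg = values.sliding(window).map(w => w.sum / window)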

Tobias


Re: Failed to save RDD as text file to local file system

2015-01-08 Thread VISHNU SUBRAMANIAN
It looks like it is trying to save the file to HDFS.

Check whether you have set any Hadoop path (e.g. HADOOP_CONF_DIR / fs.defaultFS) on your system.

On Fri, Jan 9, 2015 at 12:14 PM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:

> Can you check permissions etc as I am able to run
> r.saveAsTextFile("file:///home/cloudera/tmp/out1") successfully on my
> machine..
>
> On Fri, Jan 9, 2015 at 10:25 AM, NingjunWang 
> wrote:
>
>> I try to save RDD as text file to local file system (Linux) but it does
>> not
>> work
>>
>> Launch spark-shell and run the following
>>
>> val r = sc.parallelize(Array("a", "b", "c"))
>> r.saveAsTextFile("file:///home/cloudera/tmp/out1")
>>
>>
>> IOException: Mkdirs failed to create
>>
>> file:/home/cloudera/tmp/out1/_temporary/0/_temporary/attempt_201501082027_0003_m_00_47
>> (exists=false, cwd=file:/var/run/spark/work/app-20150108201046-0021/0)
>> at
>>
>> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
>> at
>>
>> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
>> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
>> at
>>
>> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>> at
>> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>> at
>>
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
>> at
>>
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>> I also try with 4 slash but still get the same error
>> r.saveAsTextFile("file:home/cloudera/tmp/out1")
>>
>> Please advise
>>
>> Ningjun
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Failed-to-save-RDD-as-text-file-to-local-file-system-tp21050.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: skipping header from each file

2015-01-08 Thread Akhil Das
Did you try something like:

val file = sc.textFile("/home/akhld/sigmoid/input")

val skipped = file.filter(row => !row.contains("header"))

skipped.take(10).foreach(println)
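
An alternative sketch (assumption: the three files share an identical header
line), which avoids hard-coding the word "header":

val rdd = sc.textFile("file1,file2,file3")
val header = rdd.first()              // header line of the first file
val data = rdd.filter(_ != header)    // drops the identical header row from every file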

Thanks
Best Regards

On Fri, Jan 9, 2015 at 11:48 AM, Hafiz Mujadid 
wrote:

> Suppose I give three files paths to spark context to read and each file has
> schema in first row. how can we skip schema lines from headers
>
>
> val rdd=sc.textFile("file1,file2,file3");
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/skipping-header-from-each-file-tp21051.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Failed to save RDD as text file to local file system

2015-01-08 Thread Raghavendra Pandey
Can you check permissions etc.? I am able to run
r.saveAsTextFile("file:///home/cloudera/tmp/out1")
successfully on my machine.

On Fri, Jan 9, 2015 at 10:25 AM, NingjunWang 
wrote:

> I try to save RDD as text file to local file system (Linux) but it does not
> work
>
> Launch spark-shell and run the following
>
> val r = sc.parallelize(Array("a", "b", "c"))
> r.saveAsTextFile("file:///home/cloudera/tmp/out1")
>
>
> IOException: Mkdirs failed to create
>
> file:/home/cloudera/tmp/out1/_temporary/0/_temporary/attempt_201501082027_0003_m_00_47
> (exists=false, cwd=file:/var/run/spark/work/app-20150108201046-0021/0)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
> at
>
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
> at
> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
> at
>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
> at
>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
> I also try with 4 slash but still get the same error
> r.saveAsTextFile("file:home/cloudera/tmp/out1")
>
> Please advise
>
> Ningjun
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Failed-to-save-RDD-as-text-file-to-local-file-system-tp21050.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Raghavendra Pandey
I have a similar requirement: I want to push Avro data into Parquet. But it seems
you have to do it on your own. There is the parquet-mr project, which uses Hadoop
to do so. I am trying to write a Spark job to do something similar.
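
A rough sketch of that parquet-mr route (assumptions: the parquet-avro artifact is
on the classpath and still exposes the older parquet.avro.AvroParquetWriter
constructor taking a Path and an Avro Schema; records, schema and path are
placeholders). The Avro schema ends up in the Parquet footer metadata, which is
what the original question asks for.

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter

def writeAvroToParquet(records: Iterator[GenericRecord],
                       schema: org.apache.avro.Schema,
                       path: String): Unit = {
  // the writer stores the Avro schema alongside the Parquet data it writes
  val writer = new AvroParquetWriter[GenericRecord](new Path(path), schema)
  try records.foreach(r => writer.write(r)) finally writer.close()
}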

On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam  wrote:

> Hi spark users,
>
> I'm using spark SQL to create parquet files on HDFS. I would like to store
> the avro schema into the parquet meta so that non spark sql applications
> can marshall the data without avro schema using the avro parquet reader.
> Currently, schemaRDD.saveAsParquetFile does not allow to do that. Is there
> another API that allows me to do this?
>
> Best Regards,
>
> Jerry
>


skipping header from each file

2015-01-08 Thread Hafiz Mujadid
Suppose I give three file paths to the Spark context to read, and each file has
its schema in the first row. How can we skip the schema/header line of each file?


val rdd=sc.textFile("file1,file2,file3");



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/skipping-header-from-each-file-tp21051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Did anyone tried overcommit of CPU cores?

2015-01-08 Thread Xuelin Cao
Hi,

  I'm wondering whether it is a good idea to overcommit CPU cores on
the Spark cluster.

  For example, in our testing cluster, each worker machine has 24
physical CPU cores. However, we are allowed to set the CPU core number to
48 or more in the Spark configuration file. As a result, we are allowed to
launch more tasks than the number of physical CPU cores.

  The motivation for overcommitting CPU cores is that a task often
cannot consume 100% of a single CPU core (due to I/O, shuffle, etc.).

  So overcommitting the CPU cores allows more tasks to run at the same
time and makes the resources be used more economically.

  But is there any reason we should not do this? Has anyone tried it?



Re: Getting Output From a Cluster

2015-01-08 Thread Su She
Yes, I am calling the saveAsHadoopFiles on the Dstream. However, when I
call print on the Dstream it works? If I had to do foreachRDD to
saveAsHadoopFile, then why is it working for print?

Also, if I am doing foreachRDD, do I need connections, or can I simply put
the saveAsHadoopFiles inside the foreachRDD function?

Thanks Yana for the help! I will play around with foreachRDD and convey my
results.

Suhas Shekar

University of California, Los Angeles
B.A. Economics, Specialization in Computing 2014

On Thu, Jan 8, 2015 at 6:06 PM, Yana Kadiyska 
wrote:

> are you calling the saveAsText files on the DStream --looks like it? Look
> at the section called "Design Patterns for using foreachRDD" in the link
> you sent -- you want to do  dstream.foreachRDD(rdd => rdd.saveAs)
>
> On Thu, Jan 8, 2015 at 5:20 PM, Su She  wrote:
>
>> Hello Everyone,
>>
>> Thanks in advance for the help!
>>
>> I successfully got my Kafka/Spark WordCount app to print locally.
>> However, I want to run it on a cluster, which means that I will have to
>> save it to HDFS if I want to be able to read the output.
>>
>> I am running Spark 1.1.0, which means according to this document:
>> https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html
>>
>> I should be able to use commands such as saveAsText/HadoopFiles.
>>
>> 1) When I try saveAsTextFiles it says:
>> cannot find symbol
>> [ERROR] symbol  : method
>> saveAsTextFiles(java.lang.String,java.lang.String)
>> [ERROR] location: class
>> org.apache.spark.streaming.api.java.JavaPairDStream
>>
>> This makes some sense as saveAsTextFiles is not included here:
>>
>> http://people.apache.org/~tdas/spark-1.1.0-temp-docs/api/java/org/apache/spark/streaming/api/java/JavaPairDStream.html
>>
>> 2) When I try
>> saveAsHadoopFiles("hdfs://ipus-west-1.compute.internal:8020/user/testwordcount",
>> "txt") it builds, but when I try running it it throws this exception:
>>
>> Exception in thread "main" java.lang.RuntimeException:
>> java.lang.RuntimeException: class scala.runtime.Nothing$ not
>> org.apache.hadoop.mapred.OutputFormat
>> at
>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2079)
>> at
>> org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:712)
>> at
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1021)
>> at
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
>> at
>> org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:632)
>> at
>> org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:630)
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>> at scala.util.Try$.apply(Try.scala:161)
>> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
>> at
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:724)
>> Caused by: java.lang.RuntimeException: class scala.runtime.Nothing$ not
>> org.apache.hadoop.mapred.OutputFormat
>> at
>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2073)
>> ... 14 more
>>
>>
>> Any help is really appreciated! Thanks.
>>
>>
>


Failed to save RDD as text file to local file system

2015-01-08 Thread NingjunWang
I try to save RDD as text file to local file system (Linux) but it does not
work

Launch spark-shell and run the following

val r = sc.parallelize(Array("a", "b", "c"))
r.saveAsTextFile("file:///home/cloudera/tmp/out1")


IOException: Mkdirs failed to create
file:/home/cloudera/tmp/out1/_temporary/0/_temporary/attempt_201501082027_0003_m_00_47
(exists=false, cwd=file:/var/run/spark/work/app-20150108201046-0021/0)
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at
org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


I also try with 4 slash but still get the same error
r.saveAsTextFile("file:home/cloudera/tmp/out1")

Please advise

Ningjun




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Failed-to-save-RDD-as-text-file-to-local-file-system-tp21050.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Cannot save RDD as text file to local file system

2015-01-08 Thread Wang, Ningjun (LNG-NPV)
I try to save RDD as text file to local file system (Linux) but it does not work

Launch spark-shell and run the following

val r = sc.parallelize(Array("a", "b", "c"))
r.saveAsTextFile("file:///home/cloudera/tmp/out1")


IOException: Mkdirs failed to create
file:/home/cloudera/tmp/out1/_temporary/0/_temporary/attempt_201501082027_0003_m_00_47
(exists=false, cwd=file:/var/run/spark/work/app-20150108201046-0021/0)
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at 
org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


I also try with 4 slash but still get the same error
r.saveAsTextFile("file:home/cloudera/tmp/out1")

Please advise

Ningjun



[ANNOUNCE] Apache Science and Healthcare Track @ApacheCon NA 2015

2015-01-08 Thread Lewis John Mcgibbney
Hi Folks,

Apologies for cross posting :(

As some of you may already know, @ApacheCon NA 2015 is happening in Austin,
TX April 13th-16th.

This email is specifically written to attract all folks interested in
Science and Healthcare... this is an official call to arms! I am aware that
there are many Science and Healthcare-type people also lingering in the
Apache Semantic Web communities so this one is for all of you folks as well.

Over a number of years the Science track has been emerging as an attractive
and exciting, at times mind blowing non-traditional track running alongside
the resident HTTP server, Big Data, etc tracks. The Semantic Web Track is
another such emerging track which has proved popular. This year we want to
really get the message out there about how much Apache technology is
actually being used in Science and Healthcare. This is not *only* aimed at
attracting members of the communities below

but also at potentially attracting a brand new breed of conference
participants to ApacheCon  and
the Foundation e.g. Scientists who love Apache. We are looking for
exciting, invigorating, obscure, half-baked, funky, academic, practical and
impractical stories, use cases, experiments and down right successes alike
from within the Science domain. The only thing they need to have in common
is that they consume, contribute towards, advocate, disseminate or even
commercialize Apache technology within the Scientific domain and would be
relevant to that audience. It is fully open to interest whether this track
be combined with the proposed *healthcare track*... if there is interest to
do this then we can rename this track to Science and Healthcare. In essence
one could argue that they are one and the same however I digress :)

What I would like those of you that are interested to do, is to merely
check out the scope and intent of the Apache in Science content curation
which is currently ongoing and to potentially register your interest.

https://wiki.apache.org/apachecon/ACNA2015ContentCommittee#Apache_in_Science

I would love to see the Science and Healthcare track be THE BIGGEST track
@ApacheCon, and although we have some way to go, I'm sure many previous
track participants will tell you this is not to be missed.

We are looking for content from a wide variety of Scientific use cases all
related to Apache technology.
Thanks in advance and I look forward to seeing you in Austin.
Lewis

-- 
*Lewis*


Re: Getting Output From a Cluster

2015-01-08 Thread Yana Kadiyska
are you calling the saveAsText files on the DStream --looks like it? Look
at the section called "Design Patterns for using foreachRDD" in the link
you sent -- you want to do  dstream.foreachRDD(rdd => rdd.saveAs)
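
A minimal sketch of that pattern (the stream name and HDFS path are made up; the
point is that the save is issued per micro-batch, on the RDD rather than on the
DStream):

// assume wordCounts is the DStream produced by the word-count job
wordCounts.foreachRDD { (rdd, time) =>
  rdd.saveAsTextFile("hdfs://namenode:8020/user/test/wordcount-" + time.milliseconds)
}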

On Thu, Jan 8, 2015 at 5:20 PM, Su She  wrote:

> Hello Everyone,
>
> Thanks in advance for the help!
>
> I successfully got my Kafka/Spark WordCount app to print locally. However,
> I want to run it on a cluster, which means that I will have to save it to
> HDFS if I want to be able to read the output.
>
> I am running Spark 1.1.0, which means according to this document:
> https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html
>
> I should be able to use commands such as saveAsText/HadoopFiles.
>
> 1) When I try saveAsTextFiles it says:
> cannot find symbol
> [ERROR] symbol  : method saveAsTextFiles(java.lang.String,java.lang.String)
> [ERROR] location: class
> org.apache.spark.streaming.api.java.JavaPairDStream
>
> This makes some sense as saveAsTextFiles is not included here:
>
> http://people.apache.org/~tdas/spark-1.1.0-temp-docs/api/java/org/apache/spark/streaming/api/java/JavaPairDStream.html
>
> 2) When I try
> saveAsHadoopFiles("hdfs://ipus-west-1.compute.internal:8020/user/testwordcount",
> "txt") it builds, but when I try running it it throws this exception:
>
> Exception in thread "main" java.lang.RuntimeException:
> java.lang.RuntimeException: class scala.runtime.Nothing$ not
> org.apache.hadoop.mapred.OutputFormat
> at
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2079)
> at
> org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:712)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1021)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
> at
> org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:632)
> at
> org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:630)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
> Caused by: java.lang.RuntimeException: class scala.runtime.Nothing$ not
> org.apache.hadoop.mapred.OutputFormat
> at
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2073)
> ... 14 more
>
>
> Any help is really appreciated! Thanks.
>
>


Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
I ran this with CDH 5.2 without a problem (sorry don't have 5.3
readily available at the moment):

$ HBASE='/opt/cloudera/parcels/CDH/lib/hbase/\*'
$ spark-submit --driver-class-path $HBASE --conf
"spark.executor.extraClassPath=$HBASE" --master yarn --class
org.apache.spark.examples.HBaseTest
/opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples-1.1.0-cdh5.2.2-SNAPSHOT-hadoop2.5.0-cdh5.2.2-SNAPSHOT.jar
t1

Seems to me like you still have some wrong Spark build somewhere in
your environment getting in the way.


On Thu, Jan 8, 2015 at 4:15 PM, freedafeng  wrote:
> I ran the release spark in cdh5.3.0 but got the same error. Anyone tried to
> run spark in cdh5.3.0 using its newAPIHadoopRDD?
>
> command:
> spark-submit --master spark://master:7077 --jars
> /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar
> ./sparkhbase.py
>
> Error.
>
> 2015-01-09 00:02:03,344 INFO  [Thread-2-SendThread(master:2181)]
> zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket
> connection established to master/10.191.41.253:2181, initiating session
> 2015-01-09 00:02:03,358 INFO  [Thread-2-SendThread(master:2181)]
> zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session
> establishment complete on server master/10.191.41.253:2181, sessionid =
> 0x14acbdae7e60066, negotiated timeout = 6
> Traceback (most recent call last):
>   File "/root/workspace/test/./sparkhbase.py", line 23, in 
> conf=conf2)
>   File
> "/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/pyspark/context.py",
> line 530, in newAPIHadoopRDD
> jconf, batchSize)
>   File
> "/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File
> "/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
> : java.lang.IncompatibleClassChangeError: Found interface
> org.apache.hadoop.mapreduce.JobContext, but class was expected
> at
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:158)
> at 
> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
> at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
> at
> org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:202)
> at
> org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:500)
> at 
> org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/correct-best-way-to-install-custom-spark1-2-on-cdh5-3-0-tp21045p21047.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Marcelo Vanzin
On Thu, Jan 8, 2015 at 4:09 PM, Manoj Samel  wrote:
> Some old communication (Oct 14) says Spark is not certified with Kerberos.
> Can someone comment on this aspect ?

Spark standalone doesn't support kerberos. Spark running on top of
Yarn works fine with kerberos.

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
I ran the release spark in cdh5.3.0 but got the same error. Anyone tried to
run spark in cdh5.3.0 using its newAPIHadoopRDD? 

command: 
spark-submit --master spark://master:7077 --jars
/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar
./sparkhbase.py

Error.

2015-01-09 00:02:03,344 INFO  [Thread-2-SendThread(master:2181)]
zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket
connection established to master/10.191.41.253:2181, initiating session
2015-01-09 00:02:03,358 INFO  [Thread-2-SendThread(master:2181)]
zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session
establishment complete on server master/10.191.41.253:2181, sessionid =
0x14acbdae7e60066, negotiated timeout = 6
Traceback (most recent call last):
  File "/root/workspace/test/./sparkhbase.py", line 23, in 
conf=conf2)
  File
"/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/pyspark/context.py",
line 530, in newAPIHadoopRDD
jconf, batchSize)
  File
"/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File
"/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.JobContext, but class was expected
at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:158)
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
at
org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:202)
at
org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:500)
at 
org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/correct-best-way-to-install-custom-spark1-2-on-cdh5-3-0-tp21045p21047.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Manoj Samel
Please ignore the keytab question for now; it wasn't fully described.

Some old communication (Oct 14) says Spark is not certified with Kerberos.
Can someone comment on this aspect ?

On Thu, Jan 8, 2015 at 3:53 PM, Marcelo Vanzin  wrote:

> Hi Manoj,
>
> As long as you're logged in (i.e. you've run kinit), everything should
> just work. You can run "klist" to make sure you're logged in.
>
> On Thu, Jan 8, 2015 at 3:49 PM, Manoj Samel 
> wrote:
> > Hi,
> >
> > For running spark 1.2 on Hadoop cluster with Kerberos, what spark
> > configurations are required?
> >
> > Using existing keytab, can any examples be submitted to the secured
> cluster
> > ? How?
> >
> > Thanks,
>
>
>
> --
> Marcelo
>


Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Marcelo Vanzin
Hi Manoj,

As long as you're logged in (i.e. you've run kinit), everything should
just work. You can run "klist" to make sure you're logged in.

On Thu, Jan 8, 2015 at 3:49 PM, Manoj Samel  wrote:
> Hi,
>
> For running spark 1.2 on Hadoop cluster with Kerberos, what spark
> configurations are required?
>
> Using existing keytab, can any examples be submitted to the secured cluster
> ? How?
>
> Thanks,



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Manoj Samel
Hi,

For running spark 1.2 on Hadoop cluster with Kerberos, what spark
configurations are required?

Using existing keytab, can any examples be submitted to the secured cluster
? How?

Thanks,


Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
On Thu, Jan 8, 2015 at 3:33 PM, freedafeng  wrote:
> I installed the custom as a standalone mode as normal. The master and slaves
> started successfully.
> However, I got error when I ran a job. It seems to me from the error message
> the some library was compiled against hadoop1, but my spark was compiled
> against hadoop2.

Is that using your build or the CDH build?

It seems you have the wrong HBase dependency. I'd be surprised if the
CDH build had that problem, since IIRC we don't even build HBase for
hadoop 1 anymore.

Take a look at examples/pom.xml in Spark for an example, which has
different profiles for dealing with the HBase builds for hadoop1 and
hadoop2.

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: JavaRDD (Data Aggregation) based on key

2015-01-08 Thread Rishi Yadav
One approach is to first transform this RDD into a PairRDD, taking the field
you are going to aggregate on as the key.
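
A sketch of that approach in Scala (the question is about the Java API, but the
shape is the same; the path and field positions are made up): key by the chosen
field, then reduce/aggregate by key.

import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x

val rows  = sc.textFile("data.csv").map(_.split(","))
val pairs = rows.map(f => (f(0), f(2).toDouble))      // key = field a, value = field c
val sums  = pairs.reduceByKey(_ + _)                  // sum of c per value of a
val avgs  = pairs.mapValues(v => (v, 1L))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  .mapValues { case (s, n) => s / n }                 // average of c per value of a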

On Tue, Dec 23, 2014 at 1:47 AM, sachin Singh 
wrote:

> Hi,
> I have a csv file having fields as a,b,c .
> I want to do aggregation(sum,average..) based on any field(a,b or c) as per
> user input,
> using Apache Spark Java API,Please Help Urgent!
>
> Thanks in advance,
>
> Regards
> Sachin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/JavaRDD-Data-Aggregation-based-on-key-tp20828.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Profiling a spark application.

2015-01-08 Thread Rishi Yadav
As per my understanding, RDDs do not get replicated; the underlying data does if
it's in HDFS.

On Thu, Dec 25, 2014 at 9:04 PM, rapelly kartheek 
wrote:

> Hi,
>
> I want to find the time taken for replicating an rdd in spark cluster
> along with the computation time on the replicated rdd.
>
> Can someone please suggest some ideas?
>
> Thank you
>


Re: Problem with StreamingContext - getting SPARK-2243

2015-01-08 Thread Rishi Yadav
You can also access the SparkConf using sc.getConf in the Spark shell, though for
the StreamingContext you can directly refer to sc, as Akhil suggested.

On Sun, Dec 28, 2014 at 12:13 AM, Akhil Das 
wrote:

> In the shell you could do:
>
> val ssc = new StreamingContext(sc, Seconds(1))
>
> as *sc* is the SparkContext, which is already instantiated.
>
> Thanks
> Best Regards
>
> On Sun, Dec 28, 2014 at 6:55 AM, Thomas Frisk  wrote:
>
>> Yes you are right - thanks for that :)
>>
>> On 27 December 2014 at 23:18, Ilya Ganelin  wrote:
>>
>>> Are you trying to do this in the shell? Shell is instantiated with a
>>> spark context named sc.
>>>
>>> -Ilya Ganelin
>>>
>>> On Sat, Dec 27, 2014 at 5:24 PM, tfrisk  wrote:
>>>

 Hi,

 Doing:
val ssc = new StreamingContext(conf, Seconds(1))

 and getting:
Only one SparkContext may be running in this JVM (see SPARK-2243). To
 ignore this error, set spark.driver.allowMultipleContexts = true.


 But I dont think that I have another SparkContext running. Is there any
 way
 I can check this or force kill ?  I've tried restarting the server as
 I'm
 desperate but still I get the same issue.  I was not getting this
 earlier
 today.

 Any help much appreciated .

 Thanks,

 Thomas




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-StreamingContext-getting-SPARK-2243-tp20869.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
I installed the custom build in standalone mode as normal. The master and slaves
started successfully.
However, I got an error when I ran a job. From the error message, it seems that
some library was compiled against hadoop1, while my Spark was compiled
against hadoop2.

15/01/08 23:27:36 INFO ClientCnxn: Opening socket connection to server
master/10.191.41.253:2181. Will not attempt to authenticate using SASL
(unknown error)
15/01/08 23:27:36 INFO ClientCnxn: Socket connection established to
master/10.191.41.253:2181, initiating session
15/01/08 23:27:36 INFO ClientCnxn: Session establishment complete on server
master/10.191.41.253:2181, sessionid = 0x14acbdae7e60022, negotiated timeout
= 6
Traceback (most recent call last):
  File "/root/workspace/test/sparkhbase.py", line 23, in 
conf=conf2)
  File "/root/spark/python/pyspark/context.py", line 530, in newAPIHadoopRDD
jconf, batchSize)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.JobContext, but class was expected
at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:157)
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
at
org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:202)
at
org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:500)
at 
org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

If I understand correctly, org.apache.hadoop.mapreduce.JobContext is a class in
hadoop1 but an interface in hadoop2. My question is which
library could be causing this problem.

Thanks.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/correct-best-way-to-install-custom-spark1-2-on-cdh5-3-0-tp21045p21046.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
Disclaimer: CDH questions are better handled at cdh-us...@cloudera.org.

But the question I'd like to ask is: why do you need your own Spark
build? What's wrong with CDH's Spark that it doesn't work for you?

On Thu, Jan 8, 2015 at 3:01 PM, freedafeng  wrote:
> Could anyone come up with your experience on how to do this?
>
> I have created a cluster and installed cdh5.3.0 on it with basically core +
> Hbase. but cloudera installed and configured the spark in its parcels
> anyway. I'd like to install our custom spark on this cluster to use the
> hadoop and hbase service there. There could be potentially conflicts if this
> is not done correctly. Library conflicts are what I worry most.
>
> I understand this is a special case. but if you know how to do it, please
> let me know. Thanks.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/correct-best-way-to-install-custom-spark1-2-on-cdh5-3-0-tp21045.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
Could anyone share their experience on how to do this?

I have created a cluster and installed CDH 5.3.0 on it with basically core +
HBase, but Cloudera installed and configured Spark in its parcels
anyway. I'd like to install our custom Spark on this cluster to use the
Hadoop and HBase services there. There could potentially be conflicts if this
is not done correctly; library conflicts are what I worry about most.

I understand this is a special case, but if you know how to do it, please
let me know. Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/correct-best-way-to-install-custom-spark1-2-on-cdh5-3-0-tp21045.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Implement customized Join for SparkSQL

2015-01-08 Thread Rishi Yadav
Hi Kevin,

Say A has 10 ids, so you are pulling data from B's data source only for
those 10 ids?

What if you load A and B as separate SchemaRDDs and then do the join? Spark
will optimize the path anyway when an action is fired.
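
A minimal sketch of that route (Spark 1.2 SchemaRDD API; the load calls and
paths are illustrative -- B would come from your data source wrapper instead):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val a = sqlContext.parquetFile("hdfs:///data/a")   // table A, loaded from a file
val b = sqlContext.jsonFile("hdfs:///data/b")      // stand-in for the B data source
a.registerTempTable("A")
b.registerTempTable("B")

val joined = sqlContext.sql("SELECT * FROM A JOIN B ON A.id = B.id")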

On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin  wrote:

>  Hi, All
>
>
>
> Suppose I want to join two tables A and B as follows:
>
>
>
> Select * from A join B on A.id = B.id
>
>
>
> A is a file while B is a database which indexed by id and I wrapped it by
> Data source API.
>
> The desired join flow is:
>
> 1.   Generate A’s RDD[Row]
>
> 2.   Generate B’s RDD[Row] from A by using A’s id and B’s data source
> api to get row from the database
>
> 3.   Merge these two RDDs to the final RDD[Row]
>
>
>
> However it seems existing join strategy doesn’t support it?
>
>
>
> Any way to achieve it?
>
>
>
> Best Regards,
>
> Kevin.
>


Re: SparkSQL

2015-01-08 Thread Marcelo Vanzin
Disclaimer: this seems more of a CDH question, I'd suggest sending
these to the CDH mailing list in the future.

CDH 5.2 actually has Spark 1.1. It comes with SparkSQL built-in, but
it does not include the thrift server because of incompatibilities
with the CDH version of Hive. To use Hive support, you'll need to
manually add Hive jars to your application's classpath, though.

CDH 5.3 (Spark 1.2) has the thrift server, if you want to use it, and
has the same limitation regarding having to mess with the classpath.

Also, SparkSQL is currently unsupported in CDH, so YMMV.

Now, if you're trying to use an Apache release of Spark against CDH,
you may run into other issues (like different Hive versions causing
problems). So be careful when doing that.


On Thu, Jan 8, 2015 at 2:24 PM, Abhi Basu <9000r...@gmail.com> wrote:
> I am working with CDH5.2 (Spark 1.0.0) and wondering which version of Spark
> comes with SparkSQL by default. Also, will SparkSQL come enabled to access
> the Hive Metastore? Is there an easier way to enable Hive support without
> have to build the code with various switches?
>
> Thanks,
>
> Abhi
>
> --
> Abhi Basu



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkSQL

2015-01-08 Thread Abhi Basu
I am working with CDH5.2 (Spark 1.0.0) and wondering which version of Spark
comes with SparkSQL by default. Also, will SparkSQL come enabled to access
the Hive Metastore? Is there an easier way to enable Hive support without
having to build the code with various switches?

Thanks,

Abhi

-- 
Abhi Basu


Getting Output From a Cluster

2015-01-08 Thread Su She
Hello Everyone,

Thanks in advance for the help!

I successfully got my Kafka/Spark WordCount app to print locally. However,
I want to run it on a cluster, which means that I will have to save it to
HDFS if I want to be able to read the output.

I am running Spark 1.1.0, which means according to this document:
https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html

I should be able to use commands such as saveAsText/HadoopFiles.

1) When I try saveAsTextFiles it says:
cannot find symbol
[ERROR] symbol  : method saveAsTextFiles(java.lang.String,java.lang.String)
[ERROR] location: class
org.apache.spark.streaming.api.java.JavaPairDStream

This makes some sense as saveAsTextFiles is not included here:
http://people.apache.org/~tdas/spark-1.1.0-temp-docs/api/java/org/apache/spark/streaming/api/java/JavaPairDStream.html

2) When I try
saveAsHadoopFiles("hdfs://ipus-west-1.compute.internal:8020/user/testwordcount",
"txt") it builds, but when I try running it it throws this exception:

Exception in thread "main" java.lang.RuntimeException:
java.lang.RuntimeException: class scala.runtime.Nothing$ not
org.apache.hadoop.mapred.OutputFormat
at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2079)
at
org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:712)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1021)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at
org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:632)
at
org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:630)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.RuntimeException: class scala.runtime.Nothing$ not
org.apache.hadoop.mapred.OutputFormat
at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2073)
... 14 more


Any help is really appreciated! Thanks.


Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Jerry Lam
Hi spark users,

I'm using Spark SQL to create Parquet files on HDFS. I would like to store
the Avro schema in the Parquet metadata so that non-Spark-SQL applications
can marshal the data without the Avro schema, using the Avro Parquet reader.
Currently, schemaRDD.saveAsParquetFile does not allow me to do that. Is there
another API that allows me to do this?

Best Regards,

Jerry


Is the Thrift server right for me?

2015-01-08 Thread sjbrunst
I'm building a system that collects data using Spark Streaming, does some
processing with it, then saves the data. I want the data to be queried by
multiple applications, and it sounds like the Thrift JDBC/ODBC server might
be the right tool to handle the queries. However, the documentation for the
Thrift server seems to be written for Hive users who are moving to Spark. I never used
Hive before I started using Spark, so it is not clear to me how best to use
this.

I've tried putting data into Hive, then serving it with the Thrift server.
But I have not been able to update the data in Hive without first shutting
down the server. This is a problem because new data is always being streamed
in, and so the data must continuously be updated.

The system I'm building is supposed to replace a system that stores the data
in MongoDB. The dataset has now grown so large that the database index does
not fit in memory, which causes major performance problems in MongoDB.

If the Thrift server is the right tool for me, how can I set it up for my
application? If it is not the right tool, what else can I use?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-the-Thrift-server-right-for-me-tp21044.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Standalone Cluster not correctly configured

2015-01-08 Thread Josh Rosen
Can you please file a JIRA issue for this?  This will make it easier to
triage this issue.

https://issues.apache.org/jira/browse/SPARK

Thanks,
Josh

On Thu, Jan 8, 2015 at 2:34 AM, frodo777 
wrote:

> Hello everyone.
>
> With respect to the configuration problem that I explained before 
>
> Do you have any idea what is wrong there?
>
> The problem in a nutshell:
> - When more than one master is started in the cluster, all of them are
> scheduling independently, thinking they are all leaders.
> - zookeeper configuration seems to be correct, only one leader is reported.
> The remaining master nodes are followers.
> - Default /spark directory is used for zookeeper.
>
> Thanks a lot.
> -Bob
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Standalone-Cluster-not-correctly-configured-tp20909p21029.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Discrepancy in PCA values

2015-01-08 Thread Xiangrui Meng
The Julia code is computing the SVD of the Gram matrix. PCA should be
applied to the covariance matrix. -Xiangrui
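
For reference, a minimal MLlib sketch (rows is assumed to be an RDD[Vector] of
the Iris features):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)                // rows: RDD[Vector]
val pc = mat.computePrincipalComponents(2)   // principal components of the covariance matrix
val projected = mat.multiply(pc)             // data projected onto the top-2 components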

On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara  wrote:
> Hi All,
>
> I tried to do PCA for the Iris dataset
> [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib
> [http://spark.apache.org/docs/1.1.1/mllib-dimensionality-reduction.html].
> Also, PCA  was calculated in Julia using following method:
>
> Sigma = (1/numRow(X))*X'*X ;
> [U, S, V] = svd(Sigma);
> Ureduced = U(:, 1:k);
> Z = X*Ureduced;
>
> However, I'm seeing a little difference between values given by MLLib and
> the method shown above .
>
> Does anyone have any idea about this difference?
>
> Additionally, I have attached two visualizations, related to two approaches.
>
> Thanks,
> Upul
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2015-01-08 Thread Aaron Davidson
Do note that this problem may be fixed in Spark 1.2, as we changed the
default transfer service to use a Netty-based one rather than the
ConnectionManager.

On Thu, Jan 8, 2015 at 7:05 AM, Spidy  wrote:

> Hi,
>
> Can you please explain which settings did you changed?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-ConnectionManager-Corresponding-SendingConnection-to-ConnectionManagerId-tp17050p21035.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Initial State of updateStateByKey

2015-01-08 Thread Asim Jalis
In Spark Streaming, is there a way to initialize the state
of updateStateByKey before it starts processing RDDs? I noticed that there
is an overload of updateStateByKey that takes an initialRDD in the latest
sources (although not in the 1.2.0 release). Is there another way to do
this until this feature is released?
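
One rough workaround sketch until then (not the built-in overload): push the
initial state through the stream once via a queueStream and union it with the
live input, so that updateStateByKey sees the seed values in its first batch.
The names (liveStream, initialStateRDD) and the Int running-sum state are
assumptions for illustration:

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

def seededCounts(ssc: StreamingContext,
                 liveStream: DStream[(String, Int)],
                 initialStateRDD: RDD[(String, Int)]): DStream[(String, Int)] = {
  // Emit the initial state exactly once, in the first batch.
  val seedOnce = ssc.queueStream(mutable.Queue(initialStateRDD), oneAtATime = true)
  (liveStream union seedOnce).updateStateByKey { (values: Seq[Int], state: Option[Int]) =>
    Some(state.getOrElse(0) + values.sum)   // example update: running sum per key
  }
}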


Re: Data locality running Spark on Mesos

2015-01-08 Thread Tim Chen
How did you run this benchmark, and is there an open version I can try it
with?

And what are your configurations, like spark.locality.wait, etc.?
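
For context, spark.locality.wait is just a SparkConf entry; a minimal sketch
with an illustrative value:

val conf = new org.apache.spark.SparkConf()
  .set("spark.locality.wait", "3000")   // ms to wait for a more-local task slot before falling back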

Tim

On Thu, Jan 8, 2015 at 11:44 AM, mvle  wrote:

> Hi,
>
> I've noticed running Spark apps on Mesos is significantly slower compared
> to
> stand-alone or Spark on YARN.
> I don't think it should be the case, so I am posting the problem here in
> case someone has some explanation
> or can point me to some configuration options i've missed.
>
> I'm running the LinearRegression benchmark with a dataset of 48.8GB.
> On a 10-node stand-alone Spark cluster (each node 4-core, 8GB of RAM),
> I can finish the workload in about 5min (I don't remember exactly).
> The data is loaded into HDFS spanning the same 10-node cluster.
> There are 6 worker instances per node.
>
> However, when running the same workload on the same cluster but now with
> Spark on Mesos (course-grained mode), the execution time is somewhere
> around
> 15min. Actually, I tried with find-grained mode and giving each Mesos node
> 6
> VCPUs (to hopefully get 6 executors like the stand-alone test), I still get
> roughly 15min.
>
> I've noticed that when Spark is running on Mesos, almost all tasks execute
> with locality NODE_LOCAL (even in Mesos in coarse-grained mode). On
> stand-alone, the locality is mostly PROCESS_LOCAL.
>
> I think this locality issue might be the reason for the slow down but I
> can't figure out why, especially for coarse-grained mode as the executors
> supposedly do not go away until job completion.
>
> Any ideas?
>
> Thanks,
> Mike
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Data-locality-running-Spark-on-Mesos-tp21041.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark on teradata?

2015-01-08 Thread Evan R. Sparks
Have you taken a look at the TeradataDBInputFormat? Spark is compatible
with arbitrary hadoop input formats - so this might work for you:
http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw
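
For reference, a generic sketch of wiring a Hadoop InputFormat into Spark.
TextInputFormat is used here only as a runnable stand-in; the Teradata
connector's own InputFormat, key/value classes and configuration would replace
it (check the connector's docs for the real names):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val rdd = sc.newAPIHadoopFile(
  "hdfs:///tmp/sample.txt",   // a DB-backed format would be configured via a Configuration instead
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])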

On Thu, Jan 8, 2015 at 10:53 AM, gen tang  wrote:

> Thanks a lot for your reply.
> In fact, I need to work on almost all the data in teradata (~100T). So, I
> don't think that jdbcRDD is a good choice.
>
> Cheers
> Gen
>
>
> On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin  wrote:
>
>> Depending on your use cases. If the use case is to extract small amount
>> of data out of teradata, then you can use the JdbcRDD and soon a jdbc input
>> source based on the new Spark SQL external data source API.
>>
>>
>>
>> On Wed, Jan 7, 2015 at 7:14 AM, gen tang  wrote:
>>
>>> Hi,
>>>
>>> I have a stupid question:
>>> Is it possible to use spark on Teradata data warehouse, please? I read
>>> some news on internet which say yes. However, I didn't find any example
>>> about this issue
>>>
>>> Thanks in advance.
>>>
>>> Cheers
>>> Gen
>>>
>>>
>>
>


Re: Spark Project Fails to run multicore in local mode.

2015-01-08 Thread Dean Wampler
Use local[*] instead of local to grab all available cores. Using local just
grabs one.
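
A minimal sketch of the change (the app name is just illustrative):

val sparkConf = new org.apache.spark.SparkConf()
  .setAppName("Space Saving Project")
  .setMaster("local[*]")   // use all available cores instead of a single one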

Dean

On Thursday, January 8, 2015, mixtou  wrote:

> I am new to Apache Spark, now i am trying my first project "Space Saving
> Counting Algorithm" and while it compiles in single core using
> .setMaster("local") it fails when using .setMaster("local[4]") or any
> number>1. My Code follows:
> = import
> org.apache.spark.{SparkConf, SparkContext} /** * Created by mixtou on
> 30/12/14. */ object SpaceSaving { var frequent_words_counters =
> scala.collection.immutable.Map[String, Array[Int]](); var guaranteed_words
> = scala.collection.immutable.Map[String, Array[Int]](); val top_k: Int =
> 100; var words_no: Int = 0; var tStart: Long = 0; var tStop: Long = 0; var
> fi: Double = 0.001; def main (args: Array[String]): Unit = { val sparkConf
> = new SparkConf().setAppName("Space Saving Project").setMaster("local");
> val ctx = new SparkContext(sparkConf); // val stopwords =
> ctx.textFile("/Users/mixtou/PhD/Courses/Databases_Advanced/Project/scala_spark_space_saving/src/main/resources/stopwords.txt");
> val lines =
> ctx.textFile("/Users/mixtou/PhD/Courses/Databases_Advanced/Project/scala_spark_space_saving/src/main/resources/README.txt",
> 2) .map(line => line.toLowerCase()); val nonEmptyLines = lines.filter(line
> => line.nonEmpty); val regex = "[,.:;'\"\\?\\-!\\(\\)\\+\\[\\]\\d+]".r; //
> val temp = nonEmptyLines.map(removeStopWords(_, stopwords)); val cleanLines
> = nonEmptyLines.map(line => regex.replaceAllIn(line, " ")); val dirtyWords
> = cleanLines.flatMap(line => line.split("\\s+")); val words =
> dirtyWords.filter(word => word.length > 3); * words.foreach(word =>
> space_saving_algorithm(word));* *ERROR HERE!*! if
> (frequent_words_counters.size > 0) { frequent_words_counters.foreach(line
> => println("Top Frequent Word: " + line._1 + " with count: " + line._2(0) +
> " end error: " + line._2(1))); } System.out.println("===
> Throughput:=> "+ 1000*(words_no/(tStop - tStart))+ " words per second. " );
> estimateGuaranteedFrequentWords(); words.collect(); ctx.stop(); } def
> space_saving_algorithm(word: String): Unit = { // System.out.println(word);
> if (frequent_words_counters.contains(word)) { * val count =
> frequent_words_counters.get(word).get(0);* *ERROR HERE* val error =
> frequent_words_counters.get(word).get(1); // System.out.println("Before: "
> + word + " <=> " + count); frequent_words_counters += word ->
> Array[Int](count+1, error); // System.out.println("After: " + word + " <=>
> " + counters.get(word).get(0)); } else { if (frequent_words_counters.size <
> top_k) { frequent_words_counters += word -> Array[Int](1, 0); //
> System.out.println("Adding Word to Counters: "+word); } else {
> replaceLeastEntry(word); } } if(words_no > 0 ){ tStop =
> java.lang.System.currentTimeMillis(); } else{ tStart =
> java.lang.System.currentTimeMillis(); } words_no += 1; } def
> replaceLeastEntry(word: String): Unit = { var temp_list =
> frequent_words_counters.toList.sortWith( (x,y) => x._2(0) > y._2(0) ); val
> word_count = temp_list.last._2(0); // System.out.println("Replacing word: "
> + temp_list.last._1 + ", having count and error " + temp_list.last._2(0)+"
> , " + temp_list.last._2(1) + " with word: "+word); //
> System.out.println(temp_list.length); temp_list =
> temp_list.take(temp_list.length - 1); frequent_words_counters =
> temp_list.toMap[String, Array[Int]]; frequent_words_counters += word ->
> Array[Int](word_count+1, word_count); } def
> estimateGuaranteedFrequentWords(): Unit = {
> frequent_words_counters.foreach{tuple => if (tuple._2(0) - tuple._2(1) <
> words_no*fi) { // counters -= tuple._1; guaranteed_words += tuple; //
> System.out.println("NEW SISZEZ: "+counters.size); } else {
> System.out.println("Guaranteed Word : "+tuple._1+" with count:
> "+tuple._2(0)+" and error: "+tuple._2(1)); } }; } } The Compiler Error is:
> == 15/01/08 16:44:51 ERROR Executor:
> Exception in task 1.0 in stage 0.0 (TID 1)
> java.util.NoSuchElementException: None.get at
> scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at
> SpaceSaving$.space_saving_algorithm(SpaceSaving.scala:52) at
> SpaceSaving$$anonfun$main$1.apply(SpaceSaving.scala:36) at
> SpaceSaving$$anonfun$main$1.apply(SpaceSaving.scala:36) at
> scala.collection.Iterator$class.foreach(Iterator.scala:727) at
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:765) at
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:765) at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at
> org.apache.spark.scheduler.Task.run(Task.scala:56) at
> org.apache.spark.executor.Ex

Data locality running Spark on Mesos

2015-01-08 Thread mvle
Hi,

I've noticed running Spark apps on Mesos is significantly slower compared to
stand-alone or Spark on YARN.
I don't think it should be the case, so I am posting the problem here in
case someone has some explanation or can point me to some configuration
options I've missed.

I'm running the LinearRegression benchmark with a dataset of 48.8GB.
On a 10-node stand-alone Spark cluster (each node 4-core, 8GB of RAM),
I can finish the workload in about 5min (I don't remember exactly).
The data is loaded into HDFS spanning the same 10-node cluster.
There are 6 worker instances per node.

However, when running the same workload on the same cluster but now with
Spark on Mesos (coarse-grained mode), the execution time is somewhere around
15min. I also tried fine-grained mode, giving each Mesos node 6
VCPUs (to hopefully get 6 executors like in the stand-alone test), and I still get
roughly 15min.

I've noticed that when Spark is running on Mesos, almost all tasks execute
with locality NODE_LOCAL (even in Mesos in coarse-grained mode). On
stand-alone, the locality is mostly PROCESS_LOCAL.

I think this locality issue might be the reason for the slow down but I
can't figure out why, especially for coarse-grained mode as the executors
supposedly do not go away until job completion.

Any ideas?

Thanks,
Mike



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Data-locality-running-Spark-on-Mesos-tp21041.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Marcelo Vanzin
Just to add to Sandy's comment, check your client configuration
(generally in /etc/spark/conf). If you're using CM, you may need to
run the "Deploy Client Configuration" command on the cluster to update
the configs to match the new version of CDH.

On Thu, Jan 8, 2015 at 11:38 AM, Sandy Ryza  wrote:
> Hi Mukesh,
>
> Those line numbers in ConverterUtils in the stack trace don't appear to line
> up with CDH 5.3:
> https://github.com/cloudera/hadoop-common/blob/cdh5-2.5.0_5.3.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java
>
> Is it possible you're still including the old jars on the classpath in some
> way?
>
> -Sandy
>
> On Thu, Jan 8, 2015 at 3:38 AM, Mukesh Jha  wrote:
>>
>> Hi Experts,
>>
>> I am running spark inside YARN job.
>>
>> The spark-streaming job is running fine in CDH-5.0.0 but after the upgrade
>> to 5.3.0 it cannot fetch containers with the below errors. Looks like the
>> container id is incorrect and a string is present in a place where it's
>> expecting a number.
>>
>>
>>
>> java.lang.IllegalArgumentException: Invalid ContainerId:
>> container_e01_1420481081140_0006_01_01
>>
>> Caused by: java.lang.NumberFormatException: For input string: "e01"
>>
>>
>>
>> Is this a bug?? Did you face something similar and any ideas how to fix
>> this?
>>
>>
>>
>> 15/01/08 09:50:28 INFO yarn.ApplicationMaster: Registered signal handlers
>> for [TERM, HUP, INT]
>>
>> 15/01/08 09:50:29 ERROR yarn.ApplicationMaster: Uncaught exception:
>>
>> java.lang.IllegalArgumentException: Invalid ContainerId:
>> container_e01_1420481081140_0006_01_01
>>
>> at
>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
>>
>> at
>> org.apache.spark.deploy.yarn.YarnRMClientImpl.getAttemptId(YarnRMClientImpl.scala:79)
>>
>> at
>> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:79)
>>
>> at
>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:515)
>>
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
>>
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
>>
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>
>> at
>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
>>
>> at
>> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:513)
>>
>> at
>> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>
>> Caused by: java.lang.NumberFormatException: For input string: "e01"
>>
>> at
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>
>> at java.lang.Long.parseLong(Long.java:441)
>>
>> at java.lang.Long.parseLong(Long.java:483)
>>
>> at
>> org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
>>
>> at
>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
>>
>> ... 11 more
>>
>> 15/01/08 09:50:29 INFO yarn.ApplicationMaster: Final app status: FAILED,
>> exitCode: 10, (reason: Uncaught exception: Invalid ContainerId:
>> container_e01_1420481081140_0006_01_01)
>>
>>
>> --
>> Thanks & Regards,
>>
>> Mukesh Jha
>
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Sandy Ryza
Hi Mukesh,

Those line numbers in ConverterUtils in the stack trace don't appear to
line up with CDH 5.3:
https://github.com/cloudera/hadoop-common/blob/cdh5-2.5.0_5.3.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java

Is it possible you're still including the old jars on the classpath in some
way?

-Sandy

On Thu, Jan 8, 2015 at 3:38 AM, Mukesh Jha  wrote:

> Hi Experts,
>
> I am running spark inside YARN job.
>
> The spark-streaming job is running fine in CDH-5.0.0 but after the upgrade
> to 5.3.0 it cannot fetch containers with the below errors. Looks like the
> container id is incorrect and a string is present in a place where it's
> expecting a number.
>
>
>
> java.lang.IllegalArgumentException: Invalid ContainerId:
> container_e01_1420481081140_0006_01_01
>
> Caused by: java.lang.NumberFormatException: For input string: "e01"
>
>
>
> Is this a bug?? Did you face something similar and any ideas how to fix
> this?
>
>
>
> 15/01/08 09:50:28 INFO yarn.ApplicationMaster: Registered signal handlers
> for [TERM, HUP, INT]
>
> 15/01/08 09:50:29 ERROR yarn.ApplicationMaster: Uncaught exception:
>
> java.lang.IllegalArgumentException: Invalid ContainerId:
> container_e01_1420481081140_0006_01_01
>
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
>
> at
> org.apache.spark.deploy.yarn.YarnRMClientImpl.getAttemptId(YarnRMClientImpl.scala:79)
>
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:79)
>
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:515)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
>
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:513)
>
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>
> Caused by: java.lang.NumberFormatException: For input string: "e01"
>
> at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>
> at java.lang.Long.parseLong(Long.java:441)
>
> at java.lang.Long.parseLong(Long.java:483)
>
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
>
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
>
> ... 11 more
>
> 15/01/08 09:50:29 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 10, (reason: Uncaught exception: Invalid ContainerId:
> container_e01_1420481081140_0006_01_01)
>
> --
> Thanks & Regards,
>
> *Mukesh Jha *
>


Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Sorry for the noise; but I just remembered you're actually using MapR
(and not HDFS), so maybe the "3777" trick could work...

On Thu, Jan 8, 2015 at 10:32 AM, Marcelo Vanzin  wrote:
> Nevermind my last e-mail. HDFS complains about not understanding "3777"...
>
> On Thu, Jan 8, 2015 at 9:46 AM, Marcelo Vanzin  wrote:
>> Hmm. Can you set the permissions of "/apps/spark/historyserver/logs"
>> to 3777? I'm not sure HDFS respects the group id bit, but it's worth a
>> try. (BTW that would only affect newly created log directories.)
>>
>> On Thu, Jan 8, 2015 at 1:22 AM,   wrote:
>>> Hi Vanzin,
>>>
>>> I am using the MapR distribution of Hadoop. The history server logs are 
>>> created by a job with the permissions:
>>>
>>> drwxrwx---   - 2 2015-01-08 09:14 
>>> /apps/spark/historyserver/logs/spark-1420708455212
>>>
>>> However, the permissions of the higher directories are mapr:mapr and the 
>>> user that runs Spark in our case is a unix ID called mapr (in the mapr 
>>> group). Therefore, this can't read my job event logs as shown above.
>>>
>>>
>>> Thanks,
>>> Michael
>>>
>>>
>>> -Original Message-
>>> From: Marcelo Vanzin [mailto:van...@cloudera.com]
>>> Sent: 07 January 2015 18:10
>>> To: England, Michael (IT/UK)
>>> Cc: user@spark.apache.org
>>> Subject: Re: Spark History Server can't read event logs
>>>
>>> The Spark code generates the log directory with "770" permissions. On top 
>>> of that you need to make sure of two things:
>>>
>>> - all directories up to /apps/spark/historyserver/logs/ are readable by the 
>>> user running the history server
>>> - the user running the history server belongs to the group that owns 
>>> /apps/spark/historyserver/logs/
>>>
>>> I think the code could be more explicitly about setting the group of the 
>>> generated log directories and files, but if you follow the two rules above 
>>> things should work. Also, I recommend setting 
>>> /apps/spark/historyserver/logs/ itself to "1777" so that any user can 
>>> generate logs, but only the owner (or a superuser) can delete them.
>>>
>>>
>>>
>>> On Wed, Jan 7, 2015 at 7:45 AM,   wrote:
 Hi,



 When I run jobs and save the event logs, they are saved with the
 permissions of the unix user and group that ran the spark job. The
 history server is run as a service account and therefore can’t read the 
 files:



 Extract from the History server logs:



 2015-01-07 15:37:24,3021 ERROR Client
 fs/client/fileclient/cc/client.cc:1009
 Thread: 1183 User does not have access to open file
 /apps/spark/historyserver/logs/spark-1420644521194

 15/01/07 15:37:24 ERROR ReplayListenerBus: Exception in parsing Spark
 event log
 /apps/spark/historyserver/logs/spark-1420644521194/EVENT_LOG_1

 org.apache.hadoop.security.AccessControlException: Open failed for file:
 /apps/spark/historyserver/logs/spark-1420644521194/EVENT_LOG_1, error:
 Permission denied (13)



 Is there a setting which I can change that allows the files to be
 world readable or at least by the account running the history server?
 Currently, the job appears in the History Sever UI but only states ‘>>> Started>’.



 Thanks,

 Michael


>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>>

Find S3 file attributes by Spark

2015-01-08 Thread rajnish
Hi,

We have a file in an AWS S3 bucket that is loaded frequently. When accessing
that file from Spark, can we get the file properties by some method in Spark?
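
A rough sketch using the Hadoop FileSystem API with the configuration Spark
already carries (the bucket, key and s3n scheme below are placeholders):

import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("s3n://my-bucket/some/key")
val fs = FileSystem.get(path.toUri, sc.hadoopConfiguration)
val status = fs.getFileStatus(path)
println(s"size=${status.getLen} bytes, modified=${status.getModificationTime}")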


Regards
Raj



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Find-S3-file-attributes-by-Spark-tp21039.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL support for reading Avro files

2015-01-08 Thread yanenli2
Thanks for the reply! Now I know that this package has moved here:
https://github.com/databricks/spark-avro



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-support-for-reading-Avro-files-tp20981p21040.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Mukesh Jha
On Thu, Jan 8, 2015 at 5:08 PM, Mukesh Jha  wrote:

> Hi Experts,
>
> I am running spark inside YARN job.
>
> The spark-streaming job is running fine in CDH-5.0.0 but after the upgrade
> to 5.3.0 it cannot fetch containers with the below errors. Looks like the
> container id is incorrect and a string is present in a place where it's
> expecting a number.
>
>
>
> java.lang.IllegalArgumentException: Invalid ContainerId:
> container_e01_1420481081140_0006_01_01
>
> Caused by: java.lang.NumberFormatException: For input string: "e01"
>
>
>
> Is this a bug?? Did you face something similar and any ideas how to fix
> this?
>
>
>
> 15/01/08 09:50:28 INFO yarn.ApplicationMaster: Registered signal handlers
> for [TERM, HUP, INT]
>
> 15/01/08 09:50:29 ERROR yarn.ApplicationMaster: Uncaught exception:
>
> java.lang.IllegalArgumentException: Invalid ContainerId:
> container_e01_1420481081140_0006_01_01
>
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
>
> at
> org.apache.spark.deploy.yarn.YarnRMClientImpl.getAttemptId(YarnRMClientImpl.scala:79)
>
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:79)
>
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:515)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
>
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:513)
>
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>
> Caused by: java.lang.NumberFormatException: For input string: "e01"
>
> at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>
> at java.lang.Long.parseLong(Long.java:441)
>
> at java.lang.Long.parseLong(Long.java:483)
>
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
>
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
>
> ... 11 more
>
> 15/01/08 09:50:29 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 10, (reason: Uncaught exception: Invalid ContainerId:
> container_e01_1420481081140_0006_01_01)
>
> --
> Thanks & Regards,
>
> *Mukesh Jha *
>



-- 


Thanks & Regards,

*Mukesh Jha *


Re: Spark on teradata?

2015-01-08 Thread gen tang
Thanks a lot for your reply.
In fact, I need to work on almost all the data in teradata (~100T). So, I
don't think that jdbcRDD is a good choice.

Cheers
Gen


On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin  wrote:

> Depending on your use cases. If the use case is to extract small amount of
> data out of teradata, then you can use the JdbcRDD and soon a jdbc input
> source based on the new Spark SQL external data source API.
>
>
>
> On Wed, Jan 7, 2015 at 7:14 AM, gen tang  wrote:
>
>> Hi,
>>
>> I have a stupid question:
>> Is it possible to use spark on Teradata data warehouse, please? I read
>> some news on internet which say yes. However, I didn't find any example
>> about this issue
>>
>> Thanks in advance.
>>
>> Cheers
>> Gen
>>
>>
>


Re: Spark on teradata?

2015-01-08 Thread Reynold Xin
Depending on your use cases. If the use case is to extract small amount of
data out of teradata, then you can use the JdbcRDD and soon a jdbc input
source based on the new Spark SQL external data source API.



On Wed, Jan 7, 2015 at 7:14 AM, gen tang  wrote:

> Hi,
>
> I have a stupid question:
> Is it possible to use spark on Teradata data warehouse, please? I read
> some news on internet which say yes. However, I didn't find any example
> about this issue
>
> Thanks in advance.
>
> Cheers
> Gen
>
>


Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Nevermind my last e-mail. HDFS complains about not understanding "3777"...

On Thu, Jan 8, 2015 at 9:46 AM, Marcelo Vanzin  wrote:
> Hmm. Can you set the permissions of "/apps/spark/historyserver/logs"
> to 3777? I'm not sure HDFS respects the group id bit, but it's worth a
> try. (BTW that would only affect newly created log directories.)
>
> On Thu, Jan 8, 2015 at 1:22 AM,   wrote:
>> Hi Vanzin,
>>
>> I am using the MapR distribution of Hadoop. The history server logs are 
>> created by a job with the permissions:
>>
>> drwxrwx---   - 2 2015-01-08 09:14 
>> /apps/spark/historyserver/logs/spark-1420708455212
>>
>> However, the permissions of the higher directories are mapr:mapr and the 
>> user that runs Spark in our case is a unix ID called mapr (in the mapr 
>> group). Therefore, this can't read my job event logs as shown above.
>>
>>
>> Thanks,
>> Michael
>>
>>
>> -Original Message-
>> From: Marcelo Vanzin [mailto:van...@cloudera.com]
>> Sent: 07 January 2015 18:10
>> To: England, Michael (IT/UK)
>> Cc: user@spark.apache.org
>> Subject: Re: Spark History Server can't read event logs
>>
>> The Spark code generates the log directory with "770" permissions. On top of 
>> that you need to make sure of two things:
>>
>> - all directories up to /apps/spark/historyserver/logs/ are readable by the 
>> user running the history server
>> - the user running the history server belongs to the group that owns 
>> /apps/spark/historyserver/logs/
>>
>> I think the code could be more explicitly about setting the group of the 
>> generated log directories and files, but if you follow the two rules above 
>> things should work. Also, I recommend setting 
>> /apps/spark/historyserver/logs/ itself to "1777" so that any user can 
>> generate logs, but only the owner (or a superuser) can delete them.
>>
>>
>>
>> On Wed, Jan 7, 2015 at 7:45 AM,   wrote:
>>> Hi,
>>>
>>>
>>>
>>> When I run jobs and save the event logs, they are saved with the
>>> permissions of the unix user and group that ran the spark job. The
>>> history server is run as a service account and therefore can’t read the 
>>> files:
>>>
>>>
>>>
>>> Extract from the History server logs:
>>>
>>>
>>>
>>> 2015-01-07 15:37:24,3021 ERROR Client
>>> fs/client/fileclient/cc/client.cc:1009
>>> Thread: 1183 User does not have access to open file
>>> /apps/spark/historyserver/logs/spark-1420644521194
>>>
>>> 15/01/07 15:37:24 ERROR ReplayListenerBus: Exception in parsing Spark
>>> event log
>>> /apps/spark/historyserver/logs/spark-1420644521194/EVENT_LOG_1
>>>
>>> org.apache.hadoop.security.AccessControlException: Open failed for file:
>>> /apps/spark/historyserver/logs/spark-1420644521194/EVENT_LOG_1, error:
>>> Permission denied (13)
>>>
>>>
>>>
>>> Is there a setting which I can change that allows the files to be
>>> world readable or at least by the account running the history server?
>>> Currently, the job appears in the History Sever UI but only states ‘>> Started>’.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>>
>>
>>
>>
>> --
>> Marcelo
>>
>>

Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Hmm. Can you set the permissions of "/apps/spark/historyserver/logs"
to 3777? I'm not sure HDFS respects the group id bit, but it's worth a
try. (BTW that would only affect newly created log directories.)

On Thu, Jan 8, 2015 at 1:22 AM,   wrote:
> Hi Vanzin,
>
> I am using the MapR distribution of Hadoop. The history server logs are 
> created by a job with the permissions:
>
> drwxrwx---   - 2 2015-01-08 09:14 
> /apps/spark/historyserver/logs/spark-1420708455212
>
> However, the permissions of the higher directories are mapr:mapr and the user 
> that runs Spark in our case is a unix ID called mapr (in the mapr group). 
> Therefore, this can't read my job event logs as shown above.
>
>
> Thanks,
> Michael
>
>
> -Original Message-
> From: Marcelo Vanzin [mailto:van...@cloudera.com]
> Sent: 07 January 2015 18:10
> To: England, Michael (IT/UK)
> Cc: user@spark.apache.org
> Subject: Re: Spark History Server can't read event logs
>
> The Spark code generates the log directory with "770" permissions. On top of 
> that you need to make sure of two things:
>
> - all directories up to /apps/spark/historyserver/logs/ are readable by the 
> user running the history server
> - the user running the history server belongs to the group that owns 
> /apps/spark/historyserver/logs/
>
> I think the code could be more explicitly about setting the group of the 
> generated log directories and files, but if you follow the two rules above 
> things should work. Also, I recommend setting /apps/spark/historyserver/logs/ 
> itself to "1777" so that any user can generate logs, but only the owner (or a 
> superuser) can delete them.
>
>
>
> On Wed, Jan 7, 2015 at 7:45 AM,   wrote:
>> Hi,
>>
>>
>>
>> When I run jobs and save the event logs, they are saved with the
>> permissions of the unix user and group that ran the spark job. The
>> history server is run as a service account and therefore can’t read the 
>> files:
>>
>>
>>
>> Extract from the History server logs:
>>
>>
>>
>> 2015-01-07 15:37:24,3021 ERROR Client
>> fs/client/fileclient/cc/client.cc:1009
>> Thread: 1183 User does not have access to open file
>> /apps/spark/historyserver/logs/spark-1420644521194
>>
>> 15/01/07 15:37:24 ERROR ReplayListenerBus: Exception in parsing Spark
>> event log
>> /apps/spark/historyserver/logs/spark-1420644521194/EVENT_LOG_1
>>
>> org.apache.hadoop.security.AccessControlException: Open failed for file:
>> /apps/spark/historyserver/logs/spark-1420644521194/EVENT_LOG_1, error:
>> Permission denied (13)
>>
>>
>>
>> Is there a setting which I can change that allows the files to be
>> world readable or at least by the account running the history server?
>> Currently, the job appears in the History Sever UI but only states ‘> Started>’.
>>
>>
>>
>> Thanks,
>>
>> Michael
>>
>>
>
>
>
> --
> Marcelo
>
>

Spark Streaming Checkpointing

2015-01-08 Thread Asim Jalis
Since checkpointing in streaming apps happens every checkpoint duration, in
the event of failure, how is the system able to recover the state changes
that happened after the last checkpoint?


Re: Join RDDs with DStreams

2015-01-08 Thread Gerard Maas
You are looking for dstream.transform(rdd => rdd.join(otherRdd))

The docs contain an example on how to use transform.

https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
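
A minimal sketch of that pattern (the path, key parsing, and value types are
illustrative):

import org.apache.spark.SparkContext._
import org.apache.spark.streaming.dstream.DStream

// Static lookup table loaded once from HDFS as a pair RDD.
val lookup = sc.textFile("hdfs:///data/lookup.csv")
  .map { line => val Array(k, v) = line.split(","); (k, v) }

def withLookup(stream: DStream[(String, Int)]): DStream[(String, (Int, String))] =
  stream.transform(rdd => rdd.join(lookup))   // join each batch RDD with the static RDD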

-kr, Gerard.

On Thu, Jan 8, 2015 at 5:50 PM, Asim Jalis  wrote:

> Is there a way to join non-DStream RDDs with DStream RDDs?
>
> Here is the use case. I have a lookup table stored in HDFS that I want to
> read as an RDD. Then I want to join it with the RDDs that are coming in
> through the DStream. How can I do this?
>
> Thanks.
>
> Asim
>


Join RDDs with DStreams

2015-01-08 Thread Asim Jalis
Is there a way to join non-DStream RDDs with DStream RDDs?

Here is the use case. I have a lookup table stored in HDFS that I want to
read as an RDD. Then I want to join it with the RDDs that are coming in
through the DStream. How can I do this?

Thanks.

Asim


Re: Registering custom metrics

2015-01-08 Thread Gerard Maas
Very interesting approach. Thanks for sharing it!

On Thu, Jan 8, 2015 at 5:30 PM, Enno Shioji  wrote:

> FYI I found this approach by Ooyala.
>
> /** Instrumentation for Spark based on accumulators.
>   *
>   * Usage:
>   * val instrumentation = new SparkInstrumentation("example.metrics")
>   * val numReqs = sc.accumulator(0L)
>   * instrumentation.source.registerDailyAccumulator(numReqs, "numReqs")
>   * instrumentation.register()
>   *
>   * Will create and report the following metrics:
>   * - Gauge with total number of requests (daily)
>   * - Meter with rate of requests
>   *
>   * @param prefix prefix for all metrics that will be reported by this 
> Instrumentation
>   */
>
> https://gist.github.com/ibuenros/9b94736c2bad2f4b8e23
>
> On Mon, Jan 5, 2015 at 2:56 PM, Enno Shioji  wrote:
>
>> Hi Gerard,
>>
>> Thanks for the answer! I had a good look at it, but I couldn't figure out
>> whether one can use that to emit metrics from your application code.
>>
>> Suppose I wanted to monitor the rate of bytes I produce, like so:
>>
>> stream
>> .map { input =>
>>   val bytes = produce(input)
>>   // metricRegistry.meter("some.metrics").mark(bytes.length)
>>   bytes
>> }
>> .saveAsTextFile("text")
>>
>> Is there a way to achieve this with the MetricSystem?
>>
>>
>>
>> On Mon, Jan 5, 2015 at 10:24 AM, Gerard Maas 
>> wrote:
>>
>>> Hi,
>>>
>>> Yes, I managed to register custom metrics by creating an
>>> implementation of org.apache.spark.metrics.source.Source and
>>> registering it with the metrics subsystem.
>>> Source is [Spark] private, so you need to create it under a org.apache.spark
>>> package. In my case, I'm dealing with Spark Streaming metrics, and I
>>> created my CustomStreamingSource under org.apache.spark.streaming as I
>>> also needed access to some [Streaming] private components.
>>>
>>> Then, you register your new metric Source on the Spark's metric system,
>>> like so:
>>>
>>> SparkEnv.get.metricsSystem.registerSource(customStreamingSource)
>>>
>>> And it will get reported to the metrics sinks active on your system. By
>>> default, you can access them through the metrics endpoint:
>>> http://<driver>:<port>/metrics/json
>>>
>>> I hope this helps.
>>>
>>> -kr, Gerard.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Dec 30, 2014 at 3:32 PM, eshioji  wrote:
>>>
 Hi,

 Did you find a way to do this / working on this?
 Am trying to find a way to do this as well, but haven't been able to
 find a
 way.



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Registering-custom-metrics-tp9030p9968.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>>
>
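
For reference, a rough sketch of the Source-based approach described above. It assumes
Spark 1.x, where org.apache.spark.metrics.source.Source is [Spark] private and exposes
sourceName and metricRegistry; the class name and the gauge are illustrative:

package org.apache.spark.metrics.source

import com.codahale.metrics.{Gauge, MetricRegistry}

// Lives under org.apache.spark.* only because Source is package-private.
class MyAppSource extends Source {
  override val sourceName = "MyApp"
  override val metricRegistry = new MetricRegistry

  // Example: a gauge backed by a piece of application state we update ourselves.
  @volatile var lastBatchRecords: Long = 0L

  metricRegistry.register(MetricRegistry.name("lastBatchRecords"), new Gauge[Long] {
    override def getValue: Long = lastBatchRecords
  })
}

// Registration on the driver, exactly as described in the thread:
//   val source = new MyAppSource
//   org.apache.spark.SparkEnv.get.metricsSystem.registerSource(source)
// after which the metrics are reported to the configured sinks and to
// http://<driver>:4040/metrics/json (assuming the default UI port).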


Re: Registering custom metrics

2015-01-08 Thread Enno Shioji
FYI I found this approach by Ooyala.

/** Instrumentation for Spark based on accumulators.
  *
  * Usage:
  * val instrumentation = new SparkInstrumentation("example.metrics")
  * val numReqs = sc.accumulator(0L)
  * instrumentation.source.registerDailyAccumulator(numReqs, "numReqs")
  * instrumentation.register()
  *
  * Will create and report the following metrics:
  * - Gauge with total number of requests (daily)
  * - Meter with rate of requests
  *
  * @param prefix prefix for all metrics that will be reported by this
Instrumentation
  */

https://gist.github.com/ibuenros/9b94736c2bad2f4b8e23
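
A rough sketch of how the usage notes above might be wired into the byte-counting
example from earlier in the thread. Only the calls shown in the gist's usage comment
are assumed to exist: SparkInstrumentation comes from the gist and must be on the
classpath (its import is omitted here), and produce stands in for the user's own
serialization step:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical wiring; an accumulator replaces the per-executor meter.
def producedWithMetrics(sc: SparkContext, input: RDD[String])
                       (produce: String => Array[Byte]): RDD[Array[Byte]] = {
  val instrumentation = new SparkInstrumentation("example.metrics")
  val bytesProduced = sc.accumulator(0L)
  instrumentation.source.registerDailyAccumulator(bytesProduced, "bytesProduced")
  instrumentation.register()

  input.map { record =>
    val bytes = produce(record)
    bytesProduced += bytes.length.toLong  // accumulated on the driver, reported by the source
    bytes
  }
}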

On Mon, Jan 5, 2015 at 2:56 PM, Enno Shioji  wrote:

> Hi Gerard,
>
> Thanks for the answer! I had a good look at it, but I couldn't figure out
> whether one can use that to emit metrics from your application code.
>
> Suppose I wanted to monitor the rate of bytes I produce, like so:
>
> stream
> .map { input =>
>   val bytes = produce(input)
>   // metricRegistry.meter("some.metrics").mark(bytes.length)
>   bytes
> }
> .saveAsTextFile("text")
>
> Is there a way to achieve this with the MetricSystem?
>
>
>
> On Mon, Jan 5, 2015 at 10:24 AM, Gerard Maas 
> wrote:
>
>> Hi,
>>
>> Yes, I managed to register custom metrics by creating an
>> implementation of org.apache.spark.metrics.source.Source and
>> registering it with the metrics subsystem.
>> Source is [Spark] private, so you need to create it under a org.apache.spark
>> package. In my case, I'm dealing with Spark Streaming metrics, and I
>> created my CustomStreamingSource under org.apache.spark.streaming as I
>> also needed access to some [Streaming] private components.
>>
>> Then, you register your new metric Source on the Spark's metric system,
>> like so:
>>
>> SparkEnv.get.metricsSystem.registerSource(customStreamingSource)
>>
>> And it will get reported to the metrics sinks active on your system. By
>> default, you can access them through the metrics endpoint:
>> http://<driver>:<port>/metrics/json
>>
>> I hope this helps.
>>
>> -kr, Gerard.
>>
>>
>>
>>
>>
>>
>> On Tue, Dec 30, 2014 at 3:32 PM, eshioji  wrote:
>>
>>> Hi,
>>>
>>> Did you find a way to do this / working on this?
>>> Am trying to find a way to do this as well, but haven't been able to
>>> find a
>>> way.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Registering-custom-metrics-tp9030p9968.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>


Re: Saving partial (top 10) DStream windows to hdfs

2015-01-08 Thread Yana Kadiyska
I'm glad you solved this issue but have a followup question for you.
Wouldn't Akhil's solution be better for you after all? I run similar
computation where a large set of data gets reduced to a much smaller
aggregate in an interval. If you do saveAsText without coalescing, I
believe you'd get the same number of files as you have partitions. So in a
worse case scenario, you'd end up with 10 partitions(if each item in your
top 10 was in a different partition) and thus, 10 output files. Seems to me
this would be horrible for subsequent processing of these (as in, this is
small-files in HDFS at its worst). But even with coalesce(1) you'd get one
pretty small file on every interval.

I'm curious what you think because every time I come to a situation like
this I end up using a store that supports appends (some sort of DB) but
it's more code/work. So I'm curious about your experience of saving to files
(unless you have a separate process to merge these chunks, of course).
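
For reference, a minimal sketch of the pattern being discussed: sort per window, keep
the first N entries via zipWithIndex, and coalesce each interval's output to a single
(small) file. The N=10 cutoff, the (value, count) types and the output prefix are
illustrative:

import org.apache.spark.streaming.dstream.DStream

// sortedCounts: (value, count) pairs already sorted by count, descending.
def saveTopN(sortedCounts: DStream[(Long, Long)], n: Int, prefix: String): Unit = {
  sortedCounts
    .transform(rdd => rdd.zipWithIndex().filter(_._2 < n).map(_._1))  // keep the first n entries
    .map { case (value, count) => value + "," + count }
    .foreachRDD((rdd, time) =>
      // one small file per interval instead of one file per partition
      rdd.coalesce(1).saveAsTextFile(prefix + "-" + time.milliseconds))
}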

On Wed, Jan 7, 2015 at 11:56 AM, Laeeq Ahmed  wrote:

> Hi,
>
> It worked out as this.
>
> val topCounts = sortedCounts.transform(rdd =>
> rdd.zipWithIndex().filter(x=>x._2 <=10))
>
> Regards,
> Laeeq
>
>
>   On Wednesday, January 7, 2015 5:25 PM, Laeeq Ahmed
>  wrote:
>
>
> Hi Yana,
>
> I also think that
> *val top10 = your_stream.mapPartitions(rdd => rdd.take(10))*
>
>
> will give top 10 from each partition. I will try your code.
>
> Regards,
> Laeeq
>
>
>   On Wednesday, January 7, 2015 5:19 PM, Yana Kadiyska <
> yana.kadiy...@gmail.com> wrote:
>
>
> My understanding is that
>
> *val top10 = your_stream.mapPartitions(rdd => rdd.take(10))*
>
> would result in an RDD containing the top 10 entries per partition -- am I
> wrong?
>
> I am not sure if there is a more efficient way but I think this would work:
>
> *sortedCounts.*zipWithIndex().filter(x=>x._2 <=10).saveAsText
>
> On Wed, Jan 7, 2015 at 10:38 AM, Laeeq Ahmed  > wrote:
>
> Hi,
>
> I applied it as fallows:
>
>eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group,
> Map(args(a) -> 1),StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
> val counts = eegStreams(a).map(x =>
> math.round(x.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4))
> val sortedCounts = counts.map(_.swap).transform(rdd =>
> rdd.sortByKey(false)).map(_.swap)
> val topCounts = sortedCounts.mapPartitions(rdd=>rdd.take(10))
> *//val topCounts = sortedCounts.transform(rdd =>
> ssc.sparkContext.makeRDD(rdd.take(10)))*
> topCounts.map(tuple => "%s,%s".format(tuple._1,
> tuple._2)).saveAsTextFiles("hdfs://
> ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" +
> (a+1))
> topCounts.print()
>
> It gives the output with 10 extra values. I think it works on partition of
> each rdd rather than just rdd. I also tried the commented code. It gives
> correct result but in the start it gives serialisation error
>
> ERROR actor.OneForOneStrategy: org.apache.spark.streaming.StreamingContext
> java.io.NotSerializableException:
> org.apache.spark.streaming.StreamingContext
>
> Output for code in red: The values in green looks extra to me.
>
> 0,578
> -3,576
> 4,559
> -1,556
> 3,553
> -6,540
> 6,538
> -4,535
> 1,526
> 10,483
> *94,8*
> *-113,8*
> *-137,8*
> *-85,8*
> *-91,8*
> *-121,8*
> *114,8*
> *108,8*
> *93,8*
> *101,8*
> 1,128
> -8,118
> 3,112
> -4,110
> -13,108
> 4,108
> -3,107
> -10,107
> -6,106
> 8,105
> *76,6*
> *74,6*
> *60,6*
> *52,6*
> *70,6*
> *71,6*
> *-60,6*
> *55,6*
> *78,5*
> *64,5*
>
> and so on.
>
> Regards,
> Laeeq
>
>
>
>   On Tuesday, January 6, 2015 9:06 AM, Akhil Das <
> ak...@sigmoidanalytics.com> wrote:
>
>
> You can try something like:
>
> *val top10 = your_stream.mapPartitions(rdd => rdd.take(10))*
>
>
> Thanks
> Best Regards
>
> On Mon, Jan 5, 2015 at 11:08 PM, Laeeq Ahmed  > wrote:
>
> Hi,
>
> I am counting values in each window and find the top values and want to
> save only the top 10 frequent values of each window to hdfs rather than all
> the values.
>
> *eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a)
> -> 1),StorageLevel.MEMORY_AND_DISK_SER).map(_._2)*
> *val counts = eegStreams(a).map(x => (math.round(x.toDouble),
> 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(4), Seconds(4))*
> *val sortedCounts = counts.map(_.swap).transform(rdd =>
> rdd.sortByKey(false)).map(_.swap)*
> *//sortedCounts.foreachRDD(rdd =>println("\nTop 10 amplitudes:\n" +
> rdd.take(10).mkString("\n")))*
> *sortedCounts.map(tuple => "%s,%s".format(tuple._1,
> tuple._2)).saveAsTextFiles("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/
> "
> + (a+1))*
>
> I can print top 10 as above in red.
>
> I have also tried
>
> *sortedCounts.foreachRDD{ rdd =>
> ssc.sparkContext.parallelize(rdd.take(10)).saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/
> "
> + (a+1))} *
>
> but I

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2015-01-08 Thread Spidy
Hi,

Can you please explain which settings you changed?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-ConnectionManager-Corresponding-SendingConnection-to-ConnectionManagerId-tp17050p21035.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Project Fails to run multicore in local mode.

2015-01-08 Thread mixtou
I am new to Apache Spark, now I am trying my first project "Space Saving
Counting Algorithm" and while it compiles in single core using
.setMaster("local") it fails when using .setMaster("local[4]") or any number > 1.

My Code follows:
================

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by mixtou on 30/12/14.
 */
object SpaceSaving {

  var frequent_words_counters = scala.collection.immutable.Map[String, Array[Int]]();
  var guaranteed_words = scala.collection.immutable.Map[String, Array[Int]]();
  val top_k: Int = 100;
  var words_no: Int = 0;
  var tStart: Long = 0;
  var tStop: Long = 0;
  var fi: Double = 0.001;

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Space Saving Project").setMaster("local");
    val ctx = new SparkContext(sparkConf);
    //val stopwords = ctx.textFile("/Users/mixtou/PhD/Courses/Databases_Advanced/Project/scala_spark_space_saving/src/main/resources/stopwords.txt");
    val lines = ctx.textFile("/Users/mixtou/PhD/Courses/Databases_Advanced/Project/scala_spark_space_saving/src/main/resources/README.txt", 2)
      .map(line => line.toLowerCase());
    val nonEmptyLines = lines.filter(line => line.nonEmpty);
    val regex = "[,.:;'\"\\?\\-!\\(\\)\\+\\[\\]\\d+]".r;
    //val temp = nonEmptyLines.map(removeStopWords(_, stopwords));
    val cleanLines = nonEmptyLines.map(line => regex.replaceAllIn(line, " "));
    val dirtyWords = cleanLines.flatMap(line => line.split("\\s+"));
    val words = dirtyWords.filter(word => word.length > 3);
    words.foreach(word => space_saving_algorithm(word));  // <-- ERROR HERE!
    if (frequent_words_counters.size > 0) {
      frequent_words_counters.foreach(line => println("Top Frequent Word: " + line._1
        + " with count: " + line._2(0) + " end error: " + line._2(1)));
    }
    System.out.println("=== Throughput:=> " + 1000*(words_no/(tStop - tStart)) + " words per second. ");
    estimateGuaranteedFrequentWords();
    words.collect();
    ctx.stop();
  }

  def space_saving_algorithm(word: String): Unit = {
    //System.out.println(word);
    if (frequent_words_counters.contains(word)) {
      val count = frequent_words_counters.get(word).get(0);  // <-- ERROR HERE
      val error = frequent_words_counters.get(word).get(1);
      //System.out.println("Before: " + word + " <=> " + count);
      frequent_words_counters += word -> Array[Int](count+1, error);
      //System.out.println("After: " + word + " <=> " + counters.get(word).get(0));
    } else {
      if (frequent_words_counters.size < top_k) {
        frequent_words_counters += word -> Array[Int](1, 0);
        //System.out.println("Adding Word to Counters: " + word);
      } else {
        replaceLeastEntry(word);
      }
    }
    if (words_no > 0) {
      tStop = java.lang.System.currentTimeMillis();
    } else {
      tStart = java.lang.System.currentTimeMillis();
    }
    words_no += 1;
  }

  def replaceLeastEntry(word: String): Unit = {
    var temp_list = frequent_words_counters.toList.sortWith( (x, y) => x._2(0) > y._2(0) );
    val word_count = temp_list.last._2(0);
    //System.out.println("Replacing word: " + temp_list.last._1 + ", having count and error "
    //  + temp_list.last._2(0) + " , " + temp_list.last._2(1) + " with word: " + word);
    //System.out.println(temp_list.length);
    temp_list = temp_list.take(temp_list.length - 1);
    frequent_words_counters = temp_list.toMap[String, Array[Int]];
    frequent_words_counters += word -> Array[Int](word_count+1, word_count);
  }

  def estimateGuaranteedFrequentWords(): Unit = {
    frequent_words_counters.foreach{ tuple =>
      if (tuple._2(0) - tuple._2(1) < words_no*fi) {
        //counters -= tuple._1;
        guaranteed_words += tuple;
        //System.out.println("NEW SISZEZ: " + counters.size);
      } else {
        System.out.println("Guaranteed Word : " + tuple._1 + " with count: "
          + tuple._2(0) + " and error: " + tuple._2(1));
      }
    };
  }
}

The Compiler Error is:
================

15/01/08 16:44:51 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.util.NoSuchElementException: None.get
        at scala.None$.get(Option.scala:313)
        at scala.None$.get(Option.scala:311)
        at SpaceSaving$.space_saving_algorithm(SpaceSaving.scala:52)
        at SpaceSaving$$anonfun$main$1.apply(SpaceSaving.scala:36)
        at SpaceSaving$$anonfun$main$1.apply(SpaceSaving.scala:36)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:765)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:765)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.execu

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian

Ah, my bad... You're absolutely right!

Just checked how this number is computed. It turned out that once an RDD 
block is retrieved from the block manager, the size of the block is 
added to the input bytes. Spark SQL's in-memory columnar format stores 
all columns within a single partition into a single RDD block, which is 
why the input bytes size always equals the size of the whole table. 
However, when decompressing and reading values from the columnar byte buffers, 
only the required byte buffer(s) are actually touched.


Cheng

On 1/8/15 10:13 PM, Xuelin Cao wrote:


Hi, Cheng

  In your code:

cacheTable("tbl")
sql("select * from tbl").collect() sql("select name from tbl").collect()

 Running the first sql, the whole table is not cached yet. So the 
*input data will be the original json file. *
 After it is cached, the json format data is removed, so the total 
amount of data also drops.


 If you try like this:

cacheTable("tbl")
sql("select * from tbl").collect() sql("select name from tbl").collect()
sql("select * from tbl").collect()

 Is the input data of the 3rd SQL bigger than 49.1KB?




On Thu, Jan 8, 2015 at 9:36 PM, Cheng Lian wrote:


Weird, which version did you use? Just tried a small snippet in
Spark 1.2.0 shell as follows, the result showed in the web UI
meets the expectation quite well:

import org.apache.spark.sql.SQLContext
import sc._

val sqlContext = new SQLContext(sc)
import sqlContext._

jsonFile("file:///tmp/p.json").registerTempTable("tbl")
cacheTable("tbl")
sql("select * from tbl").collect()
sql("select name from tbl").collect()

The input data of the first statement is 292KB, the second is 49.1KB.

The JSON file I used is examples/src/main/resources/people.json,
I copied its contents multiple times to generate a larger file.

Cheng

On 1/8/15 7:43 PM, Xuelin Cao wrote:




Hi, Cheng

 I checked the Input data for each stage. For example, in my
attached screen snapshot, the input data is 1212.5MB, which is
the total amount of the whole table

Inline image 1

 And, I also check the input data for each task (in the stage
detail page). And the sum of the input data for each task is also
1212.5MB




On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian wrote:

Hey Xuelin, which data item in the Web UI did you check?


On 1/7/15 5:37 PM, Xuelin Cao wrote:


Hi,

  Curious and curious. I'm puzzled by the Spark SQL
cached table.

Theoretically, the cached table should be columnar table,
and only scan the column that included in my SQL.

  However, in my test, I always see the whole table is
scanned even though I only "select" one column in my SQL.

  Here is my code:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
sqlContext.cacheTable("adTable")  // The table has > 10 columns

// First run, cache the table into memory
sqlContext.sql("select * from adTable").collect

// Second run, only one column is used. It should only scan a small fraction of data
sqlContext.sql("select adId from adTable").collect
sqlContext.sql("select adId from adTable").collect
sqlContext.sql("select adId from adTable").collect

What I found is, every time I run the SQL, in WEB
UI, it shows the total amount of input data is always the
same --- the total amount of the table.

Is anything wrong? My expectation is:
1. The cached table is stored as columnar table
2. Since I only need one column in my SQL, the total
amount of input data showed in WEB UI should be very small

But what I found is totally not the case. Why?

Thanks





​






Parquet compression codecs not applied

2015-01-08 Thread Ayoub
Hello,

I tried to save a table created via the hive context as a parquet file but
whatever compression codec (uncompressed, snappy, gzip or lzo) I set via
setConf like:

setConf("spark.sql.parquet.compression.codec", "gzip")

the size of the generated files is always the same, so it seems like
spark context ignores the compression codec that I set.

Here is a code sample applied via the spark shell:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)

hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") // required
to make data compatible with impala
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")

hiveContext.sql("create external table if not exists foo (bar STRING, ts
INT) Partitioned by (year INT, month INT, day INT) STORED AS PARQUET
Location 'hdfs://path/data/foo'")

hiveContext.sql("insert into table foo partition(year, month,day) select *,
year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month, 
day(from_unixtime(ts)) as day from raw_foo")

I tried that with spark 1.2 and 1.3 snapshot against hive 0.13
and I also tried that with Impala on the same cluster which applied
correctly the compression codecs.

Does anyone know what could be the problem ?

Thanks,
Ayoub.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-compression-codecs-not-applied-tp21033.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Build spark source code with Maven in Intellij Idea

2015-01-08 Thread Sean Owen
Popular topic in the last 48 hours! Just about 20 minutes ago I
collected some recent information on just this topic into a pull
request.

https://github.com/apache/spark/pull/3952

On Thu, Jan 8, 2015 at 2:24 PM, Todd  wrote:
> Hi,
> I have imported the Spark source code into IntelliJ IDEA as an SBT project. I
> tried to do a Maven install in IntelliJ IDEA by clicking Install in the Spark
> Project Parent POM (root), but it failed.
> I would ask which profiles should be checked. What I want to achieve is
> starting Spark in the IDE against Hadoop 2.4 on my local machine. At this point, I only
> care about Hadoop 2.4, not HBase or Hive...

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Build spark source code with Maven in Intellij Idea

2015-01-08 Thread Todd
Hi,
I have imported the Spark source code into IntelliJ IDEA as an SBT project. I tried
to do a Maven install in IntelliJ IDEA by clicking Install in the Spark Project
Parent POM (root), but it failed.
I would ask which profiles should be checked. What I want to achieve is starting
Spark in the IDE against Hadoop 2.4 on my local machine. At this point, I only care
about Hadoop 2.4, not HBase or Hive...


Re: Executing Spark, Error creating path from empty String.

2015-01-08 Thread Guillermo Ortiz
I was adding some bad jars I guess. I deleted all the jars and copied
them again and it works.

2015-01-08 14:15 GMT+01:00 Guillermo Ortiz :
> When I try to execute my task with Spark it starts to copy the jars it
> needs to HDFS and it finally fails, I don't know exactly why. I have
> checked HDFS and it copies the files, so, it seems to work that part.
> I changed the log level to debug but there's nothing else to help.
> What else does Spark need to copy that it could be an empty string?
>
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 15/01/08 14:06:32 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where
> applicable
> 15/01/08 14:06:32 INFO RMProxy: Connecting to ResourceManager at
> vmlbyarnl01.lvtc.gsnet.corp/180.133.240.174:8050
> 15/01/08 14:06:33 INFO Client: Got cluster metric info from
> ResourceManager, number of NodeManagers: 3
> 15/01/08 14:06:33 INFO Client: Max mem capabililty of a single
> resource in this cluster 97280
> 15/01/08 14:06:33 INFO Client: Preparing Local resources
> 15/01/08 14:06:34 WARN BlockReaderLocal: The short-circuit local reads
> feature cannot be used because libhadoop cannot be loaded.
> 15/01/08 14:06:34 INFO Client: Uploading
> file:/home/spark-1.1.1-bin-hadoop2.4/lib/spark-assembly-1.1.1-hadoop2.4.0.jar
> to 
> hdfs://vmlbnanodl01.lvtc.gsnet.corp:8020/user/hdfs/.sparkStaging/application_1417607109980_0017/spark-assembly-1.1.1-hadoop2.4.0.jar
> 15/01/08 14:06:42 INFO Client: Uploading
> file:/user/local/etc/lib/my-spark-streaming-scala.jar to
> hdfs://vmlbnanodl01.lvtc.gsnet.corp:8020/user/hdfs/.sparkStaging/application_1417607109980_0017/my-spark-streaming-scala.jar
> Exception in thread "main" java.lang.IllegalArgumentException: Can not
> create a Path from an empty string
> at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
> at org.apache.hadoop.fs.Path.(Path.java:135)
> at org.apache.hadoop.fs.Path.(Path.java:94)
> at 
> org.apache.spark.deploy.yarn.ClientBase$class.copyRemoteFile(ClientBase.scala:159)
> at org.apache.spark.deploy.yarn.Client.copyRemoteFile(Client.scala:37)
> at 
> org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5$$anonfun$apply$2.apply(ClientBase.scala:236)
> at 
> org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5$$anonfun$apply$2.apply(ClientBase.scala:231)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:231)
> at 
> org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:229)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:229)
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:37)
> at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:96)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:176)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
Hi, Cheng

  In your code:

cacheTable("tbl")
sql("select * from tbl").collect() sql("select name from tbl").collect()

 Running the first sql, the whole table is not cached yet. So the *input
data will be the original json file. *
 After it is cached, the json format data is removed, so the total
amount of data also drops.

 If you try like this:

cacheTable("tbl")
sql("select * from tbl").collect() sql("select name from tbl").collect()
sql("select * from tbl").collect()

 Is the input data of the 3rd SQL bigger than 49.1KB?




On Thu, Jan 8, 2015 at 9:36 PM, Cheng Lian  wrote:

>  Weird, which version did you use? Just tried a small snippet in Spark
> 1.2.0 shell as follows, the result showed in the web UI meets the
> expectation quite well:
>
> import org.apache.spark.sql.SQLContextimport sc._
> val sqlContext = new SQLContext(sc)import sqlContext._
>
> jsonFile("file:///tmp/p.json").registerTempTable("tbl")
> cacheTable("tbl")
> sql("select * from tbl").collect()
> sql("select name from tbl").collect()
>
> The input data of the first statement is 292KB, the second is 49.1KB.
>
> The JSON file I used is examples/src/main/resources/people.json, I copied
> its contents multiple times to generate a larger file.
>
> Cheng
>
> On 1/8/15 7:43 PM, Xuelin Cao wrote:
>
>
>
>  Hi, Cheng
>
>   I checked the Input data for each stage. For example, in my
> attached screen snapshot, the input data is 1212.5MB, which is the total
> amount of the whole table
>
>  [image: Inline image 1]
>
>   And, I also check the input data for each task (in the stage detail
> page). And the sum of the input data for each task is also 1212.5MB
>
>
>
>
> On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian  wrote:
>
>>  Hey Xuelin, which data item in the Web UI did you check?
>>
>>
>> On 1/7/15 5:37 PM, Xuelin Cao wrote:
>>
>>
>>  Hi,
>>
>>Curious and curious. I'm puzzled by the Spark SQL cached table.
>>
>>Theoretically, the cached table should be columnar table, and
>> only scan the column that included in my SQL.
>>
>>However, in my test, I always see the whole table is scanned even
>> though I only "select" one column in my SQL.
>>
>>Here is my code:
>>
>>
>> *val sqlContext = new org.apache.spark.sql.SQLContext(sc) *
>>
>> *import sqlContext._ *
>>
>> *sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable") *
>> *sqlContext.cacheTable("adTable")  //The table has > 10 columns*
>>
>>  *//First run, cache the table into memory*
>>  *sqlContext.sql("select * from adTable").collect*
>>
>>  *//Second run, only one column is used. It should only scan a small
>> fraction of data*
>>  *sqlContext.sql("select adId from adTable").collect *
>>
>> *sqlContext.sql("select adId from adTable").collect *
>> *sqlContext.sql("select adId from adTable").collect*
>>
>>  What I found is, every time I run the SQL, in WEB UI, it shows
>> the total amount of input data is always the same --- the total amount of
>> the table.
>>
>>  Is anything wrong? My expectation is:
>> 1. The cached table is stored as columnar table
>> 2. Since I only need one column in my SQL, the total amount of
>> input data showed in WEB UI should be very small
>>
>>  But what I found is totally not the case. Why?
>>
>>  Thanks
>>
>>
>>
>​
>


Parquet compression codecs not applied

2015-01-08 Thread Ayoub Benali
Hello,

I tried to save a table created via the hive context as a parquet file but
whatever compression codec (uncompressed, snappy, gzip or lzo) I set via
setConf like:

setConf("spark.sql.parquet.compression.codec", "gzip")

the size of the generated files is always the same, so it seems like
spark context ignores the compression codec that I set.

Here is a code sample applied via the spark shell:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)

hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") // required
to make data compatible with impala
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")

hiveContext.sql("create external table if not exists foo (bar STRING, ts
INT) Partitioned by (year INT, month INT, day INT) STORED AS PARQUET
Location 'hdfs://path/data/foo'")

hiveContext.sql("insert into table foo partition(year, month,day) select *,
year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month,
day(from_unixtime(ts)) as day from raw_foo")

I tried that with spark 1.2 and 1.3 snapshot against hive 0.13
and I also tried that with Impala on the same cluster which applied
correctly the compression codecs.

Does anyone know what could be the problem ?

Thanks,
Ayoub.


Re: Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-08 Thread Cheng Lian
The + operator only handles numeric data types; you may register your
own concat function like this:

sqlContext.registerFunction("concat", (s: String, t: String) => s + t)
sqlContext.sql("select concat('$', col1) from tbl")
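
Applied to the nested-IF query from the original post, the rewrite might look roughly
like this (untested sketch; table and column names are taken from the question, and the
'$' separator mirrors the one visible in the reported exception):

sqlContext.registerFunction("concat", (s: String, t: String) => s + t)

val result = sqlContext.sql(
  "select IF(col1 != '', concat(concat(col1, '$'), col3), " +
  "IF(col2 != '', concat(concat(col2, '$'), col3), '$')) from my_table")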

Cheng

On 1/5/15 1:13 PM, RK wrote:

The issue is happening when I try to concatenate column values in the 
query like "col1+'$'+col3". For some reason, this issue is not manifesting itself when I 
do a single IF query.


Is there a concat function in SparkSQL? I can't find anything in the 
documentation.


Thanks,
RK



On Sunday, January 4, 2015 7:42 PM, RK  wrote:


BTW, I am seeing this issue in Spark 1.1.1.


On Sunday, January 4, 2015 7:29 PM, RK  wrote:


When I use a single IF statement like "select IF(col1 != "", col1+'$'+col3,
col2+'$'+col3) from my_table", it works fine.

However, when I use a nested IF like "select IF(col1 != "", col1+'$'+col3,
IF(col2 != "", col2+'$'+col3, '$')) from my_table", I am getting the following exception.

Exception in thread "main" 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Unresolved attributes: if (NOT (col1#1 = )) (CAST((CAST(col1#1, 
DoubleType) + CAST($, DoubleType)), DoubleType) + CAST(col3#3, 
DoubleType)) else if (NOT (col2#2 = )) (CAST((CAST(col2#2, DoubleType) 
+ CAST($, DoubleType)), DoubleType) + CAST(col3#3, DoubleType)) else $ 
AS c0#4, tree:
Project [if (NOT (col1#1 = )) (CAST((CAST(col1#1, DoubleType) + 
CAST($, DoubleType)), DoubleType) + CAST(col3#3, DoubleType)) else if 
(NOT (col2#2 = )) (CAST((CAST(col2#2, DoubleType) + CAST($, 
DoubleType)), DoubleType) + CAST(col3#3, DoubleType)) else $ AS c0#4]

 Subquery my_table
  SparkLogicalPlan (ExistingRdd [DB#0,col1#1,col2#2,col3#3], 
MappedRDD[97] at getCallSite at DStream.scala:294)


Does Spark SQL not support nested IF queries or is my query incorrect?

Thanks,
RK





​


Re: SparkSQL support for reading Avro files

2015-01-08 Thread Cheng Lian

This package is moved here: https://github.com/databricks/spark-avro

On 1/6/15 5:12 AM, yanenli2 wrote:

Hi All,

I want to use the SparkSQL to manipulate the data with Avro format. I
found a solution at https://github.com/marmbrus/sql-avro . However it
doesn't compile successfully anymore with the latest code of Spark
version 1.2.0 or 1.2.1.


I then try to pull a copy from github stated at


http://mail-archives.apache.org/mod_mbox/spark-reviews/201409.mbox/%3cgit-pr-2475-sp...@git.apache.org%3E


by command:


git checkout 47d542cc0238fba04b6c4e4456393d812d559c4e


But it failed to pull this commit.

What can I do to make it work? Any plan for adding the Avro support in
SparkSQL?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-support-for-reading-Avro-files-tp20981.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: example insert statement in Spark SQL

2015-01-08 Thread Cheng Lian
Spark SQL supports Hive insertion statement (Hive 0.14.0 style insertion 
is not supported though) 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries


The small SQL dialect provided in Spark SQL doesn't support insertion yet.
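
For example, a minimal sketch using HiveContext (table and column names are illustrative
and assumed to already exist in the Hive metastore):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Hive-style insertion is accepted by HiveContext; the simple SQLContext dialect would reject it.
hiveContext.sql("INSERT INTO TABLE target_table SELECT id, name FROM source_table")

// Or replace the table's contents:
hiveContext.sql("INSERT OVERWRITE TABLE target_table SELECT id, name FROM source_table")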

On 1/7/15 5:13 PM, Niranda Perera wrote:

Hi,

Are insert statements supported in Spark? if so, can you please give 
me an example?


Rgds

--
Niranda



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Weird, which version did you use? Just tried a small snippet in Spark 
1.2.0 shell as follows, the result showed in the web UI meets the 
expectation quite well:


import org.apache.spark.sql.SQLContext
import sc._

val sqlContext = new SQLContext(sc)
import sqlContext._

jsonFile("file:///tmp/p.json").registerTempTable("tbl")
cacheTable("tbl")
sql("select * from tbl").collect()
sql("select name from tbl").collect()

The input data of the first statement is 292KB, the second is 49.1KB.

The JSON file I used is examples/src/main/resources/people.json, I 
copied its contents multiple times to generate a larger file.


Cheng

On 1/8/15 7:43 PM, Xuelin Cao wrote:




Hi, Cheng

 I checked the Input data for each stage. For example, in my 
attached screen snapshot, the input data is 1212.5MB, which is the 
total amount of the whole table


Inline image 1

 And, I also check the input data for each task (in the stage 
detail page). And the sum of the input data for each task is also 1212.5MB





On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian wrote:


Hey Xuelin, which data item in the Web UI did you check?


On 1/7/15 5:37 PM, Xuelin Cao wrote:


Hi,

  Curious and curious. I'm puzzled by the Spark SQL cached table.

  Theoretically, the cached table should be columnar table,
and only scan the column that included in my SQL.

  However, in my test, I always see the whole table is
scanned even though I only "select" one column in my SQL.

  Here is my code:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
sqlContext.cacheTable("adTable")  // The table has > 10 columns

// First run, cache the table into memory
sqlContext.sql("select * from adTable").collect

// Second run, only one column is used. It should only scan a small fraction of data
sqlContext.sql("select adId from adTable").collect
sqlContext.sql("select adId from adTable").collect
sqlContext.sql("select adId from adTable").collect

What I found is, every time I run the SQL, in WEB UI, it
shows the total amount of input data is always the same --- the
total amount of the table.

Is anything wrong? My expectation is:
1. The cached table is stored as columnar table
2. Since I only need one column in my SQL, the total
amount of input data showed in WEB UI should be very small

But what I found is totally not the case. Why?

Thanks





​


Eclipse flags error on KafkaUtils.createStream()

2015-01-08 Thread kc66
Hi,
I am using Eclipse writing Java code.

I am trying to create a Kafka receiver by:
JavaPairReceiverInputDStream<String, Message> a =
KafkaUtils.createStream(jssc, String.class, Message.class,
StringDecoder.class, DefaultDecoder.class,
kafkaParams,
topics, StorageLevel.MEMORY_AND_DISK());

where jssc, kafkaParams, and topics are all properly defined.

I am getting flagged by Eclipse with the following messages:
"The type scala.reflect.ClassTag cannot be resolved. It is indirectly
referenced from required .class files"

I don't know Scala and it seems that scala.reflect.ClassTag is an unusual
class which can not be imported simply by using an "import" statement.

I have
import scala.reflect.*;
but it doesn't help.

In my pom.xml I have:

<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-reflect</artifactId>
  <version>2.11.4</version>
</dependency>

That doesn't help either.

Is Eclipse flagging a real problem?

A solution suggested by Eclipse is to edit the Java Build Path using the UI
below. However, I have no idea what to do.


 

I would rather use the API below that doesn't require passing in of
StringDecoder and DefaultDecoder below. However, the contents of my Kafka
messages are not Strings. Is there any way to use this API
with non-String Kafka messages?

public static JavaPairReceiverInputDStream<String, String>
createStream(JavaStreamingContext jssc,
             String zkQuorum,
             String groupId,
             java.util.Map<String, Integer> topics,
             StorageLevel storageLevel)


Thanks!!
KC





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Eclipse-flags-error-on-KafkaUtils-createStream-tp21032.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Executing Spark, Error creating path from empty String.

2015-01-08 Thread Guillermo Ortiz
When I try to execute my task with Spark it starts to copy the jars it
needs to HDFS and it finally fails, I don't know exactly why. I have
checked HDFS and it copies the files, so, it seems to work that part.
I changed the log level to debug but there's nothing else to help.
What else does Spark need to copy that it could be an empty string?

Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/08 14:06:32 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable
15/01/08 14:06:32 INFO RMProxy: Connecting to ResourceManager at
vmlbyarnl01.lvtc.gsnet.corp/180.133.240.174:8050
15/01/08 14:06:33 INFO Client: Got cluster metric info from
ResourceManager, number of NodeManagers: 3
15/01/08 14:06:33 INFO Client: Max mem capabililty of a single
resource in this cluster 97280
15/01/08 14:06:33 INFO Client: Preparing Local resources
15/01/08 14:06:34 WARN BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
15/01/08 14:06:34 INFO Client: Uploading
file:/home/spark-1.1.1-bin-hadoop2.4/lib/spark-assembly-1.1.1-hadoop2.4.0.jar
to 
hdfs://vmlbnanodl01.lvtc.gsnet.corp:8020/user/hdfs/.sparkStaging/application_1417607109980_0017/spark-assembly-1.1.1-hadoop2.4.0.jar
15/01/08 14:06:42 INFO Client: Uploading
file:/user/local/etc/lib/my-spark-streaming-scala.jar to
hdfs://vmlbnanodl01.lvtc.gsnet.corp:8020/user/hdfs/.sparkStaging/application_1417607109980_0017/my-spark-streaming-scala.jar
Exception in thread "main" java.lang.IllegalArgumentException: Can not
create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
at org.apache.hadoop.fs.Path.(Path.java:135)
at org.apache.hadoop.fs.Path.(Path.java:94)
at 
org.apache.spark.deploy.yarn.ClientBase$class.copyRemoteFile(ClientBase.scala:159)
at org.apache.spark.deploy.yarn.Client.copyRemoteFile(Client.scala:37)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5$$anonfun$apply$2.apply(ClientBase.scala:236)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5$$anonfun$apply$2.apply(ClientBase.scala:231)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:231)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:229)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:229)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:37)
at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:96)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:176)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Several applications share the same Spark executors (or their cache)

2015-01-08 Thread Silvio Fiorito
Rather than having duplicate Spark apps and the web app having a direct 
reference to the SparkContext, why not use a queue or message bus to submit 
your requests? This way you're not wasting resources caching the same data in 
Spark and you can scale your web tier independently of the Spark tier.

From: preeze
Sent: ‎1/‎8/‎2015 5:59 AM
To: user@spark.apache.org
Subject: Several applications share the same Spark executors (or their cache)

Hi all,

We have a web application that connects to a Spark cluster to trigger some
calculation there. It also caches a large amount of data in the Spark executors'
cache.

To meet high availability requirements we need to run 2 instances of our web
application on different hosts. Doing this straightforward will mean that
the second application fires another set of executors that will initialize
their own huge cache totally identical to that for the first application.

Ideally we would like to reuse the cache in Spark for the needs of all
instances of our applications.

I am aware of the possibility to use Tachyon to externalize executors'
cache. Currently exploring other options.

Is there any way to allow several instances of the same application to
connect to the same set of Spark executors?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Several-applications-share-the-same-Spark-executors-or-their-cache-tp21031.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Trying to execute Spark in Yarn

2015-01-08 Thread Guillermo Ortiz
thanks!

2015-01-08 12:59 GMT+01:00 Shixiong Zhu :
> `--jars` accepts a comma-separated list of jars. See the usage about
> `--jars`
>
> --jars JARS Comma-separated list of local jars to include on the driver and
> executor classpaths.
>
>
>
> Best Regards,
>
> Shixiong Zhu
>
> 2015-01-08 19:23 GMT+08:00 Guillermo Ortiz :
>>
>> I'm trying to execute Spark from a Hadoop Cluster, I have created this
>> script to try it:
>>
>> #!/bin/bash
>>
>> export HADOOP_CONF_DIR=/etc/hadoop/conf
>> SPARK_CLASSPATH=""
>> for lib in `ls /user/local/etc/lib/*.jar`
>> do
>> SPARK_CLASSPATH=$SPARK_CLASSPATH:$lib
>> done
>> /home/spark-1.1.1-bin-hadoop2.4/bin/spark-submit --name "Streaming"
>> --master yarn-cluster --class com.sparkstreaming.Executor --jars
>> $SPARK_CLASSPATH --executor-memory 10g
>> /user/local/etc/lib/my-spark-streaming-scala.jar
>>
>> When I execute the script I get this error:
>>
>> Spark assembly has been built with Hive, including Datanucleus jars on
>> classpath
>> Exception in thread "main" java.net.URISyntaxException: Expected
>> scheme name at index 0:
>>
>> :/user/local/etc/lib/akka-actor_2.10-2.2.3-shaded-protobuf.jar:/user/local/etc/lib/akka-remote_2.10-..
>> 
>> 
>>
>> -maths-1.2.2a.jar:/user/local/etc/lib/xmlenc-0.52.jar:/user/local/etc/lib/zkclient-0.3.jar:/user/local/etc/lib/zookeeper-3.4.5.jar
>> at java.net.URI$Parser.fail(URI.java:2829)
>> at java.net.URI$Parser.failExpecting(URI.java:2835)
>> at java.net.URI$Parser.parse(URI.java:3027)
>> at java.net.URI.(URI.java:595)
>> at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1396)
>> at
>> org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1419)
>> at
>> org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1419)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at
>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>> at org.apache.spark.util.Utils$.resolveURIs(Utils.scala:1419)
>> at
>> org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:308)
>> at
>> org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:221)
>> at
>> org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:65)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:70)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>>
>> Why do I get this error? I have no idea. Any clue?
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Trying to execute Spark in Yarn

2015-01-08 Thread Shixiong Zhu
`--jars` accepts a comma-separated list of jars. See the usage about
`--jars`

--jars JARS Comma-separated list of local jars to include on the driver and
executor classpaths.



Best Regards,
Shixiong Zhu

2015-01-08 19:23 GMT+08:00 Guillermo Ortiz :

> I'm trying to execute Spark from a Hadoop Cluster, I have created this
> script to try it:
>
> #!/bin/bash
>
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> SPARK_CLASSPATH=""
> for lib in `ls /user/local/etc/lib/*.jar`
> do
> SPARK_CLASSPATH=$SPARK_CLASSPATH:$lib
> done
> /home/spark-1.1.1-bin-hadoop2.4/bin/spark-submit --name "Streaming"
> --master yarn-cluster --class com.sparkstreaming.Executor --jars
> $SPARK_CLASSPATH --executor-memory 10g
> /user/local/etc/lib/my-spark-streaming-scala.jar
>
> When I execute the script I get this error:
>
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
> Exception in thread "main" java.net.URISyntaxException: Expected
> scheme name at index 0:
>
> :/user/local/etc/lib/akka-actor_2.10-2.2.3-shaded-protobuf.jar:/user/local/etc/lib/akka-remote_2.10-..
> 
> 
>
> -maths-1.2.2a.jar:/user/local/etc/lib/xmlenc-0.52.jar:/user/local/etc/lib/zkclient-0.3.jar:/user/local/etc/lib/zookeeper-3.4.5.jar
> at java.net.URI$Parser.fail(URI.java:2829)
> at java.net.URI$Parser.failExpecting(URI.java:2835)
> at java.net.URI$Parser.parse(URI.java:3027)
> at java.net.URI.(URI.java:595)
> at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1396)
> at
> org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1419)
> at
> org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1419)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.util.Utils$.resolveURIs(Utils.scala:1419)
> at
> org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:308)
> at
> org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:221)
> at
> org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:65)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:70)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
> Why do I get this error? I have no idea. Any clue?
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
Hi, Cheng

 I checked the Input data for each stage. For example, in my attached
screen snapshot, the input data is 1212.5MB, which is the total amount of
the whole table

[image: Inline image 1]

 And, I also check the input data for each task (in the stage detail
page). And the sum of the input data for each task is also 1212.5MB




On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian  wrote:

>  Hey Xuelin, which data item in the Web UI did you check?
>
>
> On 1/7/15 5:37 PM, Xuelin Cao wrote:
>
>
>  Hi,
>
>Curious and curious. I'm puzzled by the Spark SQL cached table.
>
>Theoretically, the cached table should be columnar table, and only
> scan the column that included in my SQL.
>
>However, in my test, I always see the whole table is scanned even
> though I only "select" one column in my SQL.
>
>Here is my code:
>
>
> *val sqlContext = new org.apache.spark.sql.SQLContext(sc) *
>
> *import sqlContext._ *
>
> *sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable") *
> *sqlContext.cacheTable("adTable")  //The table has > 10 columns*
>
>  *//First run, cache the table into memory*
>  *sqlContext.sql("select * from adTable").collect*
>
>  *//Second run, only one column is used. It should only scan a small
> fraction of data*
>  *sqlContext.sql("select adId from adTable").collect *
>
> *sqlContext.sql("select adId from adTable").collect *
> *sqlContext.sql("select adId from adTable").collect*
>
>  What I found is, every time I run the SQL, in WEB UI, it shows
> the total amount of input data is always the same --- the total amount of
> the table.
>
>  Is anything wrong? My expectation is:
> 1. The cached table is stored as columnar table
> 2. Since I only need one column in my SQL, the total amount of
> input data showed in WEB UI should be very small
>
>  But what I found is totally not the case. Why?
>
>  Thanks
>
>
>


SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Mukesh Jha
Hi Experts,

I am running spark inside YARN job.

The spark-streaming job is running fine in CDH-5.0.0 but after the upgrade
to 5.3.0 it cannot fetch containers with the below errors. Looks like the
container id is incorrect and a string is present in a place where it's
expecting a number.



java.lang.IllegalArgumentException: Invalid ContainerId:
container_e01_1420481081140_0006_01_01

Caused by: java.lang.NumberFormatException: For input string: "e01"



Is this a bug?? Did you face something similar and any ideas how to fix
this?



15/01/08 09:50:28 INFO yarn.ApplicationMaster: Registered signal handlers
for [TERM, HUP, INT]

15/01/08 09:50:29 ERROR yarn.ApplicationMaster: Uncaught exception:

java.lang.IllegalArgumentException: Invalid ContainerId:
container_e01_1420481081140_0006_01_01

at
org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)

at
org.apache.spark.deploy.yarn.YarnRMClientImpl.getAttemptId(YarnRMClientImpl.scala:79)

at
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:79)

at
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:515)

at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)

at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)

at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)

at
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:513)

at
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)

Caused by: java.lang.NumberFormatException: For input string: "e01"

at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)

at java.lang.Long.parseLong(Long.java:441)

at java.lang.Long.parseLong(Long.java:483)

at
org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)

at
org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)

... 11 more

15/01/08 09:50:29 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 10, (reason: Uncaught exception: Invalid ContainerId:
container_e01_1420481081140_0006_01_01)

-- 
Thanks & Regards,

*Mukesh Jha *


Trying to execute Spark in Yarn

2015-01-08 Thread Guillermo Ortiz
I'm trying to execute Spark from a Hadoop Cluster, I have created this
script to try it:

#!/bin/bash

export HADOOP_CONF_DIR=/etc/hadoop/conf
SPARK_CLASSPATH=""
for lib in `ls /user/local/etc/lib/*.jar`
do
SPARK_CLASSPATH=$SPARK_CLASSPATH:$lib
done
/home/spark-1.1.1-bin-hadoop2.4/bin/spark-submit --name "Streaming"
--master yarn-cluster --class com.sparkstreaming.Executor --jars
$SPARK_CLASSPATH --executor-memory 10g
/user/local/etc/lib/my-spark-streaming-scala.jar

When I execute the script I get this error:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.net.URISyntaxException: Expected
scheme name at index 0:
:/user/local/etc/lib/akka-actor_2.10-2.2.3-shaded-protobuf.jar:/user/local/etc/lib/akka-remote_2.10-..


-maths-1.2.2a.jar:/user/local/etc/lib/xmlenc-0.52.jar:/user/local/etc/lib/zkclient-0.3.jar:/user/local/etc/lib/zookeeper-3.4.5.jar
at java.net.URI$Parser.fail(URI.java:2829)
at java.net.URI$Parser.failExpecting(URI.java:2835)
at java.net.URI$Parser.parse(URI.java:3027)
at java.net.URI.(URI.java:595)
at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1396)
at 
org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1419)
at 
org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1419)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.util.Utils$.resolveURIs(Utils.scala:1419)
at 
org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:308)
at 
org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:221)
at 
org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:65)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:70)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



Why do I get this error? I have no idea. Any clue?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Several applications share the same Spark executors (or their cache)

2015-01-08 Thread preeze
Hi all,

We have a web application that connects to a Spark cluster to trigger some
calculation there. It also caches a large amount of data in the Spark executors'
cache.

To meet high availability requirements we need to run 2 instances of our web
application on different hosts. Doing this straightforward will mean that
the second application fires another set of executors that will initialize
their own huge cache totally identical to that for the first application.

Ideally we would like to reuse the cache in Spark for the needs of all
instances of our applications.

I am aware of the possibility to use Tachyon to externalize executors'
cache. Currently exploring other options.

Is there any way to allow several instances of the same application to
connect to the same set of Spark executors?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Several-applications-share-the-same-Spark-executors-or-their-cache-tp21031.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.2.0 ec2 launch script hadoop native libraries not found warning

2015-01-08 Thread critikaled
Hi,
I'm facing this issue on a Spark EC2 cluster: when a job is submitted, it says
that the native Hadoop libraries are not found. I have checked spark-env.sh and
all the folders on the path, but I am unable to find the problem even though the
folders do contain the libraries. Are there any performance drawbacks if we use
the built-in (pure-Java) implementations? Is anybody else facing this problem?
BTW, I'm using spark-1.2.0,
hadoop major version = 2, scala version = 2.10.4.
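
For what it's worth, the "unable to load native-hadoop library" warning usually only
affects the native compression codecs; the JVM falls back to pure-Java implementations,
which can be slower for compression-heavy jobs but is otherwise harmless. One possible
fix is to point the JVM at the Hadoop native library directory in conf/spark-env.sh on
every node. The sketch below assumes the libraries live under $HADOOP_HOME/lib/native
and that HADOOP_HOME on a spark-ec2 cluster is /root/ephemeral-hdfs; both are
assumptions, so verify the paths first:

# conf/spark-env.sh (sketch; verify the paths on your cluster)
export HADOOP_HOME=${HADOOP_HOME:-/root/ephemeral-hdfs}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
# restart the workers afterwards so new executors pick this up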




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-0-ec2-launch-script-hadoop-native-libraries-not-found-warning-tp21030.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian

Hey Xuelin, which data item in the Web UI did you check?

On 1/7/15 5:37 PM, Xuelin Cao wrote:


Hi,

Curious and curious. I'm puzzled by the Spark SQL cached table.

Theoretically, the cached table should be stored as a columnar table, and a
query should only scan the columns that are included in my SQL.


However, in my test, I always see the whole table being scanned even though
I only "select" one column in my SQL.


  Here is my code:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
sqlContext.cacheTable("adTable")  // The table has > 10 columns

// First run, cache the table into memory
sqlContext.sql("select * from adTable").collect

// Second run, only one column is used. It should only scan a small
// fraction of the data
sqlContext.sql("select adId from adTable").collect
sqlContext.sql("select adId from adTable").collect
sqlContext.sql("select adId from adTable").collect

What I found is that, every time I run the SQL, the Web UI shows that the
amount of input data is always the same --- the total size of the table.


Is anything wrong? My expectation is:
1. The cached table is stored as a columnar table
2. Since I only need one column in my SQL, the amount of input data shown
in the Web UI should be very small


But what I found is totally not the case. Why?

Thanks





Re: Spark Standalone Cluster not correctly configured

2015-01-08 Thread frodo777
Hello everyone.

With respect to the configuration problem that I explained before:

Do you have any idea what is wrong there?

The problem in a nutshell:
- When more than one master is started in the cluster, all of them are
scheduling independently, thinking they are all leaders.
- zookeeper configuration seems to be correct, only one leader is reported.
The remaining master nodes are followers.
- Default /spark directory is used for zookeeper.

Thanks a lot.
-Bob
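
One thing worth double-checking: every master must actually be started with the
ZooKeeper recovery mode enabled, otherwise each one falls back to single-master
behaviour and schedules on its own, regardless of which node ZooKeeper has elected as
leader. A sketch of the relevant settings in conf/spark-env.sh on each master node
(the zk1..zk3 host names are placeholders):

# conf/spark-env.sh on every master (host names are placeholders)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
# workers and drivers should then point at all masters, e.g.
# spark://master1:7077,master2:7077,master3:7077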



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Standalone-Cluster-not-correctly-configured-tp20909p21029.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-08 Thread Aniket Bhatnagar
Actually, it does cause builds with SBT 0.13.7 to fail with the error
"Conflicting cross-version suffixes". I have raised a defect, SPARK-5143, for
this.

On Wed Jan 07 2015 at 23:44:21 Marcelo Vanzin  wrote:

> This particular case shouldn't cause problems since both of those
> libraries are java-only (the scala version appended there is just for
> helping the build scripts).
>
> But it does look weird, so it would be nice to fix it.
>
> On Wed, Jan 7, 2015 at 12:25 AM, Aniket Bhatnagar
>  wrote:
> > It seems that spark-network-yarn compiled for scala 2.11 depends on
> > spark-network-shuffle compiled for scala 2.10. This causes cross version
> > dependencies conflicts in sbt. Seems like a publishing error?
> >
> > http://www.uploady.com/#!/download/6Yn95UZA0DR/3taAJFjCJjrsSXOR
>
>
>
> --
> Marcelo
>


RE: Spark History Server can't read event logs

2015-01-08 Thread michael.england
Hi Vanzin,

I am using the MapR distribution of Hadoop. The history server logs are created 
by a job with the permissions:

drwxrwx---   - 2 2015-01-08 09:14 
/apps/spark/historyserver/logs/spark-1420708455212

However, the higher-level directories are owned by mapr:mapr, and the user that runs 
Spark in our case is a unix ID called mapr (in the mapr group). Therefore, the history 
server can't read my job's event logs, as shown above.


Thanks,
Michael


-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com] 
Sent: 07 January 2015 18:10
To: England, Michael (IT/UK)
Cc: user@spark.apache.org
Subject: Re: Spark History Server can't read event logs

The Spark code generates the log directory with "770" permissions. On top of 
that you need to make sure of two things:

- all directories up to /apps/spark/historyserver/logs/ are readable by the 
user running the history server
- the user running the history server belongs to the group that owns 
/apps/spark/historyserver/logs/

I think the code could be more explicit about setting the group of the 
generated log directories and files, but if you follow the two rules above 
things should work. Also, I recommend setting /apps/spark/historyserver/logs/ 
itself to "1777" so that any user can generate logs, but only the owner (or a 
superuser) can delete them.
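
In concrete terms, on HDFS (or through the MapR equivalents) the above roughly
translates into the commands below; the log root is taken from the paths earlier in
this thread and the group name is an assumption, so adjust both to your environment:

# sticky, world-writable log root: anyone can write logs, only owners (or a superuser) can delete
hadoop fs -chmod 1777 /apps/spark/historyserver/logs
# make the parent directories traversable by the user running the history server
hadoop fs -chmod o+rx /apps/spark /apps/spark/historyserver
# and/or hand the log tree to a group that the history server user belongs to
hadoop fs -chgrp -R mapr /apps/spark/historyserver/logs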



On Wed, Jan 7, 2015 at 7:45 AM,   wrote:
> Hi,
>
>
>
> When I run jobs and save the event logs, they are saved with the 
> permissions of the unix user and group that ran the spark job. The 
> history server is run as a service account and therefore can’t read the files:
>
>
>
> Extract from the History server logs:
>
>
>
> 2015-01-07 15:37:24,3021 ERROR Client 
> fs/client/fileclient/cc/client.cc:1009
> Thread: 1183 User does not have access to open file
> /apps/spark/historyserver/logs/spark-1420644521194
>
> 15/01/07 15:37:24 ERROR ReplayListenerBus: Exception in parsing Spark 
> event log 
> /apps/spark/historyserver/logs/spark-1420644521194/EVENT_LOG_1
>
> org.apache.hadoop.security.AccessControlException: Open failed for file:
> /apps/spark/historyserver/logs/spark-1420644521194/EVENT_LOG_1, error:
> Permission denied (13)
>
>
>
> Is there a setting which I can change that allows the files to be 
> world readable or at least by the account running the history server? 
> Currently, the job appears in the History Sever UI but only states ‘ Started>’.
>
>
>
> Thanks,
>
> Michael



--
Marcelo





Re: Join stucks in the last stage step

2015-01-08 Thread paja
Just to demonstrate the BIG difference between an ordinary task (index 450) and the
last remaining task (index 0):

Index  ID    Attempt  Status   Launch Time          Duration  Shuffle Read  Shuffle Spill (Memory)  Shuffle Spill (Disk)
0      2413  0        RUNNING  2015/01/08 08:03:19  1.5 h     10.9 GB       176.9 GB                4.4 GB
450    2863  0        SUCCESS  2015/01/08 08:06:55  1.1 min   42.7 MB       968.7 MB                32.9 MB



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Join-stucks-in-the-last-stage-step-tp21018p21028.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org