Re: What's the advantage features of Spark SQL(JDBC)

2015-05-15 Thread Yi Zhang
@Hao, As you said, there is no advantage feature for JDBC; it just provides a unified API to support different data sources. Is that right? On Friday, May 15, 2015 2:46 PM, Cheng, Hao hao.ch...@intel.com wrote:

Re: Spark on Mesos vs Yarn

2015-05-15 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi Tim, Thanks for such a detailed email. I am excited to hear about the new features, I had a pull request going for adding attribute based filtering in the mesos scheduler but it hasn't received much love - https://github.com/apache/spark/pull/5563

Re: Spark performance in cluster mode using yarn

2015-05-15 Thread Sachin Singh
Hi Ayan, I am asking about general scenarios for the given info/configuration, from experts, not anything specific. The Java code is nothing but getting a Hive context and running a select query; there is no serialization or any other complex thing in it, just straightforward, about 10 lines of code. Group, please suggest if you have any ideas. Regards

RE: What's the advantage features of Spark SQL(JDBC)

2015-05-15 Thread Cheng, Hao
Spark SQL just takes JDBC as a new data source, the same as we need to support loading data from a .csv or .json file. From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 2:30 PM To: User Subject: What's the advantage features of Spark SQL(JDBC) Hi All, Comparing

Re: how to delete data from table in sparksql

2015-05-15 Thread luohui20001
got it, thank you. Thanks & Best regards! San.Luo - Original Message - From: Michael Armbrust mich...@databricks.com To: Denny Lee denny.g@gmail.com Cc: 罗辉 luohui20...@sina.com, user user@spark.apache.org Subject: Re: how to delete data from table in sparksql

Re: Spark on Mesos vs Yarn

2015-05-15 Thread Tim Chen
Hi Ankur, This is a great question, as I've heard similar concerns about Spark on Mesos. At the time I started to contribute to Spark on Mesos, approximately half a year ago, the Mesos scheduler and related code hadn't really gotten much attention from anyone and were pretty much in maintenance mode.

Re: What's the advantage features of Spark SQL(JDBC)

2015-05-15 Thread Yi Zhang
OK. Thanks. On Friday, May 15, 2015 3:35 PM, Cheng, Hao hao.ch...@intel.com wrote:

Why association with remote system has failed when set master in Spark programmatically

2015-05-15 Thread Yi Zhang
Hi all, I run start-master.sh to start standalone Spark with spark://192.168.1.164:7077. Then I use this command as below, and it's OK: ./bin/spark-shell --master spark://192.168.1.164:7077 The console prints the correct message, and the Spark context has been initialised correctly. However, when I run

RE: question about sparksql caching

2015-05-15 Thread Cheng, Hao
You probably can try something like: val df = sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key=T2.key group by c1") df.cache() // Cache the result, but it's a lazy execution. df.registerAsTempTable("my_result") sqlContext.sql("select * from my_result where c1=1").collect // the cache
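
For reference, a minimal sketch of the complete pattern Hao is describing, with the quoting restored (table and column names are the illustrative ones from the reply; registerTempTable is the Spark 1.3 name for the older registerAsTempTable):

```scala
val df = sqlContext.sql(
  "SELECT c1, sum(c2) FROM T1 JOIN T2 ON T1.key = T2.key GROUP BY c1")
df.cache()                          // lazy: nothing is materialized yet
df.registerTempTable("my_result")
// the first action populates the cache; later queries against my_result reuse it
sqlContext.sql("SELECT * FROM my_result WHERE c1 = 1").collect()
sqlContext.sql("SELECT * FROM my_result WHERE c1 = 2").collect()
```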

RE: What's the advantage features of Spark SQL(JDBC)

2015-05-15 Thread Cheng, Hao
Yes. From: Yi Zhang [mailto:zhangy...@yahoo.com] Sent: Friday, May 15, 2015 2:51 PM To: Cheng, Hao; User Subject: Re: What's the advantage features of Spark SQL(JDBC) @Hao, As you said, there is no advantage feature for JDBC, it just provides unified api to support different data sources. Is it

What's the advantage features of Spark SQL(JDBC)

2015-05-15 Thread Yi Zhang
Hi All, Compared with direct access via JDBC, what are the advantage features of Spark SQL (JDBC) for accessing an external data source? Any tips are welcome! Thanks. Regards, Yi

Spark on Mesos vs Yarn

2015-05-15 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, This is both a survey-type as well as a roadmap query question. It seems like, of the cluster options to run Spark (i.e. YARN and Mesos), YARN is getting a lot more attention and patches compared to Mesos. Would it be correct to

Re: Spark on Mesos vs Yarn

2015-05-15 Thread Iulian Dragoș
Hi Ankur, Just to add a thought to Tim's excellent answer, Spark on Mesos is very important to us and is the recommended deployment for our customers at Typesafe. Thanks for pointing to your PR, I see Tim already went through a round of reviews. It seems very useful, I'll give it a try as well.

Re: Worker Spark Port

2015-05-15 Thread James King
So I'm using code like this to use specific ports: val conf = new SparkConf() .setMaster(master) .setAppName("namexxx") .set("spark.driver.port", "51810") .set("spark.fileserver.port", "51811") .set("spark.broadcast.port", "51812") .set("spark.replClassServer.port", "51813")

Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread Nisrina Luthfiyati
Hi all, I have a stream of data from Kafka that I want to process and store in hdfs using Spark Streaming. Each data has a date/time dimension and I want to write data within the same time dimension to the same hdfs directory. The data stream might be unordered (by time dimension). I'm wondering
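
One hedged way to approach this in Scala, assuming a DStream of (epoch-millis, record) pairs; the paths and date format are illustrative. Each batch writes to a unique per-day subdirectory, so late-arriving data still lands under the right date without overwriting earlier batches:

```scala
import java.text.SimpleDateFormat
import java.util.Date

// stream: DStream[(Long, String)] of (timestamp, record)
stream.foreachRDD { (rdd, batchTime) =>
  val byDay = rdd.map { case (ts, rec) =>
    // build the formatter inside the task; SimpleDateFormat is not thread-safe
    (new SimpleDateFormat("yyyy-MM-dd").format(new Date(ts)), rec)
  }
  byDay.cache()
  for (day <- byDay.keys.distinct().collect()) {
    byDay.filter(_._1 == day).values
      .saveAsTextFile(s"hdfs:///data/dt=$day/batch-${batchTime.milliseconds}")
  }
  byDay.unpersist()
}
```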

Forbidded : Error Code: 403

2015-05-15 Thread Mohammad Tariq
Hello list, Scenario: I am trying to read an Avro file stored in S3 and create a DataFrame out of it using the Spark-Avro library (https://github.com/databricks/spark-avro), but am unable to do so. This is the code which I am using: public class S3DataFrame { public static void main(String[] args)

Re: kafka + Spark Streaming with checkPointing fails to start with

2015-05-15 Thread Alexander Krasheninnikov
I had the same problem. The solution I found was to use: JavaStreamingContext streamingContext = JavaStreamingContext.getOrCreate("checkpoint_dir", contextFactory); ALL configuration should be performed inside contextFactory. If you try to configure streamingContext after ::getOrCreate, you
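
The same pattern in Scala, as a sketch with illustrative paths: the factory must build the entire DStream graph before returning, because it only runs when no checkpoint exists; on restart the graph is restored from the checkpoint instead.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  // define all sources and transformations here, before returning
  ssc.checkpoint("hdfs:///checkpoint_dir")
  ssc
}

val ssc = StreamingContext.getOrCreate("hdfs:///checkpoint_dir", createContext _)
ssc.start()
ssc.awaitTermination()
```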

Re: store hive metastore on persistent store

2015-05-15 Thread Yana Kadiyska
This should work. Which version of Spark are you using? Here is what I do -- make sure hive-site.xml is in the conf directory of the machine you're using the driver from. Now let's run spark-shell from that machine: scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc) hc:

Re: Custom Aggregate Function for DataFrame

2015-05-15 Thread ayan guha
Can you kindly elaborate on this? It should be possible to write UDAFs along the lines of sum/min etc. On Fri, May 15, 2015 at 5:49 AM, Justin Yip yipjus...@prediction.io wrote: Hello, May I know if there is a way to implement an aggregate function for grouped data in a DataFrame? I dug into the

Re: Worker Spark Port

2015-05-15 Thread ayan guha
Hi, I think you are mixing things a bit. The worker is part of the cluster, so it is governed by the cluster manager. If you are running a standalone cluster, then you can modify spark-env and configure SPARK_WORKER_PORT. Executors, on the other hand, are bound to an application, i.e. a Spark context. Thus

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread ayan guha
Hi, do you have a cut-off time, i.e. how late an event can be? Otherwise, you may consider a different persistent storage like Cassandra/HBase and delegate the update part to them. On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati nisrina.luthfiy...@gmail.com wrote: Hi all, I have a stream of data

Re: Why association with remote system has failed when set master in Spark programmatically

2015-05-15 Thread Yi Zhang
I debugged it, and the remote actor can be fetched in the tryRegisterAllMasters() method in AppClient: def tryRegisterAllMasters() { for (masterAkkaUrl <- masterAkkaUrls) { logInfo("Connecting to master " + masterAkkaUrl + "...") val actor =

Re: Forbidded : Error Code: 403

2015-05-15 Thread Ted Yu
Have you verified that you can download the file from bucket-name without using Spark? Seems like a permission issue. Cheers On May 15, 2015, at 5:09 AM, Mohammad Tariq donta...@gmail.com wrote: Hello list, Scenario: I am trying to read an Avro file stored in S3 and create a

Re: Worker Spark Port

2015-05-15 Thread James King
I think this answers my question: "executors, on the other hand, are bound to an application, i.e. a Spark context. Thus you modify executor properties through a context." Many Thanks. jk On Fri, May 15, 2015 at 3:23 PM, ayan guha guha.a...@gmail.com wrote: Hi I think you are mixing things a

Re: SPARK-4412 regressed?

2015-05-15 Thread Sean Owen
(I made you a Contributor in JIRA -- your yahoo-related account of the two -- so maybe that will let you do so.) On Fri, May 15, 2015 at 4:19 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi, two questions 1. Can regular JIRA users reopen bugs -- I can open a new issue but it does not

SPARK-4412 regressed?

2015-05-15 Thread Yana Kadiyska
Hi, two questions 1. Can regular JIRA users reopen bugs -- I can open a new issue but it does not appear that I can reopen issues. What is the proper protocol to follow if we discover regressions? 2. I believe SPARK-4412 regressed in Spark 1.3.1, according to this SO thread possibly even in

Re: Error communicating with MapOutputTracker

2015-05-15 Thread Thomas Gerber
Hi Imran, Thanks for the advice; tweaking some akka parameters helped. See below. Now, we noticed that we get java heap OOM exceptions on the output tracker when we have too many tasks. I wonder: 1. where does the map output tracker live? The driver? The master (when those are not the

Best practice to avoid ambiguous columns in DataFrame.join

2015-05-15 Thread Justin Yip
Hello, I would like to ask if there are recommended ways of preventing ambiguous columns when joining dataframes. When we join dataframes, it usually happens that we join on columns with identical names. I could rename the columns on the right data frame, as described in the following code. Is

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread N B
Thanks Ilya. Does one have to call broadcast again once the underlying data is updated in order to get the changes visible on all nodes? Thanks NB On Fri, May 15, 2015 at 5:29 PM, Ilya Ganelin ilgan...@gmail.com wrote: The broadcast variable is like a pointer. If the underlying data changes

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Marcelo Vanzin
On Fri, May 15, 2015 at 2:35 PM, Thomas Dudziak tom...@gmail.com wrote: I've just been through this exact case with shaded guava in our Mesos setup and that is how it behaves there (with Spark 1.3.1). If that's the case, it's a bug in the Mesos backend, since the spark.* options should behave

Re: Using groupByKey with Spark SQL

2015-05-15 Thread Michael Armbrust
Perhaps you are looking for GROUP BY and collect_set, which would allow you to stay in SQL. I'll add that in Spark 1.4 you can get access to items of a row by name. On Fri, May 15, 2015 at 10:48 AM, Edward Sargisson ejsa...@gmail.com wrote: Hi all, This might be a question to be answered or
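
A small sketch of that suggestion (table and column names are illustrative): collect_set is a Hive UDAF, so this assumes a HiveContext rather than a plain SQLContext.

```scala
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// one row per user, with the distinct items gathered into an array column
val grouped = hiveContext.sql(
  "SELECT user_id, collect_set(item) AS items FROM events GROUP BY user_id")
grouped.collect().foreach(println)
```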

Re: Best practice to avoid ambiguous columns in DataFrame.join

2015-05-15 Thread Michael Armbrust
There are several ways to solve this ambiguity: 1. Use the DataFrames to get the attribute so it's already resolved and not just a string we need to map to a DataFrame: df.join(df2, df("_1") === df2("_1")) 2. Use aliases: df.as('a).join(df2.as('b), $"a._1" === $"b._1") 3. Rename the columns as you
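
A self-contained version of option 1, assuming two small example frames in the shell that share a _1 key column:

```scala
import sqlContext.implicits._

val df  = Seq((1, "a"), (2, "b")).toDF("_1", "_2")
val df2 = Seq((1, "x"), (3, "y")).toDF("_1", "_3")

// resolving each column through its own DataFrame removes the ambiguity
val joined = df.join(df2, df("_1") === df2("_1"))
joined.show()
```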

Broadcast variables can be rebroadcast?

2015-05-15 Thread NB
Hello, Once a broadcast variable is created using sparkContext.broadcast(), can it ever be updated again? The use case is for something like the underlying lookup data changing over time. Thanks NB

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread Ilya Ganelin
The broadcast variable is like a pointer. If the underlying data changes then the changes will be visible throughout the cluster. On Fri, May 15, 2015 at 5:18 PM NB nb.nos...@gmail.com wrote: Hello, Once a broadcast variable is created using sparkContext.broadcast(), can it ever be updated

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread Ilya Ganelin
Nope. It will just work when you call x.value. On Fri, May 15, 2015 at 5:39 PM N B nb.nos...@gmail.com wrote: Thanks Ilya. Does one have to call broadcast again once the underlying data is updated in order to get the changes visible on all nodes? Thanks NB On Fri, May 15, 2015 at 5:29 PM,

Re: Custom Aggregate Function for DataFrame

2015-05-15 Thread Justin Yip
Hi Ayan, I have a DF constructed from the following case classes: case class State(attr1: String) case class Event(userId: String, time: Long, state: State) I would like to generate a DF which contains the latest state of each userId. I could have first computed the latest
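
One hedged way to get the latest row per userId without a custom aggregate, using only DataFrame operations available at the time: compute the max time per user, then join back (column names follow the case classes above).

```scala
import org.apache.spark.sql.functions.max

val latestTime = df.groupBy("userId").agg(max("time").as("maxTime"))
val latest = df.join(latestTime,
  df("userId") === latestTime("userId") && df("time") === latestTime("maxTime"))
```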

Re: Best practice to avoid ambiguous columns in DataFrame.join

2015-05-15 Thread Justin Yip
Thanks Michael, This is very helpful. I have a follow up question related to NaFunctions. Usually after a left outer join, we get lots of null value and we need to handle them before further processing. I have the following piece of code, the _1 column is duplicated and crashes the .na.fill

Hive Skew flag?

2015-05-15 Thread Denny Lee
Just wondering if we have any timeline on when the hive skew flag will be included within SparkSQL? Thanks! Denny

FetchFailedException and MetadataFetchFailedException

2015-05-15 Thread rok
I am trying to sort a collection of (key, value) pairs (between several hundred million and a few billion) and have recently been getting lots of FetchFailedException errors that seem to originate when one of the executors doesn't find a temporary shuffle file on disk. E.g.:

How to reshape RDD/Spark DataFrame

2015-05-15 Thread macwanjason
Hi all, I am a student trying to learn Spark and I had a question regarding converting rows to columns (data pivot/reshape). I have some data in the following format (either RDD or Spark DataFrame): from pyspark.sql import SQLContext sqlContext = SQLContext(sc) rdd =
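
Spark of this vintage has no built-in pivot, so the usual answer is a manual reshape. A hedged sketch in Scala (the (id, column, value) triples are illustrative), gathering each id's pairs into a map:

```scala
val rdd = sc.parallelize(Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)))
val pivoted = rdd
  .map { case (id, col, v) => (id, (col, v)) }  // key by id
  .groupByKey()
  .mapValues(_.toMap)  // ("a", Map("x" -> 1, "y" -> 2)), ("b", Map("x" -> 3))
pivoted.collect().foreach(println)
```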

RE: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Anton Brazhnyk
For me it wouldn't help, I guess, because those newer classes would still be loaded by a different classloader. What did work for me with 1.3.1 was removing those classes from Spark's jar completely, so they get loaded from external Guava (the version I prefer) and by the classloader I expect.

[spark sql] $ and === can't be recognised in IntelliJ

2015-05-15 Thread Yi.Zhang
Hi all, I wanted to join the data frames based on Spark SQL in IntelliJ, and wrote these code lines as below: df1.as('first).join(df2.as('second), $"first._1" === $"second._1") IntelliJ reported errors for $ and === in red. I found $ and === are defined as implicit conversions in
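
A common fix (not confirmed in this thread) is to bring the SQLContext implicits into scope, which gives IntelliJ the StringToColumn conversion behind $ and lets === resolve as a Column method:

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._  // provides $"..." and the DataFrame conversions

df1.as('first).join(df2.as('second), $"first._1" === $"second._1")
```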

Spark sql and csv data processing question

2015-05-15 Thread Mike Frampton
Hi, I'm getting the following error when trying to process a CSV-based data file: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 4 times, most recent failure: Lost task 1.3 in stage 10.0 (TID 262,

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread ayan guha
Hi, broadcast variables are shipped, the first time they are accessed in a transformation, to the executors used by the transformation. They will NOT be updated subsequently, even if the value has changed. However, a new value will be shipped to any new executor that comes into play after the value has
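
A sketch of the resulting update pattern: rather than mutating a broadcast value, replace it with a fresh broadcast when the underlying data changes (loadLookupTable is a hypothetical loader):

```scala
// loadLookupTable() is hypothetical; it rebuilds the lookup data from its source
var lookup = sc.broadcast(loadLookupTable())

def refreshLookup(): Unit = {
  lookup.unpersist()                        // drop the stale copies on the executors
  lookup = sc.broadcast(loadLookupTable())  // jobs submitted afterwards see the new value
}
```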

RE: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Evo Eftimov
No pools for the moment – for each of the apps, just the straightforward way with the spark conf param for scheduling = FAIR. Spark is running in Standalone Mode. Are you saying that configuring pools is mandatory to get FAIR scheduling working? From the docs it seemed optional to

RE: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Evo Eftimov
Ok, thanks a lot for clarifying that – btw, was your application a Spark Streaming app? I am also looking for confirmation that FAIR scheduling is supported for Spark Streaming apps. From: Richard Marscher [mailto:rmarsc...@localytics.com] Sent: Friday, May 15, 2015 7:20 PM To: Evo Eftimov

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Richard Marscher
The doc is a bit confusing IMO, but at least for my application I had to use a fair pool configuration to get my stages to be scheduled with FAIR. On Fri, May 15, 2015 at 2:13 PM, Evo Eftimov evo.efti...@isecc.com wrote: No pools for the moment – for each of the apps using the straightforward

Re: store hive metastore on persistent store

2015-05-15 Thread Yana Kadiyska
My point was more about how to verify that properties are picked up from the hive-site.xml file. You don't really need hive.metastore.uris if you're not running against an external metastore. I just did an experiment with warehouse.dir. My hive-site.xml looks like this: configuration property
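
A quick, hedged way to check which settings the HiveContext actually picked up, assuming the shell was started on the machine carrying hive-site.xml:

```scala
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
// SET with a key and no value echoes the current setting back as a row
hc.sql("SET hive.metastore.warehouse.dir").collect().foreach(println)
```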

Re: how to use rdd.countApprox

2015-05-15 Thread Du Li
Hi TD, Just to let you know, the job group and cancellation worked after I switched to Spark 1.3.1. I set a group id for rdd.countApprox() and cancel it, then set another group id for the remaining job of the foreachRDD but let it complete. As a by-product, I use the group id to indicate what the job
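
A sketch of the pattern Du describes (group names, timeout, and confidence are illustrative):

```scala
// run the approximate count under its own job group so it can be cancelled early
sc.setJobGroup("approx-count", "approximate count", interruptOnCancel = true)
val approx = rdd.countApprox(timeout = 2000L, confidence = 0.90)
sc.cancelJobGroup("approx-count")

// the remaining work of the foreachRDD runs under a different group and completes
sc.setJobGroup("main-output", "main output job")
```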

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Richard Marscher
It's not a Spark Streaming app, so sorry I'm not sure of the answer to that. I would assume it should work. On Fri, May 15, 2015 at 2:22 PM, Evo Eftimov evo.efti...@isecc.com wrote: Ok thanks a lot for clarifying that – btw was your application a Spark Streaming App – I am also looking for

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Mark Hamstra
If you don't send jobs to different pools, then they will all end up in the default pool. If you leave the intra-pool scheduling policy as the default FIFO, then this will effectively be the same thing as using the default FIFO scheduling. Depending on what you are trying to accomplish, you need
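
A minimal sketch of sending jobs to a named pool (pool name and file path are illustrative; the pools themselves are declared in the XML allocation file):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-app")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// jobs submitted from this thread now target "pool1" rather than the default pool
sc.setLocalProperty("spark.scheduler.pool", "pool1")
```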

Re: store hive metastore on persistent store

2015-05-15 Thread Tamas Jambor
thanks for the reply. I am trying to use it without a Hive setup (spark-standalone), so it prints something like this: hive_ctx.sql("show tables").collect() 15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/05/15

Re: spark log field clarification

2015-05-15 Thread yanwei
anybody shed some light for me?

Re: Forbidded : Error Code: 403

2015-05-15 Thread Steve Loughran
On 15 May 2015, at 21:20, Mohammad Tariq donta...@gmail.com wrote: Thank you Ayan and Ted for the prompt response. It isn't working with s3n either. And I am able to download the file. In fact I am able to read the same file using s3 API without any issue. sounds like an S3n

Re: Forbidded : Error Code: 403

2015-05-15 Thread Mohammad Tariq
Thanks for the suggestion Steve. I'll try that out. Read the long story last night while struggling with this :). I made sure that I don't have any '/' in my key. On Saturday, May 16, 2015, Steve Loughran ste...@hortonworks.com wrote: On 15 May 2015, at 21:20, Mohammad Tariq

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
This is still a problem in 1.3. Optional is both used in several shaded classes within Guava (e.g. the Immutable* classes) and itself uses shaded classes (e.g. AbstractIterator). This causes problems in application code. The only reliable way we've found around this is to shade Guava ourselves for

Re: Problem with current spark

2015-05-15 Thread Shixiong Zhu
Could you provide the full driver log? Looks like a bug. Thank you! Best Regards, Shixiong Zhu 2015-05-13 14:02 GMT-07:00 Giovanni Paolo Gibilisco gibb...@gmail.com: Hi, I'm trying to run an application that uses a Hive context to perform some queries over JSON files. The code of the

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Marcelo Vanzin
On Fri, May 15, 2015 at 11:56 AM, Thomas Dudziak tom...@gmail.com wrote: Actually the extraClassPath settings put the extra jars at the end of the classpath so they won't help. Only the deprecated SPARK_CLASSPATH puts them at the front. That's definitely not the case for YARN:

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
Actually the extraClassPath settings put the extra jars at the end of the classpath so they won't help. Only the deprecated SPARK_CLASSPATH puts them at the front. cheers, Tom On Fri, May 15, 2015 at 11:54 AM, Marcelo Vanzin van...@cloudera.com wrote: Ah, I see. yeah, it sucks that Spark has

Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-15 Thread Puneet Kapoor
Hey, did you find any solution for this issue? We are seeing similar logs in our DataNode logs. Appreciate any help. 2015-05-15 10:51:43,615 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: NttUpgradeDN1:50010:DataXceiver error processing WRITE_BLOCK operation src:

Re: Spark Job execution time

2015-05-15 Thread SamyaMaiti
It does depend on the network IO within your cluster and CPU usage. That said, the difference in run time should not be huge (assuming you are not running any other job in the cluster in parallel).

Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-15 Thread Puneet Kapoor
I am seeing this on Hadoop version 2.4.0. Thanks for your suggestions, I will try those and let you know if they help! On Sat, May 16, 2015 at 1:57 AM, Steve Loughran ste...@hortonworks.com wrote: What version of Hadoop are you seeing this on? On 15 May 2015, at 20:03, Puneet Kapoor

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
I've just been through this exact case with shaded guava in our Mesos setup and that is how it behaves there (with Spark 1.3.1). cheers, Tom On Fri, May 15, 2015 at 12:04 PM, Marcelo Vanzin van...@cloudera.com wrote: On Fri, May 15, 2015 at 11:56 AM, Thomas Dudziak tom...@gmail.com wrote: