@Hao, As you said, there is no particular advantage feature for JDBC; it just provides a unified API to support different data sources. Is that right?
On Friday, May 15, 2015 2:46 PM, Cheng, Hao hao.ch...@intel.com wrote:
Hi Tim,
Thanks for such a detailed email. I am excited to hear about the new features. I had a pull request going for adding attribute-based filtering in the Mesos scheduler, but it hasn't received much love:
https://github.com/apache/spark/pull/5563
Hi Ayan,
I am asking experts about general scenarios for the given info/configuration, not a specific case.
The Java code is nothing more than getting a HiveContext and running a select query;
there is no serialization or anything else complex, just straightforward code, about 10 lines.
Group, please suggest if you have any ideas.
Regards
Spark SQL just treats JDBC as one more data source, the same way it supports loading data from a .csv or .json file.
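For instance (a minimal sketch only -- the JDBC URL, table and file path below are made up, and it assumes the Spark 1.4+ DataFrameReader API with a suitable JDBC driver on the classpath), reading from a database looks just like reading a JSON file:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// JDBC is simply one more data source format...
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")
  .option("dbtable", "orders")
  .load()

// ...exactly like loading a JSON file.
val jsonDF = sqlContext.read.format("json").load("/data/orders.json")

// Either DataFrame can then be registered and queried with Spark SQL.
jdbcDF.registerTempTable("orders")
sqlContext.sql("select count(*) from orders").show()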
From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Friday, May 15, 2015 2:30 PM
To: User
Subject: What's the advantage features of Spark SQL(JDBC)
Hi All,
Comparing
Got it, thank you.
Thanks & best regards!
San.Luo
----- Original Message -----
From: Michael Armbrust mich...@databricks.com
To: Denny Lee denny.g@gmail.com
Cc: 罗辉 luohui20...@sina.com, user user@spark.apache.org
Subject: Re: how to delete data from table in sparksql
Hi Ankur,
This is a great question, as I've heard similar concerns about Spark on Mesos.
At the time I started to contribute to Spark on Mesos, approx. half a year ago, the Mesos scheduler and related code hadn't really received much attention from anyone, and it was pretty much in maintenance mode.
OK. Thanks.
On Friday, May 15, 2015 3:35 PM, Cheng, Hao hao.ch...@intel.com wrote:
Hi all,
I run start-master.sh to start standalone Spark with spark://192.168.1.164:7077. Then I use the command below, and it works:
./bin/spark-shell --master spark://192.168.1.164:7077
The console prints the correct messages, and the Spark context is initialised correctly.
However, when I run
You probably can try something like:
val df = sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key = T2.key group by c1")
df.cache() // Cache the result, but it's a lazy execution.
df.registerTempTable("my_result")
sqlContext.sql("select * from my_result where c1 = 1").collect // the cached data is used by this query
Yes.
From: Yi Zhang [mailto:zhangy...@yahoo.com]
Sent: Friday, May 15, 2015 2:51 PM
To: Cheng, Hao; User
Subject: Re: What's the advantage features of Spark SQL(JDBC)
@Hao,
As you said, there is no particular advantage feature for JDBC; it just provides a unified API to support different data sources. Is it
Hi All,
Compared with direct access via JDBC, what are the advantage features of Spark SQL (JDBC) for accessing an external data source?
Any tips are welcome! Thanks.
Regards, Yi
Hi,
This is both a survey-type as well as a roadmap question. It seems like, of the cluster options to run Spark (i.e. via YARN and Mesos), YARN seems to be getting a lot more attention and patches when compared to Mesos.
Would it be correct to
Hi Ankur,
Just to add a thought to Tim's excellent answer, Spark on Mesos is very important to us and is the recommended deployment for our customers at Typesafe.
Thanks for pointing to your PR, I see Tim already went through a round of
reviews. It seems very useful, I'll give it a try as well.
So I'm using code like this to use specific ports:
val conf = new SparkConf()
  .setMaster(master)
  .setAppName("namexxx")
  .set("spark.driver.port", "51810")
  .set("spark.fileserver.port", "51811")
  .set("spark.broadcast.port", "51812")
  .set("spark.replClassServer.port", "51813")
Hi all,
I have a stream of data from Kafka that I want to process and store in HDFS using Spark Streaming.
Each record has a date/time dimension, and I want to write data within the same time dimension to the same HDFS directory. The data stream might be unordered (by the time dimension).
I'm wondering
Hello list,
Scenario: I am trying to read an Avro file stored in S3 and create a DataFrame out of it using the Spark-Avro library (https://github.com/databricks/spark-avro), but I am unable to do so.
This is the code which I am using:
public class S3DataFrame {
public static void main(String[] args)
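The Java snippet above is cut off in the digest. As a hedged sketch only (the bucket, key and credential handling are placeholders, and it assumes the spark-avro package is on the classpath plus the Spark 1.4+ DataFrameReader API), the same read in Scala might look like:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// spark-avro registers itself under the "com.databricks.spark.avro" data source name.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("s3n://some-bucket/path/to/file.avro")

df.printSchema()
df.show()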
I had the same problem.
The solution I found was to use:
JavaStreamingContext streamingContext =
    JavaStreamingContext.getOrCreate("checkpoint_dir", contextFactory);
ALL configuration should be performed inside contextFactory. If you try to configure the streamingContext after ::getOrCreate, you
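A minimal Scala sketch of the same pattern (the checkpoint directory, batch interval and processing logic are placeholders): every bit of StreamingContext setup lives inside the factory function, which is only invoked when there is no checkpoint to recover from.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/checkpoint_dir"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  // define the DStreams and transformations here
  ssc.checkpoint(checkpointDir)
  ssc
}

// Recovers from the checkpoint if one exists, otherwise calls createContext().
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()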
This should work. Which version of Spark are you using? Here is what I do -- make sure hive-site.xml is in the conf directory of the machine you're using the driver from. Now let's run spark-shell from that machine:
scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc:
Can you kindly elaborate on this? It should be possible to write UDAFs along similar lines to sum/min etc.
On Fri, May 15, 2015 at 5:49 AM, Justin Yip yipjus...@prediction.io wrote:
Hello,
May I know if there is a way to implement an aggregate function for grouped data in a DataFrame? I dug into the
Hi
I think you are mixing things up a bit.
The worker is part of the cluster, so it is governed by the cluster manager. If you are running a standalone cluster, then you can modify spark-env and configure SPARK_WORKER_PORT.
Executors, on the other hand, are bound to an application, i.e. a Spark context. Thus
Hi
Do you have a cut-off time, i.e. how late an event can be? Otherwise, you may consider a different persistent store like Cassandra/HBase and delegate the "update" part to them.
On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati
nisrina.luthfiy...@gmail.com wrote:
Hi all,
I have a stream of data
I debugged it, and the remote actor can be fetched in the tryRegisterAllMasters() method in AppClient:
def tryRegisterAllMasters() {
  for (masterAkkaUrl <- masterAkkaUrls) {
    logInfo("Connecting to master " + masterAkkaUrl + "...")
    val actor =
Have you verified that you can download the file from bucket-name without using Spark?
Seems like a permission issue.
Cheers
On May 15, 2015, at 5:09 AM, Mohammad Tariq donta...@gmail.com wrote:
Hello list,
Scenario : I am trying to read an Avro file stored in S3 and create a
I think this answers my question
Executors, on the other hand, are bound to an application, i.e. a Spark context. Thus you modify executor properties through a context.
Many Thanks.
jk
On Fri, May 15, 2015 at 3:23 PM, ayan guha guha.a...@gmail.com wrote:
Hi
I think you are mixing things a
(I made you a Contributor in JIRA -- your yahoo-related account of the
two -- so maybe that will let you do so.)
On Fri, May 15, 2015 at 4:19 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
Hi, two questions
1. Can regular JIRA users reopen bugs -- I can open a new issue but it does
not
Hi, two questions
1. Can regular JIRA users reopen bugs -- I can open a new issue but it does
not appear that I can reopen issues. What is the proper protocol to follow
if we discover regressions?
2. I believe SPARK-4412 regressed in Spark 1.3.1, according to this SO
thread possibly even in
Hi Imran,
Thanks for the advice; tweaking some Akka parameters helped. See below.
Now we noticed that we get Java heap OOM exceptions on the output tracker when we have too many tasks. I wonder:
1. Where does the map output tracker live? The driver? The master (when those are not the
Hello,
I would like to know if there are recommended ways of preventing ambiguous columns when joining DataFrames. When we join DataFrames, it often happens that we join on columns with identical names. I could have renamed the columns on the right data frame, as described in the following code. Is
Thanks Ilya. Does one have to call broadcast again once the underlying data
is updated in order to get the changes visible on all nodes?
Thanks
NB
On Fri, May 15, 2015 at 5:29 PM, Ilya Ganelin ilgan...@gmail.com wrote:
The broadcast variable is like a pointer. If the underlying data changes
On Fri, May 15, 2015 at 2:35 PM, Thomas Dudziak tom...@gmail.com wrote:
I've just been through this exact case with shaded guava in our Mesos
setup and that is how it behaves there (with Spark 1.3.1).
If that's the case, it's a bug in the Mesos backend, since the spark.*
options should behave
Perhaps you are looking for GROUP BY and collect_set, which would allow you
to stay in SQL. I'll add that in Spark 1.4 you can get access to items of
a row by name.
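As a hedged sketch of that suggestion (the table and column names are made up; collect_set is a Hive UDAF, so a HiveContext is assumed):

val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)

// One row per user, with the set of pages they visited collected into an array.
val grouped = hiveCtx.sql(
  "SELECT user_id, collect_set(page) AS pages FROM visits GROUP BY user_id")

grouped.show()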
On Fri, May 15, 2015 at 10:48 AM, Edward Sargisson ejsa...@gmail.com
wrote:
Hi all,
This might be a question to be answered or
There are several ways to solve this ambiguity:
1. Use the DataFrames to get the attribute, so it's already resolved and not just a string we need to map to a DataFrame:
df.join(df2, df("_1") === df2("_1"))
2. Use aliases:
df.as('a).join(df2.as('b), $"a._1" === $"b._1")
3. Rename the columns, as you suggested.
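As a short sketch of option 3 (column names here are hypothetical), renaming the column on one side before the join leaves no duplicate names in the result:

val df2Renamed = df2.withColumnRenamed("_1", "right_key")
val joined = df.join(df2Renamed, df("_1") === df2Renamed("right_key"))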
Hello,
Once a broadcast variable is created using sparkContext.broadcast(), can it
ever be updated again? The use case is for something like the underlying
lookup data changing over time.
Thanks
NB
The broadcast variable is like a pointer. If the underlying data changes
then the changes will be visible throughout the cluster.
On Fri, May 15, 2015 at 5:18 PM NB nb.nos...@gmail.com wrote:
Hello,
Once a broadcast variable is created using sparkContext.broadcast(), can it
ever be updated
Nope. It will just work when you call x.value.
On Fri, May 15, 2015 at 5:39 PM N B nb.nos...@gmail.com wrote:
Thanks Ilya. Does one have to call broadcast again once the underlying
data is updated in order to get the changes visible on all nodes?
Thanks
NB
On Fri, May 15, 2015 at 5:29 PM,
Hi Ayan,
I have a DF constructed from the following case class Event:
case class State(attr1: String)
case class Event(
  userId: String,
  time: Long,
  state: State
)
I would like to generate a DF which contains the latest state of each userId. I could have first computed the latest
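The message is cut off here; a hedged Scala sketch of the two-step approach it hints at (compute each user's latest timestamp, then join back to recover the full row -- the sample data and column handling are assumptions):

import org.apache.spark.sql.functions.max

val events = sqlContext.createDataFrame(Seq(
  Event("u1", 10L, State("a")),
  Event("u1", 20L, State("b")),
  Event("u2", 15L, State("c"))))

val latest = events.groupBy("userId").agg(max("time").as("maxTime"))

// Join back on (userId, time) to pick up the state of the latest event per user.
val latestState = events
  .join(latest, events("userId") === latest("userId") && events("time") === latest("maxTime"))
  .select(events("userId"), events("time"), events("state"))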
Thanks Michael,
This is very helpful. I have a follow-up question related to NaFunctions.
Usually after a left outer join, we get lots of null values and we need to handle them before further processing. I have the following piece of code; the _1 column is duplicated and crashes the .na.fill
Just wondering if we have any timeline on when the hive skew flag will be
included within SparkSQL?
Thanks!
Denny
I am trying to sort a collection of (key, value) pairs (between several hundred million and a few billion) and have recently been getting lots of FetchFailedException errors that seem to originate when one of the executors cannot find a temporary shuffle file on disk. E.g.:
Hi all,
I am a student trying to learn Spark and I had a question regarding
converting rows to columns (data pivot/reshape). I have some data in the
following format (either RDD or Spark DataFrame):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
rdd =
For me it wouldn't help, I guess, because those newer classes would still be loaded by a different classloader.
What did work for me with 1.3.1 – removing those classes from Spark's jar completely, so they get loaded from the external Guava (the version I prefer) and by the classloader I expect.
Hi all,
I wanted to join the data frames with Spark SQL in IntelliJ, and wrote these code lines as below:
df1.as('first).join(df2.as('second), $"first._1" === $"second._1")
IntelliJ reported errors for $ and === in red.
I found $ and === are defined as implicit conversions in
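A hedged sketch of what usually makes $ and === resolve (df1/df2 stand for the DataFrames above; the key point is that the $"..." syntax comes from the implicits of the concrete SQLContext instance, so that import has to be in scope):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // brings in the $"colName" syntax

val joined = df1.as('first).join(df2.as('second), $"first._1" === $"second._1")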
Hi all,
I run start-master.sh to start standalone Spark with spark://192.168.1.164:7077. Then I use the command below, and it works:
./bin/spark-shell --master spark://192.168.1.164:7077
The console prints the correct messages, and the Spark context is initialised correctly.
However, when I
Hi
I'm getting the following error when trying to process a CSV-based data file.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 4 times, most recent failure: Lost task 1.3 in stage 10.0 (TID 262,
Hi
Broadcast variables are shipped the first time they are accessed in a transformation to the executors used by that transformation. They will NOT be updated subsequently, even if the value has changed. However, a new value will be shipped to any new executor that comes into play after the value has changed.
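A minimal sketch of the pattern this implies (names and values are placeholders): since an existing broadcast value is never refreshed, picking up new lookup data means creating a new broadcast, optionally unpersisting the old one first.

// Initial broadcast of the lookup data.
var lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.map(x => lookup.value.getOrElse(x, 0)).collect()

// Later, when the underlying data changes:
lookup.unpersist()                                            // drop the old copies on the executors
lookup = sc.broadcast(Map("a" -> 10, "b" -> 20, "c" -> 30))   // the new value is shipped on next use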
Hi all,
I wanted to join the data frames with Spark SQL in IntelliJ, and wrote these code lines as below:
df1.as('first).join(df2.as('second), $"first._1" === $"second._1")
IntelliJ reported errors for $ and === in red.
I found $ and === are defined as implicit conversions in
No pools for the moment – for each of the apps I am using the straightforward way, with the Spark conf param for scheduling = FAIR.
Spark is running in standalone mode.
Are you saying that configuring pools is mandatory to get FAIR scheduling working – from the docs it seemed optional to
Ok, thanks a lot for clarifying that – btw, was your application a Spark Streaming app? I am also looking for confirmation that FAIR scheduling is supported for Spark Streaming apps.
From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Friday, May 15, 2015 7:20 PM
To: Evo Eftimov
The doc is a bit confusing IMO, but at least for my application I had to
use a fair pool configuration to get my stages to be scheduled with FAIR.
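A hedged sketch of that combination (the pool name and allocation-file path are placeholders): turn on FAIR mode, point Spark at a fairscheduler.xml that defines the pools, and tag jobs with a pool through a thread-local property.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-app")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

val sc = new SparkContext(conf)

// Jobs submitted from this thread go to the "production" pool defined in the XML file.
sc.setLocalProperty("spark.scheduler.pool", "production")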
On Fri, May 15, 2015 at 2:13 PM, Evo Eftimov evo.efti...@isecc.com wrote:
No pools for the moment – for each of the apps using the straightforward
My point was more about how to verify that properties are picked up from the hive-site.xml file. You don't really need hive.metastore.uris if you're not running against an external metastore. I just did an experiment with warehouse.dir.
My hive-site.xml looks like this:
<configuration>
  <property>
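The rest of the file is cut off above. As a hedged way to check from the shell whether a value from hive-site.xml was actually picked up (the warehouse-dir key is the one from the experiment mentioned above; the expected output is whatever the file sets):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("SET hive.metastore.warehouse.dir").collect().foreach(println)   // should echo the value from hive-site.xml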
Hi TD,
Just to let you know, the job group and cancellation worked after I switched to Spark 1.3.1. I set a group id for rdd.countApprox() and cancel it, then set another group id for the remaining job of the foreachRDD but let it complete.
As a by-product, I use the group id to indicate what the job
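A hedged sketch of that job-group pattern (the group ids, sample RDD and countApprox parameters are placeholders):

sc.setJobGroup("approx-count", "best-effort count, may be cancelled")
val rdd = sc.parallelize(1 to 1000000)
val approx = rdd.countApprox(timeout = 2000, confidence = 0.90)
sc.cancelJobGroup("approx-count")        // cancel the approximate count if it is still running

sc.setJobGroup("main-output", "the rest of the foreachRDD work; let this one complete")
// ... remaining processing runs under the "main-output" group id ...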
It's not a Spark Streaming app, so sorry I'm not sure of the answer to
that. I would assume it should work.
On Fri, May 15, 2015 at 2:22 PM, Evo Eftimov evo.efti...@isecc.com wrote:
Ok thanks a lot for clarifying that – btw was your application a Spark
Streaming App – I am also looking for
If you don't send jobs to different pools, then they will all end up in the
default pool. If you leave the intra-pool scheduling policy as the default
FIFO, then this will effectively be the same thing as using the default
FIFO scheduling.
Depending on what you are trying to accomplish, you need
Thanks for the reply. I am trying to use it without a Hive setup (spark-standalone), so it prints something like this:
hive_ctx.sql("show tables").collect()
15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/05/15
Can anybody shed some light on this for me?
On 15 May 2015, at 21:20, Mohammad Tariq donta...@gmail.com wrote:
Thank you Ayan and Ted for the prompt response. It isn't working with s3n
either.
And I am able to download the file. In fact, I am able to read the same file using the S3 API without any issue.
Sounds like an S3n
Thanks for the suggestion Steve. I'll try that out.
Read the long story last night while struggling with this :). I made sure
that I don't have any '/' in my key.
On Saturday, May 16, 2015, Steve Loughran ste...@hortonworks.com wrote:
On 15 May 2015, at 21:20, Mohammad Tariq
This is still a problem in 1.3. Optional is both used in several shaded
classes within Guava (e.g. the Immutable* classes) and itself uses shaded
classes (e.g. AbstractIterator). This causes problems in application code.
The only reliable way we've found around this is to shade Guava ourselves
for
Could you provide the full driver log? Looks like a bug. Thank you!
Best Regards,
Shixiong Zhu
2015-05-13 14:02 GMT-07:00 Giovanni Paolo Gibilisco gibb...@gmail.com:
Hi,
I'm trying to run an application that uses a Hive context to perform some
queries over JSON files.
The code of the
On Fri, May 15, 2015 at 11:56 AM, Thomas Dudziak tom...@gmail.com wrote:
Actually the extraClassPath settings put the extra jars at the end of the
classpath so they won't help. Only the deprecated SPARK_CLASSPATH puts them
at the front.
That's definitely not the case for YARN:
Actually the extraClassPath settings put the extra jars at the end of the
classpath so they won't help. Only the deprecated SPARK_CLASSPATH puts them
at the front.
cheers,
Tom
On Fri, May 15, 2015 at 11:54 AM, Marcelo Vanzin van...@cloudera.com
wrote:
Ah, I see. Yeah, it sucks that Spark has
Hey,
Did you find any solution for this issue? We are seeing similar logs in our DataNode logs. Appreciate any help.
2015-05-15 10:51:43,615 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
NttUpgradeDN1:50010:DataXceiver error processing WRITE_BLOCK operation
src:
It does depend on the network I/O and CPU usage within your cluster. That said, the difference in run time should not be huge (assuming you are not running any other job in the cluster in parallel).
I am seeing this on Hadoop version 2.4.0.
Thanks for your suggestions, I will try those and let you know if they help!
On Sat, May 16, 2015 at 1:57 AM, Steve Loughran ste...@hortonworks.com
wrote:
What version of Hadoop are you seeing this on?
On 15 May 2015, at 20:03, Puneet Kapoor
I've just been through this exact case with shaded guava in our Mesos setup
and that is how it behaves there (with Spark 1.3.1).
cheers,
Tom
On Fri, May 15, 2015 at 12:04 PM, Marcelo Vanzin van...@cloudera.com
wrote:
On Fri, May 15, 2015 at 11:56 AM, Thomas Dudziak tom...@gmail.com wrote: