What do you mean by "starts delay scheduling"? Are you saying it is no longer
doing local reads?
If that's the case, you can increase the spark.locality.wait timeout.
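For example (a minimal sketch; spark.locality.wait is the standard config key,
and "10s" is just an illustrative value):
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.locality.wait", "10s")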
On Wednesday, November 18, 2015, Renu Yadav wrote:
> Hi ,
> I am using spark 1.4.1 and saving orc file using
>
You could write your own UDF isdate().
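A minimal sketch of such a UDF (assumes a Spark 1.x SQLContext; the "yyyy-MM-dd"
format is an assumption -- adjust it to your data):
import java.text.SimpleDateFormat
sqlContext.udf.register("isdate", (s: String) => {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  fmt.setLenient(false)  // reject dates like 2015-13-45
  try { s != null && { fmt.parse(s); true } }
  catch { case _: java.text.ParseException => false }
})
// usage: sqlContext.sql("SELECT * FROM t WHERE isdate(datecol)")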
--
Ruslan Dautkhanov
On Tue, Nov 17, 2015 at 11:25 PM, Ravisankar Mani wrote:
> Hi Ted Yu,
>
> Thanks for your response. Is there any other way to achieve this in a Spark query?
>
>
> Regards,
> Ravi
>
> On Tue, Nov 17, 2015 at 10:26 AM, Ted Yu
Hi
Thanks for the help.
In my case ...
I want to perform an operation on 30 records per second using Spark Streaming,
and the difference between the keys of records is around 33-34 ms, and my RDD that
has 30 records already has 4 partitions.
Right now my algorithm takes around 400 ms to perform the operation on 1
Interesting.
I will be watching your PR.
On Wed, Nov 18, 2015 at 7:51 AM, 임정택 wrote:
> Ted,
>
> I suspect I hit the issue
> https://issues.apache.org/jira/browse/SPARK-11818
> Could you refer to the issue and verify that it makes sense?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
TD, thank you for your reply.
I agree on the data store requirement. I am using HBase as the underlying store.
So for every batch interval of, say, 10 seconds:
- Calculate the time dimension (minutes, hours, day, week, month and quarter)
along with other dimensions and metrics
- Update the relevant base
Ted,
I suspect I hit the issue https://issues.apache.org/jira/browse/SPARK-11818
Could you refer to the issue and verify that it makes sense?
Thanks,
Jungtaek Lim (HeartSaVioR)
2015-11-18 20:32 GMT+09:00 Ted Yu :
> Here is related code:
>
> private static void
Hi
I want to make sure we use short-circuit local reads for performance. I
have set the LD_LIBRARY_PATH correctly, confirmed that the native libraries
match our platform (i.e. are 64-bit and are loaded successfully). When I
start Spark, I get the following message after increasing the logging
Hi,
Doesn't *SharedSparkContext* come with spark-core? Do I need to include any
special package in the library dependencies for using SharedSparkContext?
I am trying to write a test suite similar to the *LogisticRegressionSuite*
test in Spark ML. Unfortunately, I am unable to import any of
Hi
I am working on a spark POC. I created a ec2 cluster on AWS using
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
Below is a simple Python program. I am running it using an IPython notebook. The
notebook server is running on my Spark master. If I run my program more than
once using my large data set, I
Nikhil,
Please take a look at: https://github.com/holdenk/spark-testing-base
On Wed, Nov 18, 2015 at 2:12 PM, Marcelo Vanzin wrote:
> On Wed, Nov 18, 2015 at 11:08 AM, njoshi wrote:
> > Doesn't *SharedSparkContext* come with spark-core? Do I need
I understand that the following are equivalent
df.filter('account === "acct1")
sql("select * from tempTableName where account = 'acct1'")
But is Spark SQL "smart" to also push filter predicates down for the
initial load?
e.g.
sqlContext.read.jdbc(…).filter('account === "acct1")
Plus this article:
http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
On Wed, Nov 18, 2015 at 2:25 PM, Sourigna Phetsarath <
gna.phetsar...@teamaol.com> wrote:
> Nikhil,
>
> Please take a look at: https://github.com/holdenk/spark-testing-base
>
> On
Hi all,
I have a problem running some algorithms on GraphX. Occasionally, it
stopped running without any errors. The task state is FINISHED, but the
executors' state is KILLED. However, I can see that one job is not finished
yet. It took too much time (minutes) while every job/iteration was
Hi
I started playing with both Apache projects and quickly got this exception.
Is anyone able to give some hint on the problem so that I can dig
further?
It seems to be a problem for Spark to load some of the groovy classes ...
Any idea?
Thanks
Guillaume
tog GroovySpark $
Thanks Marcelo and Sourigna. I saw spark-testing-base being part of
Spark, but it has been included under the test package of spark-core. That caused
the trouble :(.
On Wed, Nov 18, 2015 at 11:26 AM, Sourigna Phetsarath <
gna.phetsar...@teamaol.com> wrote:
> Plus this article:
>
Refer to core/src/test/scala/org/apache/spark/ui/UISuite.scala, around
line 41:
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("test")
  .set("spark.ui.enabled", "true")
Cheers
On Wed, Nov 18, 2015 at 3:05 AM, Ted Yu wrote:
> You can set
Hi,
Thanks for the suggestion -- but those classpath config options only affect
the driver and executor processes -- not the standalone mode daemons (master
and slave). Incidentally, we have the extra jars we need set there.
I went through the docs but couldn't find a place to set extra
You can set spark.ui.enabled config parameter to false.
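For example (a minimal sketch; the setting must be applied before the
SparkContext is created):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("app").set("spark.ui.enabled", "false")
val sc = new SparkContext(conf)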
Cheers
> On Nov 18, 2015, at 1:29 AM, Alex Luya wrote:
>
> I noticed that the bug below has been fixed:
>
> https://issues.apache.org/jira/browse/SPARK-2100
> but how to do it (I mean disabling the SparkUI)
Hey,
I’m wondering if anyone has run into issues with Spark 1.5 and a FileNotFound
exception with shuffle.index files? It’s been cropping up with very large joins
and aggregations, and causing all of our jobs to fail towards the end. The
memory limit for the executors (we’re running on mesos)
Take the executor memory times spark.shuffle.memoryFraction,
and divide the data so that each partition is less than the above.
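For example (illustrative numbers, not taken from this thread): with 8 GB
executors and spark.shuffle.memoryFraction = 0.2, each executor has roughly
1.6 GB of shuffle memory, so 100 GB of shuffled data would need at least
100 / 1.6, i.e. about 63 partitions - in practice you would round up
generously, e.g. repartition to a few hundred.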
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Wed, Nov 18, 2015 at 2:09 PM, Tom Arnfeld wrote:
> Hi Romi,
>
> Thanks! Could you give me an
Here is related code:
private static void checkDefaultsVersion(Configuration conf) {
  if (conf.getBoolean("hbase.defaults.for.version.skip", Boolean.FALSE))
    return;
  String defaultsVersion = conf.get("hbase.defaults.for.version");
  String thisVersion = VersionInfo.getVersion();
One such "lightweight PMML in JSON" is here -
https://github.com/bigmlcom/json-pml. At least for the schema definitions.
But nothing is available in terms of evaluation/scoring. Perhaps this is
something that can form a basis for such a new undertaking.
I agree that distributed models are only
Hi Romi,
Thanks! Could you give me an indication of how much to increase the partitions by?
We’ll take a stab in the dark; the input data is around 5M records (though each
record is fairly small). We’ve had trouble both with DataFrames and RDDs.
Tom.
> On 18 Nov 2015, at 12:04, Romi Kuntsman
I noticed that the bug below has been fixed:
https://issues.apache.org/jira/browse/SPARK-2100
but how to do it (I mean disabling the SparkUI) programmatically?
Is it by sparkContext.setLocalProperty(?,?)?
I also checked the link below, but can't figure out which property to set
To subscribe to the list, you need to send a mail to
user-subscr...@spark.apache.org
(see http://spark.apache.org/community.html for details and a subscribe
link).
On Wed, Nov 18, 2015 at 11:23 AM, Alex Luya
wrote:
>
>
subscribe
The simplest way is to remove all “provided” scopes in your pom,
then run ‘sbt assembly’ to build your final package, then get rid of ‘--jars’
because the assembly already includes all dependencies.
> On Nov 18, 2015, at 2:15 PM, Jack Yang wrote:
>
> So weird. Is there anything wrong with
Have you tried to change the scope to `compile`?
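For example, a sketch of that change in the pom (the artifact coordinates here
are illustrative):
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.2</version>
  <scope>compile</scope> <!-- was: provided -->
</dependency>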
2015-11-18 14:15 GMT+08:00 Jack Yang :
> So weird. Is there anything wrong with the way I made the pom file (I
> labelled them as *provided*)?
>
>
>
> Is there missing jar I forget to add in “--jar”?
>
>
>
> See the trace below:
>
Looks like I can use mapPartitions, but can it be done using
foreachPartition?
On Tue, Nov 17, 2015 at 11:51 PM, swetha wrote:
> Hi,
>
> How to return an RDD of key/value pairs from an RDD that has
> foreachPartition applied. I have my code something like the
It works fine after some changes.
-Thanks,
Swetha
On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das wrote:
> Can you verify that the cluster is running the correct version of Spark,
> 1.5.2?
>
> On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy <
> swethakasire...@gmail.com>
Hi,
I am using Spark 1.4.1 and saving an ORC file using
df.write.format("orc").save("outputlocation")
The output location size is 440GB,
and while reading, df.read.format("orc").load("outputlocation").count
has 2618 partitions.
The count operation runs fine up until 2500 but starts delay scheduling after
When you have the following query, 'account === "acct1" will be pushed down to
generate a new query with “where account = 'acct1'”.
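One minimal way to check this (a sketch; the JDBC URL and table name are made
up):
val df = sqlContext.read.jdbc("jdbc:postgresql://host/db", "accounts",
  new java.util.Properties())
df.filter(df("account") === "acct1").explain()
// the physical plan should list the predicate under PushedFilters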
Thanks.
Zhan Zhang
On Nov 18, 2015, at 11:36 AM, Eran Medan
> wrote:
I understand that the following are equivalent
Hi,
I'm launching a Spark cluster with the spark-ec2 script and playing around
in spark-shell. I'm running the same line of code over and over again, and
getting different results, and sometimes exceptions. Towards the end,
after I cache the first RDD, it gives me the correct result multiple
Hi Feng,
Does airflow allow remote submissions of spark jobs via spark-submit?
On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu
wrote:
> Hi,
>
> we use ‘Airflow' as our job workflow scheduler.
>
>
>
>
> On Nov 19, 2015, at 9:47 AM, Vikram Kone
Thank you TD for your time and help.
SM
> On 19-Nov-2015, at 6:58 AM, Tathagata Das wrote:
>
> There are different ways to do the rollups. Either update rollups from the
> streaming application, or you can generate roll ups in a later process - say
> periodic Spark job
Hi,
we use ‘Airflow' as our job workflow scheduler.
> On Nov 19, 2015, at 9:47 AM, Vikram Kone wrote:
>
> Hi Nick,
> Quick question about spark-submit command executed from azkaban with command
> job type.
> I see that when I press kill in azkaban portal on a
The following works against a hive table from spark sql
hc.sql("select id,r from (select id, name, rank() over (order by name) as
r from tt2) v where v.r >= 1 and v.r <= 12")
But when using a standard SQLContext against a temporary table, the
following occurs:
Exception in thread "main"
Hi LCassa,
Try:
Map to pair, then reduce by key.
The Spark documentation is a pretty good reference for this, and there are plenty
of word count examples on the internet.
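For example, a minimal word-count sketch of that pattern in Scala (assumes an
existing SparkContext named sc and an illustrative input path):
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))    // map to pair
  .reduceByKey(_ + _)        // reduce by key
counts.take(10).foreach(println)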
Warm regards,
TimB
From: Cassa L [mailto:lcas...@gmail.com]
Sent: Thursday, 19 November 2015 11:27 AM
To: user
Subject: how
njoshi wrote:
> I am testing the LogisticRegression performance on synthetically
> generated data.
Hmm, seems like a good idea. Can you give the code for generating the
training data?
best,
Robert Dodier
If possible, could you give us the root cause and solution for future
readers of this thread.
On Wed, Nov 18, 2015 at 6:37 AM, swetha kasireddy wrote:
> It works fine after some changes.
>
> -Thanks,
> Swetha
>
> On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das
Yes, you can submit job remotely.
> On Nov 19, 2015, at 10:10 AM, Vikram Kone wrote:
>
> Hi Feng,
> Does airflow allow remote submissions of spark jobs via spark-submit?
>
> On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu
For this simple example, we are importing 4 lines of 3 columns of a CSV file:
Administrator,FiveHundredAddresses1,92121
Ann,FiveHundredAddresses2,92109
Bobby,FiveHundredAddresses3,92101
Charles,FiveHundredAddresses4,92111
We are running spark-1.5.1-bin-hadoop2.6 with master and one slave, and
There are different ways to do the rollups: either update rollups from the
streaming application, or generate rollups in a later process -
say, a periodic Spark job every hour - or just generate rollups on
demand, when they are queried.
The whole thing depends on your downstream
Hi Nick,
Quick question about spark-submit command executed from azkaban with
command job type.
I see that when I press kill in the Azkaban portal on a spark-submit job, it
doesn't actually kill the application on the Spark master, and it continues to
run even though Azkaban thinks that it's killed.
How do
I think you can use mapPartitions to return pair RDDs, followed by
foreachPartition for saving them; a sketch is below.
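A minimal sketch of that shape (assumes an RDD[String] named lines; the key
function and println stand in for your real pairing and save logic):
val pairs = lines.mapPartitions { iter =>
  iter.map(line => (line.length, line))  // build key/value pairs per partition
}
pairs.foreachPartition { iter =>
  iter.foreach { case (k, v) => println(s"$k -> $v") }  // replace with your sink
}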
On Wed, Nov 18, 2015 at 9:31 AM swetha kasireddy
wrote:
> Looks like I can use mapPartitions, but can it be done using
> foreachPartition?
>
> On Tue, Nov 17, 2015 at
--
*VJ Anand*
*Founder *
*Sankia*
vjan...@sankia.com
925-640-1340
www.sankia.com
Looks like Groovy scripts don't serialize over the wire properly.
Back in 2011 I hooked Groovy up to MapReduce, so that you could do mappers and
reducers there: "grumpy"
https://github.com/steveloughran/grumpy
slides: http://www.slideshare.net/steve_l/hadoop-gets-groovy
What I ended up doing
We solved this by adding to the spark-class script. At the bottom, before the exec
statement, we intercepted the command that was constructed and injected our
additional classpath:
for ((i=0; i<${#CMD[@]}; i++));
do
  if [[ ${CMD[$i]} == *"$SPARK_ASSEMBLY_JAR"* ]]
  then
If I tried to change “provided” to “compile”, then the error changed to:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing class
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
Hi,
I have a data stream (JavaDStream) in following format-
timestamp=second1, map(key1=value1, key2=value2)
timestamp=second2,map(key1=value3, key2=value4)
timestamp=second2, map(key1=value1, key2=value5)
I want to group data by 'timestamp' first and then filter each RDD for
Key1=value1 or
Hi,
We have a lot of temp files that get created due to shuffles caused by
group by. How do we clear the files that get created by intermediate
operations in group by?
Thanks,
Swetha
Checked out 1.6.0-SNAPSHOT 60 minutes ago
2015-11-18 19:19 GMT-08:00 Jack Yang :
> Which version of spark are you using?
>
>
>
> *From:* Stephen Boesch [mailto:java...@gmail.com]
> *Sent:* Thursday, 19 November 2015 2:12 PM
> *To:* user
> *Subject:* Do windowing functions
Hey Jeff, in addition to what Sandy said, there are two more reasons that
this might not be as bad as it seems; I may be incorrect in my
understanding though.
First, the "additional step" you're referring to is not likely to be adding
any overhead; the "extra map" is really just materializing the
Dear Ted, I just looked at the link you provided; it is great!
To check my understanding: could I also directly use other parts of Breeze (besides
the Spark MLlib linalg package) in a Spark (Scala or Java) program after importing
the Breeze package? Is that right?
Thanks a lot in advance again! Zhiliang
On
But to focus the attention properly: I had already tried out 1.5.2.
2015-11-18 19:46 GMT-08:00 Stephen Boesch :
> Checked out 1.6.0-SNAPSHOT 60 minutes ago
>
> 2015-11-18 19:19 GMT-08:00 Jack Yang :
>
>> Which version of spark are you using?
>>
>>
>>
>>
Yes they do.
On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch wrote:
> But to focus the attention properly: I had already tried out 1.5.2.
>
> 2015-11-18 19:46 GMT-08:00 Stephen Boesch :
>
>> Checked out 1.6.0-SNAPSHOT 60 minutes ago
>>
>> 2015-11-18 19:19
SQLContext only implements a subset of the SQL functions, not including
window functions.
In HiveContext it is fine though.
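A sketch of the fix with the Spark 1.5-era API (assumes a Hive-enabled build
and that tt2 is registered as a table):
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.sql("select id, r from (select id, name, rank() over (order by name) as r " +
  "from tt2) v where v.r >= 1 and v.r <= 12").show()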
From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, 19 November 2015 3:01 PM
To: Michael Armbrust
Cc: Jack Yang; user
Subject: Re: Do windowing functions
Given there is no existing Groovy integration out there, I'd tend to agree to
use Scala if possible - the basics of functional-style Groovy are fairly similar
to Scala.
—
Sent from Mailbox
On Wed, Nov 18, 2015 at 11:52 PM, Steve Loughran
wrote:
> Looks like groovy
Dear Jack,
As is known, Breeze is a numerical calculation package written in Scala; Spark
MLlib also uses it as the underlying package for linear algebra. Here I am also
preparing to use Breeze for nonlinear equation optimization; however, it seems
that I cannot find the exact docs or API for Breeze
Hi Jeff,
Many access patterns simply take the result of hadoopFile and use it to
create some other object, and thus have no need for each input record to
refer to a different object. In those cases, the current API is more
performant than an alternative that would create an object for each
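A minimal sketch of the reuse caveat under discussion (the path and Text types
are illustrative):
import org.apache.hadoop.io.Text
// Hadoop input formats reuse the same Writable instance for every record,
// so copy the data (e.g. via toString) before caching or collecting:
val raw = sc.sequenceFile("hdfs:///path", classOf[Text], classOf[Text])
val safe = raw.map { case (k, v) => (k.toString, v.toString) }
safe.cache()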
Have you looked at
https://github.com/scalanlp/breeze/wiki
Cheers
> On Nov 18, 2015, at 9:34 PM, Zhiliang Zhu wrote:
>
> Dear Jack,
>
> As is known, Breeze is a numerical calculation package written in Scala; Spark
> MLlib also uses it as the underlying package for algebra
Which version of spark are you using?
From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, 19 November 2015 2:12 PM
To: user
Subject: Do windowing functions require hive support?
The following works against a hive table from spark sql
hc.sql("select id,r from (select id, name,
Why is the same query (and actually I tried several variations) working
against a HiveContext and not against the SQLContext?
2015-11-18 19:57 GMT-08:00 Michael Armbrust :
> Yes they do.
>
> On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch wrote:
>
>>
Back to my question. If I use “provided”, the jar file will
expect some libraries to be provided by the system.
However, “compile” is the default setting, which means
the third-party library will be included inside the jar file after compiling.
So when I use “provided”, the error is they
Hi all,
I want to monitor Spark to get the following:
1. All the GC stats for Spark JVMs
2. Records successfully processed in a batch
3. Records failed in a batch
4. Historical data for batches, jobs, stages, tasks, etc.
Please let me know how I can get this information in Spark.
Regards,
Hi Steve
Since you are familiar with Groovy, I will go a bit deeper into the details. My
(simple) Groovy scripts are working fine with Apache Spark - a closure
(when dehydrated) will nicely serialize.
My issue comes when I want to use GroovyShell to run my scripts (my
ultimate goal is to integrate with
Dear Friends,
I am struggling with Spark Twitter streaming. I am not getting any data.
Please correct the code below if you find any mistakes.
import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
Hello Spark experts!
I am new to Spark and I have the following query...
What I am trying to do: run a Spark 1.5.1 job local[*] on a 4-core CPU.
This will ping an Oracle database and fetch 5000 records at a time into a JdbcRDD. I
increase the number of partitions by 1 for every 5000 records I fetch.
I
Have you seen SPARK-5836 ?
Note TD's comment at the end.
Cheers
On Wed, Nov 18, 2015 at 7:28 PM, swetha wrote:
> Hi,
>
> We have a lot of temp files that get created due to shuffles caused by
> group by. How do we clear the files that get created by intermediate
>
Methods like first() and take(n) can't guarantee to return the same result
in a distributed context, because Spark uses an algorithm to grab data from
one or more partitions that involves running a distributed job over the
cluster, with tasks on the nodes where the chosen partitions are located.
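A minimal illustration (a sketch; assumes an existing SparkContext sc and an
illustrative input path). After a shuffle, the order of elements within a
partition is not guaranteed, so repeated calls may differ:
val grouped = sc.textFile("data.txt")
  .map(line => (line.length, line))
  .groupByKey()
println(grouped.first())  // which record comes back first can vary across runs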
Deep copying the data solved the issue:
data.map { r =>
  val t = SpecificData.get().deepCopy(r.getSchema, r)
  (t.id, List(t))
}.reduceByKey(_ ++ _)
(noted here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003)
Thanks Igor Berman, for