Hi Ted and Grace, I retried with Spark 1.4.0 and it still failed in the same way.
Here is a log, FYI. What other details would help? BTW, is running the HiBench
test a necessary step for my Spark cluster? I also tried to
skip building HiBench and just execute bin/run-all.sh, but also got
Hi everyone,
I can't get 'day of year' when using a Spark SQL query. Can you suggest any way
to get the day of year?
Regards,
Ravi
This is most likely due to the internal implementation of ALS in MLlib. Probably
for each parallel unit of execution (partition in Spark terms) the
implementation allocates and uses a RAM buffer where it keeps interim results
during the ALS iterations
If we assume that the size of that
HI Vinod,
Yes. If you want to use a Scala or Python function, you need to run that block of
code.
Only Hive UDFs are available permanently.
Thanks,
Vishnu
On Wed, Jul 8, 2015 at 5:17 PM, vinod kumar vinodsachin...@gmail.com
wrote:
Thanks Vishnu,
When I restart the service the UDF is not accessible
Hi,
Just updating back that setting spark.driver.extraClassPath worked.
Thanks,
Daniel
On Fri, Jul 3, 2015 at 5:35 PM, Ted Yu yuzhih...@gmail.com wrote:
Alternatively, setting spark.driver.extraClassPath should work.
Cheers
On Fri, Jul 3, 2015 at 2:59 AM, Steve Loughran
Hi,
I am new to Spark. I have done the following tests and I am confused about the
conclusions. I have 2 queries.
Following are the details of the tests:
Test 1) Used an 11-node cluster where each machine has 64 GB RAM and 4
physical cores. I ran an ALS algorithm using MLlib on a 1.6 GB data set. I
ran 10 executors
Thanks Vishnu,
When I restart the service the UDF is not accessible by my query. I need to
run the mentioned block again to use the UDF.
Is there any way to keep a UDF in sqlContext permanently?
Thanks,
Vinod
On Wed, Jul 8, 2015 at 7:16 AM, VISHNU SUBRAMANIAN
johnfedrickena...@gmail.com
Thank you for the quick response Vishnu,
I have the following doubts too.
1. Is there any way to upload files to HDFS programmatically using the C#
language?
2. Is there any way to automatically load a Scala block of code (for a UDF)
when I start the Spark service?
-Vinod
On Wed, Jul 8, 2015 at 7:57 AM,
Are you sure you have actually increased the RAM (how exactly did you do that,
and does it show in the Spark UI)?
Also use the Spark UI and the driver console to check the RAM allocated for
each RDD and RDD partition in each of the scenarios.
Re b) the general rule is num of partitions = 2 x
Hi All,
I'd like to use the hive context in the spark shell. I need to recreate the
hive meta database in the same location, so I want to close the derby
connection previously created in the spark shell. Is there any way to do this?
I tried this, but it does not work:
Hi there,
My name is Oleh Rozvadovskyy. I represent CyberVision Inc., an IoT company
and the developer of the Kaa IoT platform, which is open-source middleware for
smart devices and servers. In two weeks we're going to run a
webinar *IoT data ingestion in Spark Streaming using Kaa on Thu,
Also try to increase the number of partitions gradually – not in one big jump
from 20 to 100 but by adding e.g. 10 at a time, and see whether there is a
correlation with adding more RAM to the executors
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Wednesday, July 8, 2015 1:26 PM
To:
Hi,
sqlContext.udf.register("udfname", functionname _)
example:
def square(x: Int): Int = { x * x }
register udf as below
sqlContext.udf.register("square", square _)
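A minimal sketch of calling the registered UDF from SQL afterwards (the table name people and the column age are placeholders, assuming a temp table was registered earlier):

  // assuming a DataFrame df was registered beforehand, e.g.
  // df.registerTempTable("people")
  val squared = sqlContext.sql("SELECT name, square(age) AS age_squared FROM people")
  squared.show()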
Thanks,
Vishnu
On Wed, Jul 8, 2015 at 2:23 PM, vinod kumar vinodsachin...@gmail.com
wrote:
Hi Everyone,
I am new to spark. May I
Hi
I am new to Spark. Following is the problem that I am facing:
Test 1) I ran a VM on a CDH distribution with only 1 core allocated to it, and
I ran a simple Streaming example in spark-shell, sending data on a
port and trying to read it. With 1 core allocated to this, nothing happens
in my
should I add dependencies for
spark-core_2.10, spark-yarn_2.10, spark-streaming_2.10,
org.apache.spark:spark-mllib_2.10, :spark-hive_2.10, :spark-graphx_2.10 in
pom.xml? If yes, there are 7 pom.xml files in HiBench, listed below; which one
should I modify?
[root@spark-study HiBench-master]# find ./ -name
My apologies for double posting but I missed the web links that I followed,
which are:
1. http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/
2. http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
3.
Thank you Akhil for the link
Sincerely,
Ashish Dutt
PhD Candidate
Department of Information Systems
University of Malaya, Lembah Pantai,
50603 Kuala Lumpur, Malaysia
On Wed, Jul 8, 2015 at 3:43 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Have a look
Hi, I am looking at how to load data into redshift. Thanks
On Wednesday, July 8, 2015 12:47 AM, shahab shahab.mok...@gmail.com
wrote:
Hi,
I did some experiments with loading data from s3 into spark. I loaded data from
s3 using sc.textFile(). Have a look at the following code
Thanks. Actually I've found the way. I'm using spark-submit to submit the
job to a YARN cluster with --master yarn-cluster (so the spark-submit
process is not the driver). So I can set
spark.yarn.submit.waitAppCompletion to false so that the process will
exit after the job is submitted.
ayan
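For reference, a rough sketch of that kind of invocation (the script name is a placeholder; spark.yarn.submit.waitAppCompletion is the setting mentioned above):

  spark-submit \
    --master yarn-cluster \
    --conf spark.yarn.submit.waitAppCompletion=false \
    my_job.py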
Hi,
We are using spark thrift server as a hive replacement.
One of the things we have with hive is that different users can connect with
their own usernames/passwords and get appropriate permissions.
So on the same server, one user may have a query that will have permissions to
run, while the
hi,
I'm running a spark standalone cluster (1.4.0). I have some applications
running on a scheduler every hour. I found that on one of the executions,
the job was marked FINISHED after a very few seconds (instead of ~5 minutes),
and in the logs on the master I can see the following exception:
seems you're correct:
2015-07-07 17:21:27,245 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Container [pid=38506,containerID=container_1436262805092_0022_01_03]
is running beyond virtual memory limits. Current usage: 4.3 GB of 4.5 GB
Hello Sooraj,
I see you are using ipython notebook.
Can you tell me, are you on a Windows OS or a Linux-based OS? I am using Windows
7 and I am new to Spark.
I am trying to connect ipython with my local cluster based on CDH 5.4. I
followed the tutorials here, but they are written for a linux environment
Have a look at
http://alvinalexander.com/scala/how-to-create-java-thread-runnable-in-scala,
create two threads and call thread1.start(), thread2.start()
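A rough sketch of that suggestion in Scala (loadTable1/loadTable2 stand in for whatever loading code you have; each thread simply submits its own Spark job from the driver):

  // run the two table loads concurrently on separate driver threads
  val thread1 = new Thread(new Runnable { def run(): Unit = loadTable1() })
  val thread2 = new Thread(new Runnable { def run(): Unit = loadTable2() })
  thread1.start(); thread2.start()
  thread1.join(); thread2.join()   // wait for both loads to finish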
Thanks
Best Regards
On Wed, Jul 8, 2015 at 1:06 PM, Ashish Dutt ashish.du...@gmail.com wrote:
Thanks for your reply Akhil.
How do you
Hi,
I've been experimenting with the Spark Word2Vec implementation in the
MLLib package.
It seems to me that only the preparatory steps are actually performed in
a distributed way, i.e. stages 0-2 that prepare the data. In stage 3
(mapPartitionsWithIndex at Word2Vec.scala:312), only one node seems
Hi,
I need to upgrade spark version 1.3 to version 1.4 on CDH 5.4.
I checked the documentation here
That turned out to be a silly data type mistake. At one point in the
iterative call, I was passing an integer value for the parameter 'alpha' of
the ALS train API, which was expecting a Double. So, py4j in fact
complained that it cannot take a method that takes an integer value for
that parameter.
Hi
I am a beginner to scala and spark. I am trying to set up an eclipse environment to
develop a spark program in scala, then take its jar for spark-submit.
How shall I start? To start, my task includes setting up eclipse for scala and
spark, getting dependencies resolved, and building the project using
Hello Prateek,
I started by getting the pre-built binaries so as to skip the hassle of
building them from scratch.
I am not familiar with scala so can't comment on it.
I have documented my experiences on my blog www.edumine.wordpress.com
Perhaps it might be useful to you.
On 08-Jul-2015 9:39
Hi, Evgeny,
I reported a JIRA issue for your problem:
https://issues.apache.org/jira/browse/SPARK-8897. You can track it to see how
it will be solved.
Ray
-Original Message-
From: Evgeny Sinelnikov [mailto:esinelni...@griddynamics.com]
Sent: Monday, July 6, 2015 7:27 PM
To:
Hi,
We have a cluster with 4 nodes. The cluster uses CDH 5.4. For the past two
days I have been trying to connect my laptop to the server using the spark
master ip:port, but it's been unsuccessful.
The server contains data that needs to be cleaned and analysed.
The cluster and the nodes are on linux
That was a) fuzzy b) insufficient – one can certainly use foreach (only) on
DStream RDDs – it works, as an empirical observation.
As another empirical observation:
foreachPartition results in having one instance of the lambda/closure per
partition when e.g. publishing to output systems
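As a hedged illustration of that point (createConnection/send are placeholders for whatever client your output system uses):

  // one client instance per partition, reused for every record in that partition
  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val conn = createConnection()
      records.foreach(rec => conn.send(rec))
      conn.close()
    }
  }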
Richard,
That's exactly the strategy I've been trying, which is a wrapper singleton
class. But I was seeing the inner object being created multiple times.
I wonder if the problem has to do with the way I'm processing the RDDs.
I'm using JavaDStream to stream data (from Kafka). Then I'm
My singletons do in fact stick around. They're one per worker, looks like.
So with 4 workers running on the box, we're creating one singleton per
worker process/jvm, which seems OK.
Still curious about foreachPartition vs. foreachRDD though...
On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher
These are quite different operations. One operates on RDDs in DStream and
one operates on partitions of an RDD. They are not alternatives.
On Wed, Jul 8, 2015, 2:43 PM dgoldenberg dgoldenberg...@gmail.com wrote:
Is there a set of best practices for when to use foreachPartition vs.
foreachRDD?
These are quite different operations. One operates on RDDs in DStream and
one operates on partitions of an RDD. They are not alternatives.
Sean, different operations as they are, they can certainly be used on the
same data set. In that sense, they are alternatives. Code can be written
using one
To set up Eclipse for Spark you should install the Scala IDE plugins:
http://scala-ide.org/download/current.html
Define your project in Maven with Scala plugins configured (you should be
able to find documentation online) and import as an existing Maven project.
The source code should be in
Your tableLoad() APIs are not actions. The file will be read fully only when an
action is performed.
If the action is something like table1.join(table2), then I think both
files will be read in parallel.
Can you try that and look at the execution plan? In 1.4 this is shown in the
Spark UI.
Srikanth
On
Hey,
I have quite a few jobs appearing in the web-ui with the description run at
ThreadPoolExecutor.java:1142.
Are these generated by SparkSQL internally?
There are so many that they cause a RejectedExecutionException when the
thread-pool runs out of space for them.
RejectedExecutionException
Hello.
I have an issue with CustomKryoRegistrator, which causes ClassNotFound on the
Worker.
The issue is resolved if I call SparkConf.setJar with the path to the same jar I run.
It is a workaround, but it requires specifying the same jar file twice. The
first time I use it to actually run the job, and
I am using spark 1.4.1rc1 with default hive settings
Thanks
- Terry
Hi All,
I'd like to use the hive context in the spark shell. I need to recreate the
hive meta database in the same location, so I want to close the derby
connection previously created in the spark shell. Is there any way to do this?
Thank you, Ray,
but it is already created and almost fixed:
https://issues.apache.org/jira/browse/SPARK-8840
On Wed, Jul 8, 2015 at 4:04 PM, Sun, Rui rui@intel.com wrote:
Hi, Evgeny,
I reported a JIRA issue for your problem:
https://issues.apache.org/jira/browse/SPARK-8897. You can
Thanks, Sean.
are you asking about foreach vs foreachPartition? that's quite
different. foreachPartition does not give more parallelism but lets
you operate on a whole batch of data at once, which is nice if you
need to allocate some expensive resource to do the processing
This is basically what
The point of running them in parallel would be faster creation of the
tables. Has anybody been able to efficiently parallelize something like
this in Spark?
On Jul 8, 2015 12:29 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
What's the point of creating them in parallel? You can multi-thread it
@Evo There is no foreachRDD operation on RDDs; it is a method of
DStream. It gives each RDD in the stream. RDD has a foreach, and
foreachPartition. These give elements of an RDD. What do you mean it
'works' to call foreachRDD on an RDD?
@Dmitry are you asking about foreach vs foreachPartition?
Thanks, Cody. The good boy comment wasn't from me :) I was the one
asking for help.
On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger c...@koeninger.org wrote:
Sean already answered your question. foreachRDD and foreachPartition are
completely different, there's nothing fuzzy or insufficient
What I don't seem to get is how my code ends up on the Worker node.
My understanding was that the jar file which I use to start the job should
automatically be copied to the Worker nodes and added to the classpath. It seems
that is not the case. But if my jar is not copied to the Worker nodes, then how
A DStream must be Serializable because of metadata checkpointing. But you can use
KryoSerializer for data checkpointing: data checkpointing uses
RDD.checkpoint, and its serializer can be set via spark.serializer.
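A minimal sketch of switching that serializer (a standard SparkConf setting, nothing app-specific; the app name is a placeholder):

  import org.apache.spark.SparkConf
  // use Kryo instead of Java serialization for the checkpointed RDD data
  val conf = new SparkConf()
    .setAppName("streaming-app")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")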
Best Regards,
Shixiong Zhu
2015-07-08 3:43 GMT+08:00 Chen Song chen.song...@gmail.com:
In Spark
Do you have a benchmark to show that running these two statements as they are will
be slower than what you suggest?
On 9 Jul 2015 01:06, Brandon White bwwintheho...@gmail.com wrote:
The point of running them in parallel would be faster creation of the
tables. Has anybody been able to efficiently
Thanks for your reply Akhil.
How do you multithread it?
Sincerely,
Ashish Dutt
On Wed, Jul 8, 2015 at 3:29 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
What's the point of creating them in parallel? You can multi-thread it and run
it in parallel though.
Thanks
Best Regards
On Wed, Jul 8,
Sorry, I misunderstood.
best,
/Shahab
On Wed, Jul 8, 2015 at 9:52 AM, spark user spark_u...@yahoo.com wrote:
Hi, I am looking at how to load data into redshift.
Thanks
On Wednesday, July 8, 2015 12:47 AM, shahab shahab.mok...@gmail.com
wrote:
Hi,
I did some experiments with loading
Hi Everyone,
I am new to spark. May I know how to define and use a User Defined Function in
Spark SQL?
I want to use the defined UDF in sql queries.
My Environment
Windows 8
spark 1.3.1
Warm Regards,
Vinod
Hi Ashish,
I am running an ipython notebook server on one of the nodes of the cluster
(HDP). Setting it up was quite straightforward, and I guess I followed the
same references that you linked to. Then I access the notebook remotely
from my development PC. Never tried to connect a local ipython (on
Hi all,
What is the most commonly used tool/product for benchmarking a spark job?
One option is the databricks/spark-perf project
https://github.com/databricks/spark-perf
2015-07-08 11:23 GMT-07:00 MrAsanjar . afsan...@gmail.com:
Hi all,
What is the most commonly used tool/product for benchmarking a spark job?
An RDD[Double] is an abstraction for a large collection of doubles, possibly
distributed across multiple nodes. The DoubleRDDFunctions are there for
performing mean and variance calculations across this distributed dataset.
In contrast, a Vector is not distributed and fits on your local machine.
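A tiny illustration of the distributed side (the values are made up; mean/variance come from DoubleRDDFunctions via the usual implicit conversion):

  val xs = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
  println(xs.mean())      // 2.5
  println(xs.variance())  // 1.25 (population variance)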
Take a look at
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse
On Wed, Jul 8, 2015 at 7:47 AM, Daniel Siegmann daniel.siegm...@teamaol.com
wrote:
To set up Eclipse for Spark you should install the Scala IDE plugins:
Thanks for the clarification, Cody!
On Mon, Jul 6, 2015 at 6:44 AM, Cody Koeninger c...@koeninger.org wrote:
You shouldn't rely on being able to restart from a checkpoint after
changing code, regardless of whether the change was explicitly related to
serialization.
If you are relying on
Hi Julian,
I recently built a Python+Spark application to do search relevance
analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on
EC2 (so I don't use the PySpark shell, hopefully that's what you are looking
for). Can't share the code, but the basic approach is covered in
I just asked this question at the streaming webinar that just ended, but
the speakers didn't answer, so I'm throwing it out here:
AFAIK checkpoints are the only recommended method for running Spark
streaming without data loss. But it involves serializing the entire dstream
graph, which prohibits any logic
You can use DStream.transform for some stuff. Transform takes an RDD => RDD
function that allows arbitrary RDD operations to be done on the RDDs of a
DStream. This function gets evaluated on the driver on every batch
interval. If you are smart about writing the function, it can do different
stuff at
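A hedged sketch of the pattern being described (events is an assumed DStream[String]; the blacklist is just an example of driver-side state the function can re-read each batch):

  @volatile var blacklist: Set[String] = Set("bad-id")
  val cleaned = events.transform { rdd =>
    val current = blacklist                 // read on the driver, once per batch
    rdd.filter(e => !current.contains(e))   // runs on the executors
  }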
1. Is creating a read-only singleton object in each map function the same
as a broadcast object, since a singleton never gets garbage collected unless
the executor gets shut down? The aim is to avoid creating a complex object at each
batch interval of a spark streaming app.
2. Why does JavaStreamingContext's sc()
Hi TD, you answered the wrong question. If you read the subject, mine was
specifically about checkpointing. I'll elaborate.
The checkpoint, which is a serialized DStream DAG, contains all the
metadata and *logic*, like the function passed to e.g. DStream.transform()
This is serialized as a
Resolved that compilation issue using AvroKey and AvroKeyInputFormat.
val avroDs = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](input)
What's the best practice for creating an RDD from some external unix command's
output? I assume if the output size is large (say millions of lines),
creating the RDD from an array of all the lines is not a good idea? Thanks!
Great post, thanks for sharing with us!
On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal sujitatgt...@gmail.com wrote:
Hi Julian,
I recently built a Python+Spark application to do search relevance
analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on
EC2 (so I don't use the
Hey Jong,
No, I did answer the right question. What I explained does not change the JVM
classes (that is, the function is the same) but it still ensures that the
computation is different (the filters get updated with time). So you can
checkpoint this and recover from it. This is ONE possible way to do
Here's a related JIRA: https://issues.apache.org/jira/browse/SPARK-7819
Typically you can work around this by making sure that the classes are
shared across the isolation boundary, as discussed in the comments.
On Tue, Jul 7, 2015 at 3:29 AM, Sea
Responses inline.
On Wed, Jul 8, 2015 at 10:26 AM, Shushant Arora shushantaror...@gmail.com
wrote:
1. Is creating a read-only singleton object in each map function the same
as a broadcast object, since a singleton never gets garbage collected unless
the executor gets shut down? The aim is to avoid creating
As a distributed data processing engine, Spark should be fine with millions
of lines. It's built with the idea of massive data sets in mind. Do you
have more details on how you anticipate the output of a unix command
interacting with a running Spark application? Do you expect Spark to be
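One hedged sketch of the kind of approach being discussed, which avoids holding every line on the driver (command and paths are placeholders; the output path should be somewhere Spark can read, e.g. HDFS or a location visible to the executors): redirect the command's output to a file, then read it with sc.textFile.

  import scala.sys.process._
  // write the external command's output to a file instead of an in-memory array
  ("some_command" #> new java.io.File("/tmp/cmd_output.txt")).!
  val lines = sc.textFile("file:///tmp/cmd_output.txt")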
Can you use a cron job to update it every X minutes?
On Wed, Jul 8, 2015 at 2:23 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
Hi all – I’m just wondering if anyone has had success integrating Spark
Streaming with Zeppelin and actually dynamically updating the data in near
real-time.
The error is JVM has not responded after 10 seconds.
On 08-Jul-2015 10:54 PM, ayan guha guha.a...@gmail.com wrote:
What's the error you are getting?
On 9 Jul 2015 00:01, Ashish Dutt ashish.du...@gmail.com wrote:
Hi,
We have a cluster with 4 nodes. The cluster uses CDH 5.4 for the past two
Getting an exception when writing an RDD to local disk using the following function:
saveAsTextFile("file:home/someuser/dir2/testupload/20150708/")
The dir (/home/someuser/dir2/testupload/) was created before running the
job. The error message is misleading.
org.apache.spark.SparkException: Job aborted
My job completed in 40 seconds, which is not correct as there is no output.
I see
Exception in thread main akka.actor.ActorNotFound: Actor not found for:
ActorSelection[Anchor(akka.tcp://sparkDriver@10.115.86.24:54737/),
Path(/user/OutputCommitCoordinator)]
at
Ah, I see this is streaming. I haven't any practical experience with that
side of Spark. But the foreachPartition idea is a good approach. I've used
that pattern extensively, even though not for singletons, but just to
create non-serializable objects like API and DB clients on the executor
side. I
I was thinking the same thing! Try sc.setLogLevel("ERROR")
On Wed, Jul 8, 2015 at 2:01 PM, Lincoln Atkinson lat...@microsoft.com
wrote:
“WARN Executor: Told to re-register on heartbeat” is logged repeatedly
in the spark shell, which is very distracting and corrupts the display of
whatever set
Hi Lincoln, I've noticed this myself. I believe it's a new issue that only
affects local mode. I've filed a JIRA to track it:
https://issues.apache.org/jira/browse/SPARK-8911
2015-07-08 14:20 GMT-07:00 Lincoln Atkinson lat...@microsoft.com:
Brilliant! Thanks.
*From:* Feynman Liang
I'm trying to submit a spark job from a different server outside of my Spark
Cluster (running spark 1.4.0, hadoop 2.4.0 and YARN) using the spark-submit
script :
spark/bin/spark-submit --master yarn-client --executor-memory 4G
myjobScript.py
The thing is that my application never passes from the
"WARN Executor: Told to re-register on heartbeat" is logged repeatedly in the
spark shell, which is very distracting and corrupts the display of whatever set
of commands I'm currently typing out.
Is there an option to disable the logging of this message?
Thanks,
-Lincoln
Brilliant! Thanks.
From: Feynman Liang [mailto:fli...@databricks.com]
Sent: Wednesday, July 08, 2015 2:15 PM
To: Lincoln Atkinson
Cc: user@spark.apache.org
Subject: Re: Disable heartbeat messages in REPL
I was thinking the same thing! Try sc.setLogLevel("ERROR")
On Wed, Jul 8, 2015 at 2:01 PM,
Hi,
I am using the MLlib collaborative filtering API on an implicit preference data
set. From a pySpark notebook, I am iteratively creating the matrix
factorization model with the aim of measuring the RMSE for each combination
of parameters for this API like the rank, lambda and alpha. After the code
It's showing connection refused; for some reason it was not able to connect
to the machine. Either it's the machine's start-up time or it's an issue with the
security group.
Thanks
Best Regards
On Wed, Jul 8, 2015 at 2:04 AM, Pagliari, Roberto rpagli...@appcomsci.com
wrote:
I'm following the tutorial
What's the point of creating them in parallel? You can multi-thread it and run
it in parallel though.
Thanks
Best Regards
On Wed, Jul 8, 2015 at 5:34 AM, Brandon White bwwintheho...@gmail.com
wrote:
Say I have a spark job that looks like the following:
def loadTable1() {
val table1 =
Hi all – I’m just wondering if anyone has had success integrating Spark
Streaming with Zeppelin and actually dynamically updating the data in near
real-time. From my investigation, it seems that Zeppelin will only allow you to
display a snapshot of data, not a continuously updating table. Has
Hi,
I get the error "DLL load failed: %1 is not a valid win32 application"
whenever I invoke pyspark. Attached is a screenshot of the same.
Is there any way I can get rid of it? Still being new to PySpark, I have
had a not so pleasant experience so far, most probably because I am on a
windows
try the spark-datetime package:
https://github.com/SparklineData/spark-datetime
Follow this example
https://github.com/SparklineData/spark-datetime#a-basic-example to get the
different attributes of a DateTime.
On Wed, Jul 8, 2015 at 9:11 PM, prosp4300 prosp4...@163.com wrote:
As mentioned in
You are most likely confused because you are using the UDF through a
HiveContext. In your case, you are using a Spark UDF, not a Hive UDF. For a
naive scenario, I can use spark UDFs without any hive installation in my
cluster.
sqlContext.udf.register is for UDFs in spark. Hive UDFs are stored in Hive
and
following function
saveAsTextFile("file:home/someuser/dir2/testupload/20150708/")
The dir (/home/someuser/dir2/testupload/) was created before running the
job. The error message is misleading.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in
stage 0.0 failed 4
Hi,
I get the error "DLL load failed: %1 is not a valid win32 application"
whenever I invoke pyspark. Attached is a screenshot of the same.
Is there any way I can get rid of it? Still being new to PySpark, I have
had a not so pleasant experience so far, most probably because I am on a
windows
Hi Ashish,
Nice post.
Agreed, kudos to the author of the post, Benjamin Benfort of District Labs.
Following your post, I get this problem;
Again, not my post.
I did try setting up IPython with the Spark profile for the edX Intro to
Spark course (because I didn't want to use the Vagrant
Hi everyone,
I can't get 'day of year' when using a Spark query. Can you suggest any way to
get the day of year?
Regards,
Ravi
As mentioned in the Spark SQL programming guide, Spark SQL supports Hive UDFs;
please take a look at the builtin UDFs of Hive below. Getting the day of year should be as
simple as in an existing RDBMS
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
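For example, with only the builtin Hive date functions on that page (datediff, year, concat, cast), day of year can be computed roughly like this (dt and events are placeholder column/table names, with dt assumed to be in yyyy-MM-dd form):

  sqlContext.sql(
    "SELECT datediff(dt, concat(cast(year(dt) - 1 as string), '-12-31')) AS day_of_year FROM events")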
At 2015-07-09
bq. return new
Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
I don't think Put is serializable.
FYI
On Fri, Jun 12, 2015 at 6:40 AM, Vamshi Krishna vamshi2...@gmail.com
wrote:
Hi I am trying to write data that is
Convert the column to a column of java Timestamps. Then you can do the
following
import java.sql.Timestamp
import java.util.Calendar
def date_trunc(timestamp:Timestamp, timeField:String) = {
timeField match {
case "hour" =>
val cal = Calendar.getInstance()
Doesn't seem to be a Spark problem, assuming TDigest comes from mahout.
Cheers
On Wed, Jul 8, 2015 at 7:49 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Same exception with different values of compression (10,100)
var digest: TDigest = TDigest.createAvlTreeDigest(100)
On Wed, Jul 8, 2015 at
I have Spark 1.4 deployed on AWS EMR but the SparkR DataFrame read.df
method cannot load data from aws s3.
1) read.df error message
read.df(sqlContext, "s3://some-bucket/some.json", "json")
15/07/09 04:07:01 ERROR r.RBackendHandler: loadDF on
org.apache.spark.sql.api.r.SQLUtils failed
Hi Sujit,
Thanks for your response.
So I opened a new notebook using the command ipython notebook --profile
spark and tried the sequence of commands. I am getting errors. Attached is
a screenshot of the same.
Also I am attaching the 00-pyspark-setup.py for your reference. Looks
like I have
There are several levels of indirection going on here, let me clarify.
In local mode, Spark runs tasks (which include receivers) using the
number of threads defined in the master (either local, or local[2], or
local[*]).
local or local[1] = single thread, so only one task at a time
local[2]
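A minimal sketch of what that implies for a streaming app with a single receiver (a generic word-count-style setup; names are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  // at least 2 local threads: one for the receiver task, one to process the batches
  val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
  val ssc = new StreamingContext(conf, Seconds(1))
  val lines = ssc.socketTextStream("localhost", 9999)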
This is also discussed in the programming guide.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
On Wed, Jul 8, 2015 at 8:25 AM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
Thanks, Sean.
are you asking about foreach vs
Lots of places refer to RDD lineage; I'd like to know what it refers to
exactly. My understanding is that it means the RDD dependencies and the
intermediate MapOutput info in MapOutputTracker. Correct me if I am wrong.
Thanks
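As a small illustration of what the lineage captures, toDebugString prints the dependency chain of an RDD (a generic word-count pipeline; the shuffle introduced by reduceByKey shows up as a stage boundary):

  val counts = sc.textFile("input.txt")
    .flatMap(_.split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)
  println(counts.toDebugString)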