I have updated the config since I realized the actor system was listening
on driver port + 1, so I changed the ports in my program and in the docker images:
val conf = new SparkConf()
.setMaster(sparkMaster)
//.setMaster("local[2]")
.setAppName(sparkApp)
.set("spark.cassandra.connection.host",
Hi,
I am working in bioinformatics and trying to convert some scripts to
SparkR to fit into other Spark jobs.
I tried a simple example from a bioinformatics library, and as soon as I start
the sparkR environment it does not work.
code as follows -
countData <- matrix(1:100,ncol=4)
condition <-
Thanks, the behavior is now clear to me.
I tried with "foreachRDD" and indeed all partitions are being processed in
parallel.
I also tried using "saveAsTextFile" instead of print and again all partitions
were processed in parallel.
-Original Message-
From: Cody Koeninger
Hi,
Can somebody point out how I can configure custom logs for my Spark (Scala)
scripts,
so that I can see at which level my script failed and why?
Thanks,
Divya
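(For reference, a minimal sketch of one approach, assuming the log4j 1.x bundled with Spark 1.x; all names are placeholders:)
import org.apache.log4j.{Level, Logger}

object MyScript {                         // hypothetical script name
  val log = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    log.setLevel(Level.INFO)
    log.info("stage 1: loading input")    // marks progress so failures are localized
    try {
      // ... job logic ...
    } catch {
      case e: Exception =>
        log.error("stage 1 failed", e)    // records where the script failed, and why
        throw e
    }
  }
}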
Hi all,
While googling Spark, I stumbled upon a RESTful API in Spark
for submitting jobs.
The link is here, http://arturmkrtchyan.com/apache-spark-hidden-rest-api
As Josh said, I can see the history of this RESTful API,
https://issues.apache.org/jira/browse/SPARK-5388 and also
Hi All,
I have two tables with the same schema but different data. I have to join the
tables based on one column and then do a group by on the same column.
Now, the data in that column in the two tables might or might not exactly match. (Ex
- the column name is "title"; Table1.title = "doctor" and Table2.
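(A hedged sketch of one way to handle near-matches, normalizing the key before joining; this only covers case/whitespace differences, and true fuzzy matching, e.g. "doctor" vs "Dr.", would need a UDF or similar:)
import org.apache.spark.sql.functions.{col, lower, trim}

// normalize the key on both sides, then join and group on the normalized column
val t1 = table1.withColumn("title_norm", lower(trim(col("title"))))
val t2 = table2.withColumn("title_norm", lower(trim(col("title"))))
val grouped = t1.join(t2, "title_norm").groupBy("title_norm").count()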
Here is the PR : https://github.com/apache/spark/pull/11544
On Mon, Mar 14, 2016 at 7:26 PM, Ted Yu wrote:
> Please refer to JIRAs which were related to MiMa
> e.g.
> [SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x.
>
> It would be easier for other people to help
I guess it’s Jenkins’ problem? My PR failed MiMa but I still got a
message from SparkQA (https://github.com/SparkQA) saying that "This patch
passes all tests."
I checked Jenkins’ history; there are other PRs with the same issue….
Best,
--
Nan Zhu
http://codingcat.me
On Monday,
Please refer to JIRAs which were related to MiMa
e.g.
[SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x.
It would be easier for other people to help if you provide a link to your PR.
Cheers
On Mon, Mar 14, 2016 at 7:22 PM, Gayathri Murali <
gayathri.m.sof...@gmail.com> wrote:
> Hi All,
>
>
Hi All,
I recently submitted a patch (which was passing all tests) with some minor
modifications to an existing PR. This patch is failing MiMa tests. Locally
it passes all unit and style-check tests. How do I fix MiMa test failures?
Thanks
Gayathri
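(For what it's worth: if the MiMa failure reflects an intentional binary-incompatible change, my understanding is that the usual fix is an exclusion in project/MimaExcludes.scala; the class and method below are hypothetical, and the actual problem type and signature come straight from the MiMa error message:)
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.ml.SomeClass.someMethod")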
When I change to the default coarse-grained mode, it’s OK.
> On Mar 14, 2016, at 21:55, sjk wrote:
>
> Hi all, when I run a task on Mesos, I get the error below. Thanks a lot for any help.
>
>
> cluster mode, command:
>
> $SPARK_HOME/spark-submit --class com.xxx.ETL --master
>
This should work.
Create your sbt file first
cat PrintAllDatabases.sbt
name := "PrintAllDatabases"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.0"
libraryDependencies
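(The message is cut off here; given the Hive question below, the last line presumably adds the Hive module, something like:)
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.0"  // assumed: needed for HiveContext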
Hi all,
I have several Hive queries that work in spark-shell, but they don't work in
spark-submit. In fact, I can't even show all databases. The following works
in spark-shell:
import org.apache.spark._
import org.apache.spark.sql._
object ViewabilityFetchInsertDailyHive {
def main() {
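(The snippet is cut off; a minimal sketch of what a spark-submit-able version might look like. Two things differ from the shell: spark-shell creates a Hive-capable context for you, and spark-submit only finds a main with the standard signature:)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ViewabilityFetchInsertDailyHive {
  def main(args: Array[String]): Unit = {   // spark-submit needs main(args: Array[String])
    val sc = new SparkContext(new SparkConf().setAppName("ViewabilityFetchInsertDailyHive"))
    val sqlContext = new HiveContext(sc)    // a submitted app must create this itself
    sqlContext.sql("SHOW DATABASES").collect().foreach(println)
    sc.stop()
  }
}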
I see - so you want the dependencies pre-installed on the cluster nodes so they
do not need to be submitted along with the job jar?
Where are you planning on deploying/running spark? Do you have your own cluster
or are you using AWS/other IaaS/PaaS provider?
Somehow you’ll need to get the
I think "SPARK_WORKER_INSTANCES" is deprecated.
This should work: "export SPARK_EXECUTOR_INSTANCES=2"
Hi
I do not want to create a single jar that contains all the other dependencies,
because it would increase the size of my Spark job jar.
So I want to copy all the libraries to the cluster using some automation process,
just as I currently do with Chef.
But I am not sure whether that is the right method?
Andres,
A couple points:
1) If you look at my post, you can see that you could use Spark for
low-latency - many sub-second queries could be executed in under a
second, with the right technology. It really depends on "real time"
definition, but I believe low latency is definitely possible.
2)
Hello Ashok,
I found three sources on how shuffle works (and what transformations trigger
it) instructive and illuminating. After learning from them, you should be able to
extrapolate how your particular use case would work.
At least for simple queries, the DAGScheduler does not appear to be
the bottleneck - since we are able to schedule 700 queries, and all
the scheduling is probably done from the main application thread.
However, I did have high hopes for Sparrow. What was the reason they
decided not to include
Could you use netstat to show the ports that the driver is listening on?
On Mon, Mar 14, 2016 at 1:45 PM, David Gomez Saavedra
wrote:
> hi everyone,
>
> I'm trying to set up spark streaming using akka with a similar example of
> the word count provided. When using spark master
Experts,
please, I need to understand how shuffling works in Spark and which parameters
influence it.
I am sorry, but my knowledge of shuffling is very limited. A practical use
case would help, if you can.
regards
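(For a concrete starting point: narrow transformations like map and filter stay within a partition, while wide ones like reduceByKey, groupByKey, join, and distinct must move records between partitions, and that movement is the shuffle. A tiny sketch, where lines is a placeholder RDD[String]:)
// reduceByKey repartitions the data by key, which is what triggers the shuffle;
// the second argument fixes the number of post-shuffle partitions
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1L)).reduceByKey(_ + _, 200)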
Have you tried setting the configuration
`spark.executor.extraLibraryPath` to point to a location where your
.so's are available? (Not sure if non-local files, such as HDFS, are
supported)
On Mon, Mar 14, 2016 at 2:12 PM, Tristan Nixon wrote:
> What build system are you
What build system are you using to compile your code?
If you use a dependency management system like maven or sbt, then you should be
able to instruct it to build a single jar that contains all the other
dependencies, including third-party jars and .so’s. I am a maven user myself,
and I use the
On Mon, Mar 14, 2016 at 1:30 PM, Prabhu Joseph
wrote:
>
> Thanks for the recommendation. But can you share what improvements were
> made after Spark 1.2.1, and which of them specifically handle the
> issue observed here?
>
Memory used for query execution is
Hi
Thanks for the information .
But my problem is: if I want to write a Spark application which depends on
third-party libraries like OpenCV, what is the best approach to
distribute all the .so and jar files of OpenCV to all cluster nodes?
Regards
Prateek
hi everyone,
I'm trying to set up Spark Streaming using Akka, based on the provided
word count example. When using the Spark master in local mode everything
works, but when I try to run the driver and executors using Docker I get
the following exception:
16/03/14 20:32:03 WARN
Hey,
I'm using this setup in a single m4.4xlarge node in order to utilize it :
https://github.com/gettyimages/docker-spark/blob/master/docker-compose.yml
but setting :
SPARK_WORKER_INSTANCES: 2
SPARK_WORKER_CORES: 2
still creates only one worker: one JVM process that utilizes up
I saw that JIRA too. But not sure if they are related, since the JIRA
mentioned "I got an exception when accessing the below REST API with an
unknown application Id.". While in my case, a "known" application ID was
supplied. Anyway, I guess I can try 1.6.1 to double check. Thanks, Ted!
On Mon,
Michael,
Thanks for the recommendation. But can you share what improvements were
made after Spark 1.2.1, and which of them specifically handle the
issue observed here?
On Tue, Mar 15, 2016 at 12:03 AM, Jörn Franke wrote:
> I am not sure about this. At least
See the following which is in 1.6.1:
[SPARK-12399] Display correct error message when accessing REST API with an
unknown app Id
On Mon, Mar 14, 2016 at 1:16 PM, Boric Tan wrote:
> I was using 1.6.0. Sorry I forgot to mention that.
>
> The full stack is shown below.
>
I was using 1.6.0. Sorry I forgot to mention that.
The full stack is shown below.
HTTP ERROR 500
Problem accessing /api/v1/applications/application_1457544696648_0002/jobs.
Reason:
Server Error
Caused by:
org.spark-project.guava.util.concurrent.UncheckedExecutionException:
Hello,
When I used to submit a job with Spark 1.4, it would return a job ID and
a status: RUNNING, FAILED, or something like this. I just upgraded to 1.6 and
there is no status returned by spark-submit. Is there a way to get this
information back?
When I submit a job I want to know which one it
Which Spark release do you use?
For NoSuchElementException, was there anything else in the stack trace?
Thanks
On Mon, Mar 14, 2016 at 12:12 PM, Boric Tan
wrote:
> Hi there,
>
> I was trying to access application information with REST API. Looks like
> the
> top
Hi there,
I was trying to access application information with the REST API. Looks like the
top-level application information can be retrieved successfully, as shown below.
But jobs/stages information cannot be retrieved; an exception was returned.
Anyone have any ideas on how to fix it? Thanks!
Top
So I'm attempting to pre-compute my data such that I can pull an RDD from a
checkpoint. However, I'm finding that upon running the same job twice the
system is simply recreating the RDD from scratch.
Here is the code I'm implementing to create the checkpoint:
def
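(The code is cut off here; for reference, a minimal sketch of the usual pattern, with placeholder paths. Note that, as far as I know, plain RDD checkpoints are per-application: a second run of the same job will not pick them up automatically, which would explain the recomputation.)
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // placeholder path
val rdd = sc.textFile("hdfs:///data/input")      // placeholder computation
  .map(_.length)
rdd.checkpoint()                                 // must be called before any action
rdd.count()                                      // an action materializes the checkpoint files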
I am not sure about this. At least Hortonworks provides its distribution with
Hive and Spark 1.6
> On 14 Mar 2016, at 09:25, Mich Talebzadeh wrote:
>
> I think the only version of Spark that works OK with Hive (Hive on Spark
> engine) is version 1.3.1. I also get
Hi Iain,
Thanks for your reply. Actually I changed my trackStateFunc, and it's working
now.
For reference my working code with mapWithState:
def trackStateFunc(batchTime: Time, key: String, value: Option[Array[Long]],
    state: State[Array[Long]]): Option[(String, Array[Long])] = {
  // Check if
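(The body is cut off above; purely as a sketch, a typical continuation might look like the following, where the array width and the element-wise merge rule are assumptions:)
if (state.isTimingOut()) {
  None                                            // key is being evicted; emit nothing
} else {
  val current = state.getOption().getOrElse(Array.fill(11)(0L))   // 11 is a placeholder width
  val updated = value.map(v => current.zip(v).map { case (a, b) => a + b })
                     .getOrElse(current)
  state.update(updated)
  Some((key, updated))
}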
+1 to upgrading Spark. 1.2.1 has none of the memory-management improvements
that were added in 1.4-1.6.
On Mon, Mar 14, 2016 at 2:03 AM, Prabhu Joseph
wrote:
> The issue is that the query hits OOM on a stage when reading shuffle output
> from the previous stage. How come
>
> Each json file is of a single object and has the potential to have
> variance in the schema.
>
How much variance are we talking? JSON->Parquet is going to do well with
100s of different columns, but at 10,000s many things will probably start
breaking.
Yeah, sorry. I'll make sure this gets fixed.
On Mon, Mar 14, 2016 at 12:48 AM, Sean Owen wrote:
> Yeah I can't seem to download any of the artifacts via the direct download
> / cloudfront URL. The Apache mirrors are fine, so use those for the moment.
> @marmbrus were you
Something like below ...
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError:
Java heap space
at org.apache.spark.util.io.ByteArrayChunkOutputStream.allocateNewChunkIfNeeded(ByteArrayChunkOutputStream.scala:66)
In Hadoop 2.5.1 as shipped with MapR 4.1.0, the virtual memory checker is disabled while
the physical memory checker is enabled by default.
Since on Centos/RHEL 6 there are aggressive allocation of virtual memory
due to OS behavior, you should disable virtual memory checker or increase
Hello there,
I am trying to write a Spark program that loads multiple
JSON files (with undefined schemas) into a dataframe and then writes it out
to a Parquet file. When doing so, I am running into a number of garbage-collection
issues as a result of my JVM running out of heap
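(For context, the pattern being described is roughly the following, with placeholder paths:)
val df = sqlContext.read.json("hdfs:///data/json/*")   // schema is inferred by unioning over all objects
df.write.mode("overwrite").parquet("hdfs:///data/out") // placeholder output path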
Sounds like the jar you built doesn't include the dependencies (in
this case, the spark-streaming-kafka subproject). When you use
spark-submit to submit a job to spark, you need to either specify all
dependencies as additional --jars arguments (which is a pain), or
build an uber-jar containing
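(For the uber-jar route with sbt-assembly, a sketch of the relevant build.sbt lines; versions are placeholders:)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.0" % "provided",  // supplied by the cluster
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0"                // bundled into the uber-jar
)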
So what's happening here is that print() uses take(). take() will try
to satisfy the request using only the first partition of the RDD, then
use other partitions if necessary.
If you change to using something like foreach
processed.foreachRDD(new VoidFunction() {
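(The Java snippet is cut off; the same idea in Scala:)
// foreach visits every partition, unlike print()/take()
processed.foreachRDD { rdd =>
  rdd.foreach(record => println(record))   // println runs on the executors
}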
Hi all,
For each record I’m processing in a Spark streaming app (written in Java) I
need to take over 30 datapoints.
The output of my map would be something like:
KEY1,1,0,1,0,30,1,1,1,1,0,30,…
KEY1,0,1,1,0,15,1,1,1,1,0,28,…
KEY2,0,1,1,0,22,1,1,1,1,0,0,…
And I want to end up with:
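(The desired output is cut off above; if the goal is a per-key, element-wise aggregation of the 30+ datapoints, a hedged sketch, assuming the mapped output is a DStream[(String, Array[Int])] called pairs:)
val summed = pairs.reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })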
What do you mean? I've pasted the output in the same format used by Spark...
Mine is the same scenario. I get the HDFS_DELEGATION_TOKEN issue exactly
7 days after the Spark job started, and it then gets killed.
I'm also looking for a solution.
Regards,
Nik.
On Fri, Mar 11, 2016 at 8:10 PM, Ruslan Dautkhanov
wrote:
Hi,
I have an issue using Spark Streaming with a Spark Standalone cluster: my
job is submitted fine, but the workers seem to be unreachable. To build the
project I'm using sbt-assembly. My version of Spark is 1.6.0.
Here is my streaming conf:
val sparkConf = new SparkConf()
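(The conf is cut off above; a sketch of a typical standalone setup with placeholder names. When workers appear unreachable, it is often worth checking that spark.driver.host resolves from the worker side:)
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setMaster("spark://master-host:7077")    // placeholder
  .setAppName("MyStreamingJob")             // placeholder
  .set("spark.driver.host", "driver-host")  // must be reachable from the workers
val ssc = new StreamingContext(sparkConf, Seconds(5))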
Summarizing an offline message:
The following worked for Divya:
dffiltered = dffiltered.unionAll(dfresult.filter ...
On Mon, Mar 14, 2016 at 5:54 AM, Lohith Samaga M
wrote:
> If all sql results have same set of columns you could UNION all the
> dataframes
>
> Create
Hi all, when I run a task on Mesos, I get the error below. Thanks a lot for any help.
cluster mode, command:
$SPARK_HOME/spark-submit --class com.xxx.ETL --master
mesos://192.168.191.116:7077 --deploy-mode cluster --supervise --driver-memory
2G --executor-memory 10G --total-executor-cores 4
Steve & Adam,
I would be interested in hearing the outcome here as well. I am seeing
some similar issues in my 1.4.1 pipeline, using stateful functions
(reduceByKeyAndWindow and updateStateByKey).
Regards,
Bryan Jeffrey
On Mon, Mar 14, 2016 at 6:45 AM, Steve Loughran
If all the sql results have the same set of columns, you could UNION all the dataframes:
create an empty df and union all,
then reassign the new df to the original df before the next union all.
Not sure if it is a good idea, but it works.
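(A sketch of that pattern in the Spark 1.x DataFrame API; schema and sqlResults are placeholders:)
import org.apache.spark.sql.Row

// start from an empty DataFrame with the shared schema, then fold the results in
var combined = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
for (df <- sqlResults) {
  combined = combined.unionAll(df)
}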
Lohith
Sent from my Sony Xperia™ smartphone
Divya Gehlot wrote
Hi,
Hi All,
I am using Spark 1.6 and PySpark.
I am trying to build a Random Forest classifier model using the ML Pipeline API
in Python.
When I try to print the model I get the value below.
RandomForestClassificationModel (uid=rfc_be9d4f681b92) with 10 trees
When I use MLLIB RandomForest model
Hello all,
I have some doubts regarding performance tuning of my pipeline. I am trying
to achieve the following:
1. Consume from Kafka in 2 sec batches, filter it and remove 95% of data,
which comes down to around 4K messages/sec
2. Maintain keys (strings) by frequency over a moving window of
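(For step 2, a hedged sketch using reduceByKeyAndWindow; the window sizes are placeholders, and the inverse-function variant updates counts incrementally as the window slides at the cost of requiring checkpointing:)
import org.apache.spark.streaming.{Minutes, Seconds}

val keyFreq = filtered.map(key => (key, 1L))
  .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2))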
Hi,
Can you please try to show the stack trace line by line, because it's a bit
difficult to read the entire paragraph and make sense out of it.
On Mon, Mar 14, 2016 at 3:11 PM, adamreith [via Apache Spark User List] <
ml-node+s1001560n26479...@n3.nabble.com> wrote:
> Hi,
>
> I'm using spark
Hi Xi Shen,
Changing the initialization step from "kmeans||" to "random" decreased
the execution time from 2 hrs to 6 min. However, by default the number of runs
is 1. If I try to set the number of runs to 10, then I again see an increase in
job execution time.
How should I proceed on this?
By the way, how
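(The message is cut off; for reference, the knobs being discussed in the MLlib 1.x API, with placeholder values. Since each run is an independent restart of the whole algorithm, runs = 10 costing roughly 10x a single run is expected:)
import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(10)                                  // placeholder k
  .setInitializationMode(KMeans.RANDOM)      // vs. KMeans.K_MEANS_PARALLEL ("kmeans||")
  .setRuns(1)                                // each additional run redoes the full clustering
  .setMaxIterations(20)
  .run(data)                                 // data: RDD[Vector]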
> On 14 Mar 2016, at 09:41, adamreith wrote:
>
> I dumped the heap of the driver process, and it seems that 486.2 MB of the 512 MB of
> available memory is used by an instance of the class
> org.apache.spark.deploy.yarn.history.YarnHistoryService. I'm trying to
> figure out
On 11 Mar 2016, at 23:01, Alexander Pivovarov
> wrote:
Forgot to mention: to avoid unnecessary container termination, add the following
setting to YARN:
yarn.nodemanager.vmem-check-enabled = false
That can kill performance on a shared cluster:
Hi,
I'm using Spark 1.4.1 and I have a simple application that creates a DStream
that reads data from Kafka and applies a filter transformation to it. After
more or less a day it throws the following exception:
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError:
Java heap space
For uniform partitioning, you can try custom Partitioner.
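(A minimal sketch of what that looks like; the hash-spreading logic is just an example and assumes keys hash reasonably evenly:)
import org.apache.spark.Partitioner

class UniformPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h     // keep the index non-negative
  }
}
// usage: pairRdd.partitionBy(new UniformPartitioner(256))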
Dear All,
I am facing a problem with the Spark Twitter Streaming code: whenever twitter4j
throws an exception, I am unable to catch it. Could anyone help me
catch that exception?
Here is Pseudo Code:
SparkConf sparkConf = new
SparkConf().setMaster("local[2]").setAppName("Test");
//
The issue is that the query hits OOM on a stage when reading shuffle output from
the previous stage. How come increasing shuffle memory helps to avoid the OOM?
On Mon, Mar 14, 2016 at 2:28 PM, Sabarish Sasidharan wrote:
> Thats a pretty old version of Spark SQL. It is devoid of all
Hi Team,
I am getting the below exceptions while running the Spark Java streaming job
with a custom receiver.
org.apache.spark.SparkException: Job aborted due to stage failure: Failed
to serialize task 508, not attempting to retry it. Exception during
serialization: java.io.IOException:
That's a pretty old version of Spark SQL. It is devoid of all the
improvements introduced in the last few releases.
You should try bumping your spark.sql.shuffle.partitions to a value higher
than default (5x or 10x). Also increase your shuffle memory fraction as you
really are not explicitly
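(Concretely, that advice might look like this; the values are illustrative, not recommendations, and conf is the job's SparkConf:)
sqlContext.setConf("spark.sql.shuffle.partitions", "1000")   // default is 200
// pre-1.6 legacy memory settings:
conf.set("spark.shuffle.memoryFraction", "0.4")              // default 0.2
conf.set("spark.storage.memoryFraction", "0.4")              // lower it if you are not caching much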
It is Spark SQL, and the version used is Spark 1.2.1.
On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> I believe the OP is using Spark SQL and not Hive on Spark.
>
> Regards
> Sab
>
> On Mon, Mar 14, 2016 at 1:55 PM, Mich Talebzadeh <
>
> I am trying to install spark on EC2.
>
> I am getting the below error. I had issues like RPC timeout and fetch timeout
> for Spark 1.6.0, so as per the release notes I was trying to get a new cluster with
> 1.6.1
>
> Can you help? looks like spark 1.6.1 package is missing from s3.
>
> [timing] scala init:
I believe the OP is using Spark SQL and not Hive on Spark.
Regards
Sab
On Mon, Mar 14, 2016 at 1:55 PM, Mich Talebzadeh
wrote:
> I think the only version of Spark that works OK with Hive (Hive on Spark
> engine) is version 1.3.1. I also get OOM from time to time and
For RDD you can use flatMap, for DataFrames explode would be the best fit.
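(A quick sketch of both; "values" is a placeholder column name:)
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.explode

// RDD route: flatMap flattens each Array[Double] into its elements
val doubles: RDD[Double] = arrays.flatMap(a => a)

// DataFrame route: explode yields one row per array element
val exploded = df.select(explode(df("values")).as("value"))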
On 14 March 2016 at 08:28, lizhenm...@163.com wrote:
>
> hi:
> I want to convert the RDD[Array[Double]] to RDD[Double]. For example,
> the file stores 1.0 2.0 3.0 on one line; how do I read
>
>
hi:
I want to convert an RDD[Array[Double]] to an RDD[Double]. For example, the file stores
1.0 2.0 3.0
4.0 5.0 6.0
How do I read the file and convert it?
I think the only version of Spark that works OK with Hive (Hive on Spark
engine) is version 1.3.1. I also get OOM from time to time and have to
revert to using MR
Dr Mich Talebzadeh
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Which version of Spark are you using? The configuration varies by version.
Regards
Sab
On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph
wrote:
> Hi All,
>
> A Hive join query which runs fine and faster in MapReduce takes a lot of
> time with Spark and finally fails
Yeah I can't seem to download any of the artifacts via the direct download
/ cloudfront URL. The Apache mirrors are fine, so use those for the moment.
@marmbrus were you maybe the last to deal with these artifacts during the
release? I'm not sure where they are or how they get uploaded or I'd look
Anyone have any idea? Or should I raise a bug for that?
Thanks,
Shams
On Fri, Mar 11, 2016 at 3:40 PM, Shams ul Haque wrote:
> Hi,
>
> I want to kill a Spark Streaming job gracefully, so that whatever Spark
> has picked up from Kafka gets processed. My Spark version is:
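(For reference, a sketch of the graceful-stop API available in the 1.x line:)
// waits for received data to finish processing before shutting down
ssc.stop(stopSparkContext = true, stopGracefully = true)
// alternatively, set spark.streaming.stopGracefullyOnShutdown=true (1.4+)
// and send the driver a SIGTERM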
Hi Friends,
Can anyone help me with how to terminate a Spark job in Eclipse using
Java code?
Thanks
Soniya
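(Not a full answer, but the programmatic hooks live on SparkContext; from Java they are reachable via JavaSparkContext#sc():)
sc.cancelAllJobs()   // cancel everything currently running on this context
sc.stop()            // shut the context down entirely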