will still move to Databricks Cloud, which has far more
features than that. Many influential projects already depend on the
routinely published Scala REPL (e.g. playFW); it would be strange for
Spark not to do the same.
What do you think?
Yours Peng
On 12/22/2014 04:57 PM, Sean Owen wrote:
Me 2 :)
On 12/22/2014 06:14 PM, Andrew Ash wrote:
Hi Xiangrui,
That link is currently returning a 503 Over Quota error message.
Would you mind pinging back out when the page is back up?
Thanks!
Andrew
On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com
unsubscribe
I wasn't using Spark SQL before.
But by default Spark should retry a failed task 4 times.
I'm curious why it aborted after 1 failure.
I speculate that Spark will only retry on exceptions that are registered with
TaskSetScheduler, so a definitely-will-fail task will fail quickly without
taking more resources. However I haven't found any documentation or web page
on it
I think these failed tasks must have been retried automatically if you can't see any
error in your results. Otherwise the entire application will throw a
SparkException and abort.
Unfortunately I don't know how to do this; my application always aborts.
parameters scattered in two different places (masterURL and
$spark.task.maxFailures).
I'm thinking of adding a new config parameter
$spark.task.maxLocalFailures to override 1. What do you think?
Thanks again buddy.
Yours Peng
On Mon 09 Jun 2014 01:33:45 PM EDT, Aaron Davidson wrote:
Looks like
Oh, and to make things worse, they forgot '\*' in their regex.
Am I the first to encounter this problem?
On Mon 09 Jun 2014 02:24:43 PM EDT, Peng Cheng wrote:
Thanks a lot! That's very responsive, somebody definitely has
encountered the same problem before, and added two hidden modes
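If I'm reading the master URL parsing correctly, the hidden form takes a second number for the task retry count, roughly like this (a sketch, assuming the local[threads,maxFailures] syntax; plain local[N] hard-codes 1):

import org.apache.spark.{SparkConf, SparkContext}

// Plain "local[4]" allows only 1 task failure, so a single failure aborts the job.
// The second number below is the number of task failures tolerated.
val conf = new SparkConf()
  .setAppName("local-retry-sketch")
  .setMaster("local[4,4]") // 4 threads, tasks retried up to 4 times
val sc = new SparkContext(conf)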
Hi Matei, Yeah you are right, this is very niche (my use case is a
web crawler), but I'm glad you also like an additional property. Let me
open a JIRA.
Yours Peng
On Mon 09 Jun 2014 03:08:29 PM EDT, Matei Zaharia wrote:
If this is a useful feature for local mode, we should open a JIRA
My transformations or actions have some external toolset dependencies, and
sometimes they just get stuck somewhere and there is no way I can fix them. If I
don't want the job to run forever, do I need to implement several monitor
threads that throw an exception when they get stuck? Or the framework can
I've tried enabling speculative execution; this seems to have partially solved the
problem. However I'm not sure if it can handle large-scale situations, as it
only starts when 75% of the job is finished.
Thanks a lot! Let me check my maven shade plugin config and see if there is a
fix
Indeed I see a lot of duplicate package warnings in the maven-shade assembly
package output, so I tried to eliminate them:
First I set the scope of the apache-spark dependency to 'provided', as suggested
on this page:
http://spark.apache.org/docs/latest/submitting-applications.html
But spark master
Latest advancement:
I found the cause of the NoClassDef exception: I wasn't using spark-submit;
instead I tried to run the spark application directly with SparkConf set in
the code (this is handy in local debugging). However the old problem
remains, even though my maven-shade plugin doesn't give any warning.
I also found that any buggy application submitted in --deploy-mode = cluster
mode will crash the worker (turn its status to 'DEAD'). This shouldn't really
happen, otherwise nobody would use this mode. It is yet unclear whether all
workers will crash or only the one running the driver will (as I only
Hi Sean,
OK, I'm about 90% sure about the cause of this problem: just another classic
dependency conflict:
My project -> Selenium -> apache.httpcomponents:httpcore 4.3.1 (has ContentType)
Spark -> Spark SQL Hive -> Hive -> Thrift -> apache.httpcomponents:httpcore 4.1.3 (has no ContentType)
Though I
Right, problem solved in a most disgraceful manner: just add a package
relocation in the maven shade config.
The downside is that it is not compatible with my IDE (IntelliJ IDEA); it will
cause:
Error:scala.reflect.internal.MissingRequirementError: object scala.runtime
in compiler mirror not found.:
I'm trying to link a spark slave with an already-setup master, using:
$SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-32-12:7077
However the result shows that it cannot open a log file it is supposed to
create:
failed to launch org.apache.spark.deploy.worker.Worker:
tail: cannot open
I haven't set up a passwordless login from the slave to the master node yet (I was
under the impression that this is not necessary since they communicate using
port 7077).
make sure all queries are called through class methods and wrap your query
info with a class having only simple properties (strings, collections, etc.).
If you can't find such a wrapper you can also use the SerializableWritable wrapper
out of the box, but it's not recommended (developer API and makes fat
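Something like this is what I mean (a sketch; all names are made up for illustration):

import org.apache.spark.rdd.RDD

// Hypothetical wrapper: only simple, serializable fields, so shipping it to
// executors is safe. The heavyweight client is built inside run(), on the
// executor, and never captured by the closure.
case class QueryInfo(table: String, filter: String) {
  def run(): Seq[String] = {
    // ... construct the non-serializable client here and execute the query ...
    Seq.empty
  }
}

def runQueries(queries: RDD[QueryInfo]): RDD[String] =
  queries.flatMap(_.run()) // only QueryInfo instances cross the wire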
I've read somewhere that in 1.0 there is a bash tool called 'spark-config.sh'
that allows you to propagate your config files to a number of master and
slave nodes. However I haven't used it myself.
I got a 'NoSuchFieldError' which is of the same type. It's definitely a
dependency jar conflict. The spark driver will load its own jars, which in
recent versions pull in many dependencies that are 1-2 years old. And if your
newer-version dependency is in the same package it will be shadowed (Java's
first come
I'm afraid persisting a connection across two tasks is a dangerous act, as they
can't be guaranteed to be executed on the same machine. Your ES server may
think it's a man-in-the-middle attack!
I think it's possible to invoke a static method that gives you a connection from
a local 'pool', so nothing
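Roughly what I have in mind (a sketch; the client type and URL are placeholders, not a real ES API):

import org.apache.spark.rdd.RDD

// Placeholder client type, just so the sketch is self-contained; swap in the
// real Elasticsearch client.
class EsClient(url: String) {
  def index(doc: String): Unit = println(s"indexing to $url: $doc")
}

// A singleton object is initialized at most once per executor JVM, so every
// task on that executor reuses the same local connection.
object EsClientHolder {
  lazy val client: EsClient = new EsClient("http://es-host:9200") // placeholder URL
}

def indexAll(docs: RDD[String]): Unit =
  docs.foreachPartition { part =>
    val client = EsClientHolder.client // resolved on the executor, not the driver
    part.foreach(client.index)
  }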
I'm deploying a cluster to Amazon EC2, trying to override its internal IP
addresses with public DNS names.
I start the cluster with the environment parameter: SPARK_PUBLIC_DNS=[my EC2
public DNS]
But it doesn't change anything on the web UI; it still shows the internal IP
address
Spark Master at
Sorry, I just realized that start-slave is for a different task. Please close
this.
Application ID: app-20140625083158-  Name: org.tribbloid.spookystuff.example.GoogleImage$  Cores: 2
Memory per Node: 512.0 MB  Submitted Time: 2014/06/25 08:31:58  User: peng  State: RUNNING  Duration: 17 min
However when submitting the job in client mode:
$SPARK_HOME/bin/spark-submit \
--class
Totally agree. Also there is a class 'SparkSubmit' you can call directly to
replace the shell script.
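Roughly like this (a sketch; SparkSubmit is internal, so treat it as a hack rather than a supported API, and all values below are placeholders):

import org.apache.spark.deploy.SparkSubmit

object SubmitProgrammatically {
  def main(args: Array[String]): Unit = {
    // Roughly equivalent to invoking bin/spark-submit from a shell; be aware
    // the internal submit code may call System.exit on failure.
    SparkSubmit.main(Array(
      "--master", "spark://master-host:7077",
      "--class", "com.example.MyApp",
      "/path/to/my-app.jar"
    ))
  }
}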
Expanded to 4 nodes and changed the workers to listen on the public DNS, but it still
shows the same error (which is obviously wrong). I can't believe I'm the
first to encounter this issue.
I give up; communication must be blocked by the complex EC2 network topology
(though the error information indeed needs some improvement). It doesn't make
sense to run a client thousands of miles away that communicates frequently with the
workers. I have moved everything to EC2 now.
?
Yours Peng
That would be really cool with IPython, but I'm still wondering if all
language features are supported; namely I need these 2 in particular:
1. importing classes and ILoop from external jars (so I can point it to
SparkILoop or the Spark-bindings ILoop of Apache Mahout instead of Scala's default
ILoop)
2.
). This
can be useful sometimes but may cause confusion at other times (people can
no longer add persist at will just for backup, because it may change the
result).
So far I've found no documentation supporting this feature. So can someone
confirm that it's a deliberately designed feature?
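Here is a tiny reproduction of the effect I'm describing (a sketch; the exact outcome depends on recomputation, so treat it as illustrative):

import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object PersistChangesResult {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-changes-result").setMaster("local[2]"))

    // Non-deterministic RDD: every recomputation yields different numbers.
    val noisy = sc.parallelize(1 to 10).map(_ => Random.nextInt(1000))

    // Without persist, each action recomputes the map, so these usually differ.
    println(noisy.reduce(_ + _))
    println(noisy.reduce(_ + _))

    // With cache, the first action materializes the values and the second
    // reads the cached copy, so the two results agree.
    val cached = sc.parallelize(1 to 10).map(_ => Random.nextInt(1000)).cache()
    println(cached.reduce(_ + _))
    println(cached.reduce(_ + _))

    sc.stop()
  }
}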
Yours Peng
Yeah, thanks a lot. I know that for people who understand lazy execution this seems
straightforward, but for those who don't it may become a liability.
I've only tested its stability on a small example (which seems stable);
hopefully it's not just a coincidence. Can a committer confirm this?
Yours Peng
Unfortunately, after some research I found it's just a side effect of how
closures containing a var work in Scala:
http://stackoverflow.com/questions/11657676/how-does-scala-maintains-the-values-of-variable-when-the-closure-was-defined
the closure keeps referring to the var holding the broadcast wrapper as a pointer,
On Wed, Sep 3, 2014 at 2:33 PM, Kevin Peng kpe...@gmail.com wrote:
Ted,
The hbase-site.xml is in the classpath (had worse issues before... until
I figured out that it wasn't in the path).
I get the following error in the spark-shell:
org.apache.spark.SparkException: Job aborted due to stage
and deep graph-following extraction. Please
drop me a line if you have a use case, as I'll try to integrate it as a
feature.
Yours Peng
Sean,
Thanks. That worked.
Kevin
On Mon, Sep 15, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote:
This is more of a Java / Maven issue than Spark per se. I would use
the shade plugin to remove signature files in your final META-INF/
dir. As Spark does, in its configuration:
filters
Hi,
We have a cluster setup with spark 1.0.2 running 4 workers and 1 master,
with 64G of RAM each. In the sparkContext we specify 32G of executor memory.
However, as long as a task runs longer than approximately 15 minutes, all
the executors are lost, just like some sort of timeout, no matter if the
While Spark already offers support for asynchronous reduce (collecting data from
workers without interrupting execution of a parallel transformation)
through accumulators, I have made little progress on making this process
reciprocal, namely, broadcasting data from the driver to the workers to be used by
Any suggestions? I'm thinking of submitting a feature request for mutable
broadcast. Is it doable?
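The closest workaround I know of is re-pointing a driver-side var at a fresh Broadcast between jobs (a sketch; not true mutable broadcast, since executors only see the new value when a later job ships the closure again):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object RebroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rebroadcast-sketch").setMaster("local[2]"))

    // The var holds the broadcast wrapper; closures capture the var, so each
    // new job is serialized with whatever Broadcast the var points to at that time.
    var lookup: Broadcast[Map[String, Int]] = sc.broadcast(Map("a" -> 1))
    val keys = sc.parallelize(Seq("a", "b"))

    def resolve(): Array[Int] = keys.map(k => lookup.value.getOrElse(k, -1)).collect()

    println(resolve().toSeq) // first broadcast: "b" is unknown

    lookup.unpersist()                             // drop stale copies on executors
    lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // re-point the var

    println(resolve().toSeq) // the next job ships the new broadcast

    sc.stop()
  }
}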
Looks like the only way is to implement that feature. There is no way of
hacking it into working
distinct records (one positive and one
negative); the others are all duplicates.
Does anyone have any idea why it takes so long on this small dataset?
Thanks,
Best,
Peng
// val labelAndPreds = testParsedData.map { point =>
//   val prediction = model.predict(point.features)
//   (point.label, prediction)
// }
// val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count
// println("Training Error = " + trainErr)
println(Calendar.getInstance().getTime())
}
}
Thanks,
Best,
Peng
On Thu, Oct 30, 2014 at 1:23 PM
Hi Xiangrui,
Can you give me a code example of caching, as I am new to Spark.
Thanks,
Best,
Peng
On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng men...@gmail.com wrote:
Then caching should solve the problem. Otherwise, it is just loading
and parsing data from disk for each iteration
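For example, something like this (a sketch in Scala; in pyspark the equivalent is simply calling cache() on the parsed RDD before training):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Parse once, cache, then every SGD iteration reads from memory instead of
// re-reading and re-parsing the input file.
def trainCached(sc: SparkContext, path: String) = {
  val parsed = sc.textFile(path).map { line =>
    val values = line.split(',').map(_.toDouble)
    LabeledPoint(values.head, Vectors.dense(values.tail))
  }.cache() // without this, each iteration would re-run textFile + parsing

  LogisticRegressionWithSGD.train(parsed, 100) // 100 iterations over the cached RDD
}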
Thanks Jimmy.
I will have a try.
Thanks very much for your guys' help.
Best,
Peng
On Thu, Oct 30, 2014 at 8:19 PM, Jimmy ji...@sellpoints.com wrote:
sampleRDD.cache()
Sent from my iPhone
On Oct 30, 2014, at 5:01 PM, peng xia toxiap...@gmail.com wrote:
Hi Xiangrui,
Can you give me some
: this error doesn't always happen; sometimes the old
Seq[Page] is retrieved properly, sometimes it throws the exception. How could
this happen and how do I fix it?
Yours Peng
Sorry, it's a timeout duplicate; please remove it.
In my project I define a new RDD type that wraps another RDD and some
metadata. The code I use is similar to the FilteredRDD implementation:
case class PageRowRDD(
self: RDD[PageRow],
@transient keys: ListSet[KeyLike] = ListSet()
){
override def getPartitions:
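The general shape is below, with a generic element type since PageRow and KeyLike are specific to my project (a stripped-down sketch):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// FilteredRDD-style wrapper: partitioning and computation are delegated to the
// parent; extra driver-side metadata would hang off this class (omitted here).
class WrapperRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](prev) {

  override def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context)
}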
Everything else is there except spark-repl. Can someone check that out this
weekend?
IMHO: cache doesn't provide redundancy, and it's in the same JVM, so it's much
faster.
I was under the impression that ALS wasn't designed for it :- The famous
eBay online recommender uses SGD.
However, you can try using the previous model as a starting point, and
gradually reduce the number of iterations after the model stabilizes. I never
verified this idea, so you need to at least
Hi Everyone,
Is LogisticRegressionWithSGD in MLlib scalable?
If so, what is the idea behind the scalable implementation?
Thanks in advance,
Peng
-
Peng Zhang
I got the same problem; maybe the Java serializer is unstable.
I'm talking about RDD1 (not persisted or checkpointed) in this situation:

...(somewhere) -> RDD1 -> RDD2
                   |        |
                   v        v
                  RDD3 -> RDD4 -> Action!

To my experience the change RDD1 get
You are right. I've checked the overall stage metrics and it looks like the
largest shuffle write is over 9G. The partition completed successfully
but its spilled file can't be removed until all the others are finished.
It's very likely caused by a stupid mistake in my design. A lookup table
grows
I'm running a small job on a cluster with 15G of mem and 8G of disk per
machine.
The job always gets into a deadlock where the last error message is:
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at
to distribute the parameters. Haven't thought thru yet.
Cheers
k/
On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote:
Does it makes sense to use Spark's actor system (e.g. via
SparkContext.env.actorSystem) to create parameter server?
On Fri, Jan 9, 2015 at 10:09 PM, Peng
You are not the first :) and probably not the fifth to have this question.
A parameter server is not included in the Spark framework and I've seen all kinds
of hacks to improvise one: a REST API, HDFS, Tachyon, etc.
Not sure if an 'official' benchmark implementation will be released soon.
On 9 January 2015
I double-checked the 1.2 feature list and found out that the new sort-based
shuffle manager has nothing to do with HashPartitioner :- Sorry for the
misinformation.
On the other hand, this may explain the increase in shuffle spill as a side effect
of the new shuffle manager; let me revert
not related to Spark 1.2.0's new features
Yours Peng
Hi Sean,
Thanks very much for your reply.
I tried to configure it with the code below:
sf = SparkConf().setAppName("test").set("spark.executor.memory", "45g").set("spark.cores.max", "62").set("spark.local.dir", "C:\\tmp")
But I still get the error.
Do you know how I can configure this?
Thanks,
Best,
Peng
On Sat, Mar
And I have 2 TB of free space on the C drive.
On Sat, Mar 14, 2015 at 8:29 PM, Peng Xia sparkpeng...@gmail.com wrote:
Hi Sean,
Thank very much for your reply.
I tried to config it from below code:
sf = SparkConf().setAppName(test).set(spark.executor.memory,
45g).set(spark.cores.max, 62),set
Hi David,
You can try local-cluster mode.
The numbers in local-cluster[2,2,1024] mean 2 workers, 2 cores per worker,
and 1024 MB of memory per worker.
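For example (a sketch; as far as I know local-cluster mode launches real worker JVMs and needs a full Spark build available locally, since it is mostly used by Spark's own tests):

import org.apache.spark.{SparkConf, SparkContext}

// local-cluster[workers, coresPerWorker, memoryPerWorkerMB]
val conf = new SparkConf()
  .setAppName("local-cluster-sketch")
  .setMaster("local-cluster[2,2,1024]")
val sc = new SparkContext(conf)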
Best Regards
Peng Xu
2015-03-16 19:46 GMT+08:00 Xi Shen davidshe...@gmail.com:
Hi,
In YARN mode you can specify the number of executors. I wonder if we can
(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
The data are transformed to LabeledPoint and I was using pyspark for this.
Can anyone help me on this?
Thanks,
Best,
Peng
algorithm in python.
3. train a logistic regression model with the converted labeled points.
Can anyone give some advice on how to avoid the 2GB limit, if this is the cause?
Thanks very much for the help.
Best,
Peng
On Mon, Mar 9, 2015 at 3:54 PM, Peng Xia sparkpeng...@gmail.com wrote:
Hi,
I
).
Now you should be able to compile and run.
HTH,
Markus
On 03/12/2015 11:55 PM, Kevin Peng wrote:
Dale,
I basically have the same maven dependency above, but my code will not
compile due to not being able to reference to AvroSaver, though the
saveAsAvro reference compiles fine, which
Hi
I was running a logistic regression algorithm on an 8-node spark cluster;
each node has 8 cores and 56 GB of RAM (each node is running a Windows
system), and the spark installation drive has 1.9 TB capacity. The dataset
I was training on has around 40 million records with around 6600
Yin,
Yup thanks. I fixed that shortly after I posted and it worked.
Thanks,
Kevin
On Fri, Mar 13, 2015 at 8:28 PM, Yin Huai yh...@databricks.com wrote:
Seems you want to use an array for the field of providers, like
providers: [{"id": ...}, {"id": ...}] instead of providers: {{"id": ...}, {"id": ...}}
Dale,
I basically have the same maven dependency above, but my code will not
compile due to not being able to reference AvroSaver, though the
saveAsAvro reference compiles fine, which is weird. Even though saveAsAvro
compiles for me, it errors out when running the spark job due to it not
being
, Mesos) or LOCAL_DIRS (YARN)
On Sat, Mar 14, 2015 at 5:29 PM, Peng Xia sparkpeng...@gmail.com
wrote:
Hi Sean,
Thank very much for your reply.
I tried to config it from below code:
sf = SparkConf().setAppName(test).set(spark.executor.memory,
45g).set(spark.cores.max, 62),set
Hi Ted,
Thanks very much, yea, using broadcast is much faster.
Best,
Peng
On Tue, Mar 31, 2015 at 8:49 AM, Ted Yu yuzhih...@gmail.com wrote:
You can use broadcast variable.
See also this thread:
http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variablesubj=How+Broadcast+variable
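For example (a sketch; the lookup map is made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-lookup").setMaster("local[2]"))

    // Ship the small lookup table to each executor once, instead of
    // re-serializing it into every task's closure.
    val lookup = sc.broadcast(Map("US" -> "United States", "CA" -> "Canada"))

    val codes = sc.parallelize(Seq("US", "CA", "MX"))
    val names = codes.map(c => lookup.value.getOrElse(c, "unknown"))

    names.collect().foreach(println)
    sc.stop()
  }
}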
/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html
On Wed, Mar 4, 2015 at 4:34 PM, Kevin Peng kpe...@gmail.com wrote:
Ted,
I am currently using CDH 5.3 distro, which has Spark 1.2.0, so I am not
too
sure about the compatibility issues between 1.2.0 and 1.2.1, that is why
this thread: http://search-hadoop.com/m/JW1q5Vfe6X1
Cheers
On Wed, Mar 4, 2015 at 4:18 PM, Kevin Peng kpe...@gmail.com wrote:
Marcelo,
Yes that is correct, I am going through a mirror, but 1.1.0 works
properly, while 1.2.0 does not. I suspect there is a CRC issue in the 1.2.0 pom
file.
On Wed, Mar
I got exactly the same problem, except that I'm running on a standalone
master. Can you tell me the counterpart parameter on a standalone master for
increasing the same memory overhead?
I'm deploying a Spark data processing job on an EC2 cluster. The job is small
for the cluster (16 cores with 120G RAM in total); the largest RDD has only
76k+ rows, but it is heavily skewed in the middle (thus requires repartitioning)
and each row has around 100k of data after serialization. The job
Looks like this problem has been mentioned before:
http://qnalist.com/questions/5666463/downloads-from-s3-exceedingly-slow-when-running-on-spark-ec2
and a temporary solution is to deploy on a dedicated EMR/S3 configuration.
I'll give that one a shot.
Turns out the above thread is unrelated: it was caused by using s3:// instead
of s3n://, which I already avoided in my checkpointDir configuration.
BTW: my thread dump of the driver's main thread looks like it is stuck
waiting for Amazon S3 bucket metadata for a long time (which may suggest
that I should move the checkpointing directory from S3 to HDFS):
Thread 1: main (RUNNABLE)
java.net.SocketInputStream.socketRead0(Native Method)
I'm using Amazon EMR + S3 as my spark cluster infrastructure. When I'm
running a job with periodic checkpointing (it has a long dependency tree, so
truncating by checkpointing is mandatory; each checkpoint has 320
partitions), the job stops halfway, resulting in an exception:
(On driver)
the new way to set
the same properties?
Yours Peng
On 12 June 2015 at 14:20, Andrew Or and...@databricks.com wrote:
Hi Peng,
Setting properties through --conf should still work in Spark 1.4. From the
warning it looks like the config you are trying to set does not start with
the prefix spark
2015 at 19:39, Ted Yu yuzhih...@gmail.com wrote:
This is the SPARK JIRA which introduced the warning:
[SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties
in spark-shell and spark-submit
On Fri, Jun 12, 2015 at 4:34 PM, Peng Cheng rhw...@gmail.com wrote:
Hi Andrew,
In Spark 1.3.x, a system property of the driver could be set via the --conf
option, which was shared between setting spark properties and system properties.
In Spark 1.4.0 this feature is removed; the driver instead logs the following
warning:
Warning: Ignoring non-spark config property: xxx.xxx=v
How do
I stumbled upon this thread and I conjecture that this may affect restoring a
checkpointed RDD as well:
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928
In my case I have 1600+ fragmented
Ted,
What triggerAndWait does is perform a REST call to a specified URL and then
wait until the status field in the JSON that gets returned by that URL says
complete. The issue is that I put a println at the very top of the method and
that doesn't get printed out, and I know that println
> >> Having given it a first look I do think that you have hit something here
> >> and this does not look quite fine. I have to work on the multiple AND
> >> conditions in ON and see whether that is causing any issues.
> >>
> >> Regards,
> >>
> >> wrote:
> >>>
> >>> Hi Kevin,
> >>>
> >>> Having given it a first look I do think that you have hit something
> here
> >>> and this does not look quite fine. I have to work on the multiple AND
> >>> conditions in ON
pe show the same
> results,
> which meant that all the rows from left could match at least one row from
> right,
> all the rows from right could match at least one row from left, even
> the number of row from left does not equal that of right.
>
> This is correct result.
>
>
Gourav,
Apologies. I edited my post with this information:
Spark version: 1.6
Result from spark shell
OS: Linux version 2.6.32-431.20.3.el6.x86_64 (
mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014
Thanks,
KP
On Mon,
Gourav,
I wish that were the case, but I have done a select count on each of the two
tables individually and they return different numbers of rows:
dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")
dps.count()
RESULT: 42632
swig.count()
RESULT: 42034
Yong,
Sorry, let me explain my deduction; it is going to be difficult to get sample
data out since the dataset I am using is proprietary.
From the above set of queries (the ones mentioned in the above comments) both inner
and outer join are producing the same counts. They are basically pulling
out selected
Hi,
Is there a way to write a UDF in pyspark that supports agg()?
I searched all over the docs and the internet, and tested it out... some say yes,
some say no.
And when I try those 'yes' code examples, they just complain about
AnalysisException: u"expression 'pythonUDF' is neither present in the group
by,
BTW, I am using Spark 1.6.1
df:
---------
a | b | c
---------
1 | m | n
1 | x | j
2 | m | x
...

import pyspark.sql.functions as F
from pyspark.sql.types import MapType, StringType

def my_zip(c, d):
    return dict(zip(c, d))

my_zip_udf = F.udf(my_zip, MapType(StringType(), StringType(), True))
Mohini,
We set that parameter before we went and played with the number of
executors and that didn't seem to help at all.
Thanks,
KP
On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar
wrote:
> Hi,
>
> try using this parameter --conf spark.sql.shuffle.partitions=1000
Hi,
I have several Spark jobs, including both batch jobs and streaming jobs, to
process the system logs and analyze them. We are using Kafka as the pipeline
to connect the jobs.
Once upgraded to Spark 2.1.0 + Spark Kafka Streaming 010, I found some of
the jobs (both batch and streaming) throw the below
handles task failure, so if the job ended normally, this error
>>> can be ignored.
>>> Second, when using BypassMergeSortShuffleWriter, it will first write
>>> the data file and then write an index file.
>>> You can check "Failed to delete temporary index file at
Is there anyone who can shed some light on this issue?
Thanks
Martin
2017-07-21 18:58 GMT-07:00 Martin Peng <wei...@gmail.com>:
> Hi,
>
> I have several Spark jobs including both batch job and Stream jobs to
> process the system log and analyze them. We are using Kaf