Someone just replied to the bug; it was already known and will be fixed
in the upcoming Iceberg 1.7.2 release.
On Thu, 2025-02-06 at 09:35 +, Aaron Grubb wrote:
> Hi all,
>
> I filed a bug with the Iceberg team [1] but I'm not sure that it's 100%
> specific to I
sing aarch64 somehow? All JARs are precompiled versions.
Thanks,
Aaron
[1] https://github.com/apache/iceberg/issues/12178
On Mon, 2025-01-13 at 18:49 +0000, Aaron Grubb wrote:
> Hi all,
>
> I'm trying to figure out how to persist a table definition in a catalog that
> can be used from different sessions. Something along the lines
> of
>
> ---
> CREATE TABLE spark_catal
;. I have also tried setting spark.sql.warehouse.dir to a location
in S3 and setting enableHiveSupport() on the session, however
creating this table under these circumstances only creates an empty directory
and the table doesn't show up in the next session. Would I need
to
at 13:01 +0000, Aaron Grubb wrote:
> Hi all,
>
> I'm running Spark on Kubernetes on AWS using only spot instances for
> executors with dynamic allocation enabled. This particular job is
> being
> triggered by Airflow and it hit this bug [1] 6 times in a row. However, I had
dynamic
spark.sql.streaming.kafka.useDeprecatedOffsetFetching false
spark.submit.deployMode cluster
Thanks,
Aaron
[1] https://issues.apache.org/jira/browse/SPARK-45858
Downgrade to hadoop-*:3.3.x. Hadoop 3.4.x is based on the AWS SDK v2 and should
probably be considered breaking for tools that were built against < 3.4.0 while
using AWS.
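For anyone building with sbt, a minimal sketch of what pinning back to the 3.3 line might look like (the 3.3.6 version below is only an illustrative placeholder, not something stated in this thread):
  // build.sbt sketch: keep the Hadoop artifacts on the 3.3.x line
  libraryDependencies ++= Seq(
    "org.apache.hadoop" % "hadoop-client" % "3.3.6",
    "org.apache.hadoop" % "hadoop-aws"    % "3.3.6"
  )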
From: Oxlade, Dan
Sent: Wednesday, April 3, 2024 2:41:11 PM
To: user@spark.apache.org
Subject: [Sp
Hi Mich,
Thanks a lot for the insight, it was very helpful.
Aaron
On Thu, 2023-01-05 at 23:44 +, Mich Talebzadeh wrote:
Hi Aaron,
Thanks for the details.
It is a general practice when running Spark on premise to use Hadoop
clusters.<https://spark.apache.org/faq.html#:~:text=How%20d
nd HBase are running provide localization benefits when Spark reads
from HBase, or are localization benefits negligible and it's a better idea to
put Spark in a standalone cluster?
Thanks for your time,
Aaron
On Thu, 2023-01-05 at 19:00 +, Mich Talebzadeh wrote:
Few questions
* As I unde
same nodes provide a benefit when using
hbase-connectors
(https://github.com/apache/hbase-connectors/tree/master/spark)? Is there a
mechanism in the connector to "pass through" a short circuit read to Spark, or
would data always bounce from HDFS -> RegionServer -> Spark?
Thanks in advance,
Aaron
I'm assuming some things here, but hopefully I understand. So, basically
you have a big table of data distributed across a bunch of executors. And,
you want an efficient way to call a native method for each row.
It sounds similar to a dataframe writer to me. Except, instead of writing
to disk or n
I want to calculate quantiles on two different columns. I know that I can
calculate them with two separate operations. However, for performance
reasons, I'd like to calculate both with one operation.
Is it possible to do this with the Dataset API? I'm assuming that it
isn't. But, if it isn't, i
That setting defines the total number of tasks that an executor can run in
parallel.
Each node is partitioned into executors, each with identical heap and
cores. So, it can be a balancing act to optimally set these values,
particularly if the goal is to maximize CPU usage with memory and other IO.
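For illustration only (the thread does not show the actual submit command), here is a sketch of the knobs being described, assuming the setting in question is spark.executor.cores; all values are placeholders:
  import org.apache.spark.{SparkConf, SparkContext}
  // One executor runs roughly spark.executor.cores / spark.task.cpus tasks in parallel.
  val conf = new SparkConf()
    .setAppName("sizing-sketch")
    .set("spark.executor.cores", "4")    // task slots per executor
    .set("spark.executor.memory", "8g")  // identical heap per executor
    .set("spark.task.cpus", "1")         // cores claimed by each task
  val sc = new SparkContext(conf)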
Please unsubscribe
aaron.t.hoganc...@leidos.com from this
mailing list.
Thanks,
Aaron Hogancamp
Data Scientist
(615) 431-3229 (desk)
(615) 617-7160 (mobile)
oes updated.FieldId
show 123 as well, when the expanded join for 'updated.*' shows null. I can do
what I want to do by using an RDD, but I was hoping to avoid bypassing
tungsten.
It almost feels like it's optimizing the field based on the join. But I
tested other fields as well and they also came back with values from base.
Very odd.
Any thoughts?
Aaron
Unsubscribe.
Thanks,
Aaron Hogancamp
Data Scientist
Your error stems from spark.ml, and in your pom mllib is the only dependency
that is 2.10. Is there a reason for this? I.e., you tell Maven that mllib 2.10 is
provided at runtime. Is 2.10 on the machine, or is 2.11?
-Aaron
From: VG
Date: Friday, July 22, 2016 at 1:49 PM
To: Sean Owen
Cc: User
What version of Spark/Scala are you running?
-Aaron
Hi,
I have a cluster with 15 nodes of which 5 are HDFS nodes. I kick off a job
that creates some 120 stages. Eventually, the active and pending stages
reduce down to a small bottleneck and it never fails... the tasks
associated with the 10 (or so) running tasks are always allocated to the
same e
> then resume execution. This works, but it ends up costing me a lot of extra
> memory (i.e. a few TiB when I have a lot of executors).
>
> What I'd like to do is use the broadcast mechanism to load the data structure
> once, per node. But, I can't serialize the data structure from the driver.
>
> Any ideas?
>
> Thanks!
>
> Aaron
>
5 GiB? Or, do I have to decrease executor memory
to ~385 across all executors?
(Note: I'm running on Yarn, which may affect this.)
Thanks,
Aaron
On Wed, Jun 29, 2016 at 12:09 PM Sean Owen wrote:
> If you have one executor per machine, which is the right default thing
> to do, and th
e a
lot of extra memory (i.e. a few TiB when I have a lot of executors).
What I'd like to do is use the broadcast mechanism to load the data
structure once, per node. But, I can't serialize the data structure from
the driver.
Any ideas?
Thanks!
Aaron
elp,
AARON ILOVICI
Software Engineer
Marketing Engineering
WAYFAIR
4 Copley Place
Boston, MA 02116
(617) 532-6100 x1231
ailov...@wayfair.com
From: Reynold Xin
Date: Thursday, May 26, 2016 at 6:11 PM
To: Mohammed Guller
7; - is None mapped to the proper NULL type elsewhere?
My environment: Spark 1.6, Vertica Driver 7.2.2, Java 1.7
I would be happy to create a Jira and submit a pull request with the
VerticaDialect once I figure this out.
Thank you for any insight on this,
AARON ILOVICI
Software Engineer
I'm using Spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's
in s3. I've done this previously with spark 1.5 with no issue. Attempting
to load and count a single file as follows:
dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
dataFrame.count()
But when it attem
I think the SparkListener is about as close as it gets. That way I can
start up the instance (aws, open-stack, vmware, etc) and simply wait until
the SparkListener indicates that the executors are online before starting.
Thanks for the advice.
Aaron
On Fri, Mar 25, 2016 at 10:54 AM, Jacek
specific case, I may be growing the cluster size by a hundred nodes and if
I fail to wait for that initialization to complete the job will not have
enough memory to run my jobs.
Aaron
On Thu, Mar 24, 2016 at 3:07 AM, Takeshi Yamamuro
wrote:
> Hi,
>
> There is no way to get such information
I went down the SQL path. The problem is the loss
of type and the possibility for SQL injection. No biggie, just means that
where parameterized queries are in-play, we'll have to write it out in-code
rather than in SQL.
Thanks,
Aaron
On Sun, Dec 27, 2015 at 8:06 PM, Michael Armbrust
wrote
tname/ip in mesos configuration - see Nikolaos answer
>
Cheers,
Aaron
On Mon, Nov 16, 2015 at 9:37 PM, Jo Voordeckers
wrote:
> I've seen this issue when the mesos cluster couldn't figure out my IP
> address correctly, have you tried setting the ENV var with your IP address
> wh
Greetings,
I am processing a "batch" of files and have structured an iterative process
around them. Each batch is processed by first loading the data with
spark-csv, performing some minor transformations and then writing back out
as parquet. Absolutely no caching or shuffle should occur with anyt
mber 28, 2015 at 1:35 PM
To: Aaron Dossett <aaron.doss...@target.com>
Subject: Re: Python script runs fine in local mode, errors in other modes
Was there any eventual solution to this that you discovered?
If you reply to this email, your messag
ConnectionManager has been deprecated and is no longer used by default
(NettyBlockTransferService is the replacement). Hopefully you would no
longer see these messages unless you have explicitly flipped it back on.
On Tue, Aug 4, 2015 at 6:14 PM, Jim Green wrote:
> And also https://issues.apache
h ? If I don't use this cmd line option, I get an error just
attempting to do the sqlContext.read.jdbc() assignment..not trying to
perform an operation on the RDD.
Cheers,
Aaron
Note that if you use multi-part upload, each part becomes 1 block, which
allows for multiple concurrent readers. One would typically use fixed-size
block sizes which align with Spark's default HDFS block size (64 MB, I
think) to ensure the reads are aligned.
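A hedged sketch of how the part size might be aligned when writing through the s3a connector (the property is real; the 64 MB figure simply mirrors the block size mentioned above):
  // Make each multi-part upload part 64 MB so parts line up with the
  // block size Spark assumes when splitting the input for readers.
  sc.hadoopConfiguration.set("fs.s3a.multipart.size", (64 * 1024 * 1024).toString)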
On Sat, Jul 11, 2015 at 11:14 AM, Steve
Are you seeing this after the app has already been running for some time,
or just at the beginning? Generally, registration should only occur once
initially, and a timeout would be due to the master not being accessible.
Try telneting to the master IP/port from the machine on which the driver
will
I think 2.6 failed to abruptly close streams that weren't fully read, which
we observed as a huge performance hit. We had to backport the 2.7
improvements before being able to use it.
You should be able to use s3a (on new Hadoop versions); I believe that will try,
or at least has a setting for, v4
On Tue, Jun 30, 2015 at 8:31 PM, Exie wrote:
> Not sure if this helps, but the options I set are slightly different:
>
> val hadoopConf=sc.hadoopConfiguration
> hadoopConf.set("fs.s3n.aws
Yep! That was it. Using the
1.6.0rc3
that comes with spark, rather than using the 1.5.0-cdh5.4.2 version.
Thanks for the help!
Cheers,
Aaron
On Thu, Jun 25, 2015 at 8:24 AM, Sean Owen wrote:
> Hm that looks like a Parquet version mismatch then. I think Spark 1.4
> uses 1.6? You
ote:
> You didn't provide any error?
>
> You're compiling vs Hive 1.1 here and that is the problem. It is nothing
> to do with CDH.
>
> On Wed, Jun 24, 2015, 10:15 PM Aaron wrote:
>
>> I was curious if any one was able to get CDH 5.4.1 or 5.4.2 compiling
>&g
n't lead me anywhere... thoughts? help? URLs
to read?
Thanks in advance.
Cheers,
Aaron
Be careful shoving arbitrary binary data into a string, invalid utf
characters can cause significant computational overhead in my experience.
On Jun 11, 2015 10:09 AM, "Mark Tse" wrote:
> Makes sense – I suspect what you suggested should work.
>
>
>
> However, I think the overhead between this a
Note that speculation is off by default to avoid these kinds of unexpected
issues.
On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran
wrote:
>
> It's worth adding that there's no guaranteed that re-evaluated work would
> be on the same host as before, and in the case of node failure, it is not
> gu
Actually, this is the more relevant JIRA (which is resolved):
https://issues.apache.org/jira/browse/SPARK-3595
6352 is about saveAsParquetFile, which is not in use here.
Here is a DirectOutputCommitter implementation:
https://gist.github.com/aarondav/c513916e72101bbe14ec
and it can be configured
>
> Jianshi
>
> On Thu, Mar 5, 2015 at 2:44 PM, Shao, Saisai
> wrote:
>
> Hi Jianshi,
>
>
>
> From my understanding, it may not be the problem of NIO or Netty, looking
> at your stack trace, the OOM is occ
one runs into the same problem I had.
> >>
> >> By setting --hadoop-major-version=2 when using the ec2 scripts,
> >> everything worked fine.
> >>
> >> Darin.
> >>
> >>
> >> - Original Message -
> >> From: Darin McBeath
&
ChannelHandlerContext.java:319)
> at
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
> at
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
> at
> io.netty.channel.nio.NioEventLoop.
"Failed to connect" implies that the executor at that host died, please
check its logs as well.
On Tue, Mar 3, 2015 at 11:03 AM, Jianshi Huang
wrote:
> Sorry that I forgot the subject.
>
> And in the driver, I got many FetchFailedException. The error messages are
>
> 15/03/03 10:34:32 WARN TaskS
All stated symptoms are consistent with GC pressure (other nodes timeout
trying to connect because of a long stop-the-world), quite possibly due to
groupByKey. groupByKey is a very expensive operation as it may bring all
the data for a particular partition into memory (in particular, it cannot
spil
Note that the parallelism (i.e., number of partitions) is just an upper
bound on how much of the work can be done in parallel. If you have 200
partitions, then you can divide the work among between 1 and 200 cores and
all resources will remain utilized. If you have more than 200 cores,
though, then
Here is the class: https://gist.github.com/aarondav/c513916e72101bbe14ec
You can use it by setting "mapred.output.committer.class" in the Hadoop
configuration (or "spark.hadoop.mapred.output.committer.class" in the Spark
configuration). Note that this only works for the old Hadoop APIs, I
believe
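A minimal sketch of the two ways to wire that in; the committer class name below is a placeholder for whichever class you drop in from the gist:
  import org.apache.spark.SparkConf
  // Option 1: via the Hadoop configuration on an existing SparkContext
  sc.hadoopConfiguration.set("mapred.output.committer.class",
    "com.example.DirectOutputCommitter")  // placeholder class name
  // Option 2: via the Spark configuration when building the context
  val conf = new SparkConf()
    .set("spark.hadoop.mapred.output.committer.class",
         "com.example.DirectOutputCommitter")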
RangePartitioner does not actually provide a guarantee that all partitions
will be equal sized (that is hard), and instead uses sampling to
approximate equal buckets. Thus, it is possible that a bucket is left empty.
If you want the specified behavior, you should define your own partitioner.
It wo
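A minimal sketch of what "define your own partitioner" could look like when the key space is known up front; the modulo scheme is only an illustration:
  import org.apache.spark.Partitioner
  // Spread integer keys evenly across a fixed number of partitions.
  class ModuloPartitioner(parts: Int) extends Partitioner {
    override def numPartitions: Int = parts
    override def getPartition(key: Any): Int = {
      val k = key.asInstanceOf[Int]
      ((k % parts) + parts) % parts  // keep the result non-negative
    }
  }
  // usage: pairRdd.partitionBy(new ModuloPartitioner(8))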
I think Xuefeng Wu's suggestion is likely correct. This difference is more
likely explained by the compression library changing versions than sort vs
hash shuffle (which should not affect output size significantly). Others
have reported that switching to lz4 fixed their issue.
We should document th
Did the problem go away when you switched to lz4? The default compression
codec changed from 1.0 to 1.1, where we went from LZF to Snappy. I don't
think there was any such change from 1.1 to 1.2, though.
On Fri, Feb 6, 2015 at 12:17 AM, Praveen Garg
wrote:
> We tried changing the
The latter would be faster. With S3, you want to maximize number of
concurrent readers until you hit your network throughput limits.
On Wed, Feb 4, 2015 at 6:20 AM, Peter Rudenko
wrote:
> Hi if i have a 10GB file on s3 and set 10 partitions, would it be
> download whole file on master first and
To be clear, there is no distinction between partitions and blocks for RDD
caching (each RDD partition corresponds to 1 cache block). The distinction
is important for shuffling, where by definition N partitions are shuffled
into M partitions, creating N*M intermediate blocks. Each of these blocks
m
Ah, this is in particular an issue due to sort-based shuffle (it was not
the case for hash-based shuffle, which would immediately serialize each
record rather than holding many in memory at once). The documentation
should be updated.
On Fri, Jan 30, 2015 at 11:27 AM, Sandy Ryza
wrote:
> Hi Andre
> code logs, but the job sits there as the moving of files happens.
>
> On Tue, Jan 27, 2015 at 7:24 PM, Aaron Davidson
> wrote:
>
>> This renaming from _temporary to the final location is actually done by
>> executors, in parallel, for saveAsTextFile. It should be perfor
This renaming from _temporary to the final location is actually done by
executors, in parallel, for saveAsTextFile. It should be performed by each
task individually before it returns.
I have seen an issue similar to what you mention dealing with Hive code
which did the renaming serially on the dri
It looks like something weird is going on with your object serialization,
perhaps a funny form of self-reference which is not detected by
ObjectOutputStream's typical loop avoidance. That, or you have some data
structure like a linked list with a parent pointer and you have many
thousand elements.
Please take a look at the executor logs (on both sides of the IOException)
to see if there are other exceptions (e.g., OOM) which precede this one.
Generally, the connections should not fail spontaneously.
On Sun, Jan 25, 2015 at 10:35 PM, octavian.ganea wrote:
> Hi,
>
> I am running a program t
This was a regression caused by Netty Block Transfer Service. The fix for
this just barely missed the 1.2 release, and you can see the associated
JIRA here: https://issues.apache.org/jira/browse/SPARK-4837
Current master has the fix, and the Spark 1.2.1 release will have it
included. If you don't
Spark's network-common package depends on guava as a "provided" dependency
in order to avoid conflicting with other libraries (e.g., Hadoop) that
depend on specific versions. com/google/common/base/Preconditions has been
present in Guava since version 2, so this is likely a "dependency not
found" r
Scala for-loops are implemented as closures using anonymous inner classes
which are instantiated once and invoked many times. This means, though,
that the code inside the loop is actually sitting inside a class, which
confuses Spark's Closure Cleaner, whose job is to remove unused references
from c
What version are you running? I think "spark.shuffle.use.netty" was a valid
option only in Spark 1.1, where the Netty stuff was strictly experimental.
Spark 1.2 contains an officially supported and much more thoroughly tested
version under the property "spark.shuffle.blockTransferService", which is
As Jerry said, this is not related to shuffle file consolidation.
The unique thing about this problem is that it's failing to find a file
while trying to _write_ to it, in append mode. The simplest explanation for
this would be that the file is deleted in between some check for existence
and openi
Do note that this problem may be fixed in Spark 1.2, as we changed the
default transfer service to use a Netty-based one rather than the
ConnectionManager.
On Thu, Jan 8, 2015 at 7:05 AM, Spidy wrote:
> Hi,
>
> Can you please explain which settings you changed?
Found the issue in JIRA:
https://issues.apache.org/jira/browse/SPARK-4389?jql=project%20%3D%20SPARK%20AND%20text%20~%20NAT
On Tue, Jan 6, 2015 at 10:45 AM, Aaron wrote:
> From what I can tell, this isn't a "firewall" issue per se..it's how the
> Remoting Service "
to tell the workers which IP address to use..WITHOUT, binding to it maybe?
Maybe allow the Remoting Service to bind to the internal IP..but, advertise
it differently?
On Mon, Jan 5, 2015 at 9:02 AM, Aaron wrote:
> Thanks for the link! However, from reviewing the thread, it appears you
> c
,
Aaron
On Mon, Jan 5, 2015 at 8:28 AM, Akhil Das
wrote:
> You can have a look at this discussion
> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html
>
> Thanks
> Best Regards
>
> On Mon, J
ious --conf spark.driver.host
parameters...but it still get's "angry."
Any thoughts/suggestions?
Currently our work around is to VPNC connection from inside the vagrant VMs
or Openstack instances...but, that doesn't seem like a long term plan.
Thanks in advance!
Cheers,
Aaron
The ContextCleaner uncaches RDDs that have gone out of scope on the driver.
So it's possible that the given RDD is no longer reachable in your
program's control flow, or else it'd be a bug in the ContextCleaner.
On Wed, Dec 10, 2014 at 5:34 PM, ankits wrote:
> I'm using spark 1.1.0 and am seeing
You can actually submit multiple jobs to a single SparkContext in different
threads. In the case you mentioned with 2 stages having a common parent,
both will wait for the parent stage to complete and then the two will
execute in parallel, sharing the cluster resources.
Solutions that submit multi
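A small sketch of submitting two actions from separate threads against one SparkContext (names and numbers are illustrative):
  // Both actions share sc; the cached parent is computed once and the two
  // downstream stages then run concurrently, sharing cluster resources.
  val parent = sc.parallelize(1 to 1000000).map(x => (x % 10, x)).cache()
  val t1 = new Thread(new Runnable {
    def run(): Unit = println("sum:  " + parent.values.sum())
  })
  val t2 = new Thread(new Runnable {
    def run(): Unit = println("keys: " + parent.countByKey().size)
  })
  t1.start(); t2.start()
  t1.join();  t2.join()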
Because this was a maintenance release, we should not have introduced any
binary backwards or forwards incompatibilities. Therefore, applications
that were written and compiled against 1.1.0 should still work against a
1.1.1 cluster, and vice versa.
On Wed, Dec 3, 2014 at 1:30 PM, Andrew Or wrote
new s3a filesystem in Hadoop 2.6.0 [1].
>
> 1.
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400
> On Nov 26, 2014 12:24 PM, "Aaron Davidson" wrote:
>
>> Spark has a known problem where it will do a pass of metadata on a large
>> num
Spark has a known problem where it will do a pass of metadata on a large
number of small files serially, in order to find the partition information
prior to starting the job. This will probably not be repaired by switching
the FS impl.
However, you can change the FS being used like so (prior to th
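The snippet is cut off right where the example would appear; a hedged guess at its shape, assuming the idea is to point a URI scheme at a different FileSystem implementation through the Hadoop configuration before the job starts (the s3a pairing below is just one example):
  sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))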
As Mohit said, making Main extend Serializable should fix this example. In
general, it's not a bad idea to mark the fields you don't want to serialize
(e.g., sc and conf in this case) as @transient as well, though this is not
the issue in this case.
Note that this problem would not have arisen in
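A minimal sketch of the two fixes mentioned (all names and values are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}
  // The class whose members the closure touches must be serializable; the
  // @transient fields are skipped when the closure is shipped to executors.
  class Main extends Serializable {
    @transient private val conf = new SparkConf().setAppName("example")
    @transient private val sc   = new SparkContext(conf)
    private val threshold = 10  // captured by the closure, so it must serialize
    def run(): Unit = {
      val filtered = sc.parallelize(1 to 100).filter(_ > threshold)
      println(filtered.count())
    }
  }
  object Main {
    def main(args: Array[String]): Unit = new Main().run()
  }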
In the situation you show, Spark will pipeline each filter together, and
will apply each filter one at a time to each row, effectively constructing
an "&&" statement. You would only see a performance difference if the
filter code itself is somewhat expensive, then you would want to only
execute it
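A tiny sketch of the equivalence being described, assuming rdd is an RDD[String] and the predicates are arbitrary examples:
  // Two chained filters: Spark pipelines them, so each row is tested by the
  // first predicate and, only if it passes, by the second...
  val chained = rdd.filter(_.startsWith("a")).filter(_.length > 3)
  // ...which is effectively the same single pass as one combined predicate.
  val combined = rdd.filter(x => x.startsWith("a") && x.length > 3)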
The 8 minute slowdown seems to be solely
> attributable to the data locality issue, as far as I can tell. There was
> some further confusion though in the times I mentioned - the list I gave
> (3.1 min, 2 seconds, ... 8 min) were not different runs with different
> cache %s, they were it
n what I meant.
> I didn't mean it went down within a run, with the same instance.
>
> I meant I'd run the whole app, and one time, it would cache 100%, and the
> next run, it might cache only 83%
>
> Within a run, it doesn't change.
>
> On Wed, Nov 12,
The fact that the caching percentage went down is highly suspicious. It
should generally not decrease unless other cached data took its place, or
unless executors were dying. Do you know if either of these were the
case?
On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld <
nkronenf...@oculusinf
oops, meant to cc userlist too
On Sat, Nov 8, 2014 at 3:13 PM, Aaron Davidson wrote:
> The default local master is "local[*]", which should use all cores on your
> system. So you should be able to just do "./bin/pyspark" and
> "sc.parallelize(range(1000)).co
This may be due in part to Scala allocating an anonymous inner class in
order to execute the for loop. I would expect if you change it to a while
loop like
var i = 0
while (i < 10) {
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  i += 1
}
then the problem may go away. I am not sup
coalesce() is a streaming operation if used without the second parameter;
it does not put all the data in RAM. If used with the second parameter
(shuffle = true), then it performs a shuffle, but still does not put all
the data in RAM.
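For concreteness, a small sketch of the two forms (the partition count is arbitrary):
  // No second argument: a narrow, streaming coalesce; partitions are merged
  // without a shuffle and without materializing everything in RAM.
  val narrowed = rdd.coalesce(10)
  // shuffle = true: a full shuffle is performed, but records are still streamed
  // through the shuffle machinery rather than collected into memory.
  val reshuffled = rdd.coalesce(10, shuffle = true)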
On Sat, Nov 1, 2014 at 12:09 PM, wrote:
> Now I am getting to
Another wild guess, if your data is stored in S3, you might be running into
an issue where the default jets3t properties limits the number of parallel
S3 connections to 4. Consider increasing the max-thread-counts from here:
http://www.jets3t.org/toolkit/configuration.html.
On Tue, Oct 21, 2014 at
You may be running into this issue:
https://issues.apache.org/jira/browse/SPARK-4019
You could check by having 2000 or fewer reduce partitions.
On Wed, Oct 22, 2014 at 1:48 PM, DB Tsai wrote:
> PS, sorry for spamming the mailing list. Based my knowledge, both
> spark.shuffle.spill.compress and
The "minPartitions" argument of textFile/hadoopFile cannot decrease the
number of splits past the physical number of blocks/files. So if you have 3
HDFS blocks, asking for 2 minPartitions will still give you 3 partitions
(hence the "min"). It can, however, convert a file with fewer HDFS blocks
into
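A quick sketch of that behavior (the path and numbers are illustrative; assume the file spans 3 HDFS blocks):
  val three = sc.textFile("hdfs:///data/example.txt", 2)  // still 3 partitions: it is a minimum
  val eight = sc.textFile("hdfs:///data/example.txt", 8)  // blocks can be split further, up to 8
  println(three.partitions.length + " vs " + eight.partitions.length)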
Pretty easy to do in Scala:
rdd.elementClassTag.runtimeClass
You can access this method from Python as well by using the internal _jrdd.
It would look something like this (warning, I have not tested it):
rdd._jrdd.classTag().runtimeClass()
(The method name is "classTag" for JavaRDDLike, and "ele
Looks like that's BlockManagerWorker.syncPutBlock(), which is in an if
check, perhaps obscuring its existence.
On Fri, Sep 5, 2014 at 2:19 AM, rapelly kartheek
wrote:
> Hi,
>
> var cachedPeers: Seq[BlockManagerId] = null
> private def replicate(blockId: String, data: ByteBuffer, level:
> Stor
Are you doing this from the spark-shell? You're probably running into
https://issues.apache.org/jira/browse/SPARK-1199 which should be fixed in
1.1.
On Sat, Sep 6, 2014 at 3:03 AM, Dhimant wrote:
> I am using Spark version 1.0.2
More of a Scala question than Spark, but "apply" here can be written with
just parentheses like this:
val array = Array.fill[Byte](10)(0)
if (array(index) == 0) {
array(index) = 1
}
The second instance of "array(index) = 1" is actually not calling apply,
but "update". It's a scala-ism that's us
If someone doesn't have access to do that, is there an easy way to specify a
different properties file to be used?
Patrick Wendell wrote
> If you want to customize the logging behavior - the simplest way is to
> copy
> conf/log4j.properties.tempate to conf/log4j.properties. Then you can go
> and
This is likely due to a bug in shuffle file consolidation (which you have
enabled) which was hopefully fixed in 1.1 with this patch:
https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd
Until 1.0.3 or 1.1 are released, the simplest solution is to disable
spark.shuffle.co
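The property name above is cut off; assuming it is spark.shuffle.consolidateFiles (the consolidation flag in that era), a sketch of turning it off until the fix lands:
  import org.apache.spark.{SparkConf, SparkContext}
  // Assumes the truncated property above is spark.shuffle.consolidateFiles.
  val conf = new SparkConf()
    .setAppName("example")
    .set("spark.shuffle.consolidateFiles", "false")
  val sc = new SparkContext(conf)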
These three lines of python code cause the error for me:
sc = SparkContext(appName="foo")
input = sc.textFile("hdfs://[valid hdfs path]")
mappedToLines = input.map(lambda myline: myline.split(","))
The file I'm loading is a simple CSV.
Sure thing, this is the stacktrace from pyspark. It's repeated a few times,
but I think this is the unique stuff.
Traceback (most recent call last):
File "", line 1, in
File
"/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/python/pyspark/rdd.py",
line 583, in collect
bytesInJa
g
in my own sandbox in purely local mode before. Any help would be
appreciated, thanks! -Aaron
Yes, good point, I believe the masterLock is now unnecessary altogether.
The reason for its initial existence was that "changeMaster()" originally
could be called out-of-band of the actor, and so we needed to make sure the
master reference did not change out from under us. Now it appears that all
m
The driver must initially compute the partitions and their preferred
locations for each part of the file, which results in a serial
getFileBlockLocations() on each part. However, I would expect this to take
several seconds, not minutes, to perform on 1000 parts. Is your driver
inside or outside of
Ah, that's unfortunate, that definitely should be added. Using a
pyspark-internal method, you could try something like
javaIterator = rdd._jrdd.toLocalIterator()
it = rdd._collect_iterator_through_file(javaIterator)
On Fri, Aug 1, 2014 at 3:04 PM, Andrei wrote:
> Thanks, Aaron, it s
rdd.toLocalIterator will do almost what you want, but requires that each
individual partition fits in memory (rather than each individual line).
Hopefully that's sufficient, though.
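A short sketch of what that looks like on the Scala side (the per-record processing is a placeholder):
  // Pulls one partition at a time back to the driver, so only a single
  // partition, not the whole RDD, has to fit in driver memory at once.
  rdd.toLocalIterator.foreach { record =>
    println(record)  // placeholder processing
  }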
On Fri, Aug 1, 2014 at 1:38 AM, Andrei wrote:
> Is there a way to get iterator from RDD? Something like rdd.colle
Make sure to set it before you start your SparkContext -- it cannot be
changed afterwards. Be warned that there are some known issues with shuffle
file consolidation, which should be fixed in 1.1.
On Thu, Jul 31, 2014 at 12:40 PM, Jianshi Huang
wrote:
> I got the number from the Hadoop admin. I