Re: Incorrect Results and SIGSEGV on Read with Iceberg + PySpark + Nessie

2025-02-06 Thread Aaron Grubb
Someone just replied to the bug; it was already known and will be fixed in the upcoming Iceberg 1.7.2 release. On Thu, 2025-02-06 at 09:35 +, Aaron Grubb wrote: > Hi all, > > I filed a bug with the Iceberg team [1] but I'm not sure that it's 100% > specific to I

Incorrect Results and SIGSEGV on Read with Iceberg + PySpark + Nessie

2025-02-06 Thread Aaron Grubb
sing aarch64 somehow? All JARs are precompiled versions. Thanks, Aaron [1] https://github.com/apache/iceberg/issues/12178 - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Storing a JDBC-based table in a catalog for direct use in Spark SQL

2025-01-14 Thread Aaron Grubb
2 On Mon, 2025-01-13 at 18:49 +0000, Aaron Grubb wrote: > Hi all, > > I'm trying to figure out how to persist a table definition in a catalog that > can be used from different sessions. Something along the lines > of > > --- > CREATE TABLE spark_catal

Storing a JDBC-based table in a catalog for direct use in Spark SQL

2025-01-13 Thread Aaron Grubb
;. I have also tried setting spark.sql.warehouse.dir to a location in S3 and setting enableHiveSupport() on the session, however creating this table under these circumstances only creates an empty directory and the table doesn't show up in the next session. Would I need to
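
For illustration, a minimal sketch of the kind of persistent JDBC table definition being asked about, assuming a Hive-backed metastore; all table, URL, and credential names here are placeholders, not from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("jdbc-catalog-table")
      .enableHiveSupport() // a persistent metastore is what survives across sessions
      .getOrCreate()

    // USING jdbc stores the connection options with the table definition,
    // so later sessions can query it directly from Spark SQL.
    spark.sql("""
      CREATE TABLE spark_catalog.default.orders
      USING jdbc
      OPTIONS (
        url 'jdbc:mysql://db-host:3306/shop',
        dbtable 'orders',
        user 'reader',
        password '***'
      )
    """)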

Re: Hitting SPARK-45858 on Kubernetes - Unavoidable bug or misconfiguration?

2024-08-20 Thread Aaron Grubb
at 13:01 +0000, Aaron Grubb wrote: > Hi all, > > I'm running Spark on Kubernetes on AWS using only spot instances for > executors with dynamic allocation enabled. This particular job is > being > triggered by Airflow and it hit this bug [1] 6 times in a row. However, I had

Hitting SPARK-45858 on Kubernetes - Unavoidable bug or misconfiguration?

2024-08-19 Thread Aaron Grubb
dynamic spark.sql.streaming.kafka.useDeprecatedOffsetFetching false spark.submit.deployMode cluster Thanks, Aaron [1] https://issues.apache.org/jira/browse/SPARK-45858

Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Aaron Grubb
Downgrade to hadoop-*:3.3.x; Hadoop 3.4.x is based on the AWS SDK v2 and should probably be considered breaking for tools that build on < 3.4.0 while using AWS. From: Oxlade, Dan Sent: Wednesday, April 3, 2024 2:41:11 PM To: user@spark.apache.org Subject: [Sp
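
If the build uses sbt, pinning the Hadoop artifacts back to the 3.3 line might look like this sketch (the version number is illustrative only):

    // build.sbt: keep all hadoop-* artifacts on the 3.3 line (AWS SDK v1)
    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-client" % "3.3.6",
      "org.apache.hadoop" % "hadoop-aws"    % "3.3.6"
    )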

Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-06 Thread Aaron Grubb
Hi Mich, Thanks a lot for the insight, it was very helpful. Aaron On Thu, 2023-01-05 at 23:44 +, Mich Talebzadeh wrote: Hi Aaron, Thanks for the details. It is a general practice when running Spark on premise to use Hadoop clusters.<https://spark.apache.org/faq.html#:~:text=How%20d

Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Aaron Grubb
nd HBase are running provide localization benefits when Spark reads from HBase, or are localization benefits negligible and it's a better idea to put Spark in a standalone cluster? Thanks for your time, Aaron On Thu, 2023-01-05 at 19:00 +, Mich Talebzadeh wrote: Few questions * As I unde

Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Aaron Grubb
same nodes provide a benefit when using hbase-connectors (https://github.com/apache/hbase-connectors/tree/master/spark)? Is there a mechanism in the connector to "pass through" a short circuit read to Spark, or would data always bounce from HDFS -> RegionServer -> Spark? Thanks in advance, Aaron

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-27 Thread Aaron Perrin
I'm assuming some things here, but hopefully I understand. So, basically you have a big table of data distributed across a bunch of executors. And, you want an efficient way to call a native method for each row. It sounds similar to a dataframe writer to me. Except, instead of writing to disk or n

Multiple quantile calculations

2017-01-31 Thread Aaron Perrin
I want to calculate quantiles on two different columns. I know that I can calculate them with two separate operations. However, for performance reasons, I'd like to calculate both with one operation. Is this possible to do this with the Dataset API? I'm assuming that it isn't. But, if it isn't, i
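
For reference, later Spark releases (2.2+) added DataFrameStatFunctions.approxQuantile over several columns at once, which covers this in a single pass; a sketch with assumed column names:

    // df: an existing org.apache.spark.sql.DataFrame with numeric columns.
    // One action computes approximate quantiles for both columns at once.
    val Array(latencyQs, sizeQs) = df.stat.approxQuantile(
      Array("latency_ms", "payload_bytes"), // column names (assumed)
      Array(0.5, 0.95, 0.99),               // requested quantiles
      0.01)                                 // relative error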

Re: Tuning spark.executor.cores

2017-01-09 Thread Aaron Perrin
That setting defines the total number of tasks that an executor can run in parallel. Each node is partitioned into executors, each with identical heap and cores. So, it can be a balancing act to optimally set these values, particularly if the goal is to maximize CPU usage with memory and other IO.
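
As a hedged illustration of that balancing act, carving a 16-core / 64 GB worker into identical executors might look like this (the numbers are examples only):

    // Example only: 4 executors per node, identical heaps and cores.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.cores", "4")    // 4 parallel tasks per executor
      .set("spark.executor.memory", "14g") // 4 x 14g heaps leave headroom per node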

Please unsubscribe me from this mailing list

2016-09-26 Thread Hogancamp, Aaron
Please unsubscribe aaron.t.hoganc...@leidos.com from this mailing list. Thanks, Aaron Hogancamp Data Scientist (615) 431-3229 (desk) (615) 617-7160 (mobile)

Left Join Yields Results And Not Results

2016-09-24 Thread Aaron Jackson
oes updated.FieldId show 123 as well, when the expanded join for 'updated.*' shows null. I can do what I want by using an RDD, but I was hoping to avoid bypassing tungsten. It almost feels like it's optimizing the field based on the join. But I tested other fields as well and they also came back with values from base. Very odd. Any thoughts? Aaron

Unsubscribe

2016-08-09 Thread Hogancamp, Aaron
Unsubscribe. Thanks, Aaron Hogancamp Data Scientist

Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Aaron Ilovici
Your error stems from spark.ml, and in your pom mllib is the only dependency that is 2.10. Is there a reason for this? I.e., you tell Maven that mllib 2.10 is provided at runtime. Is 2.10 on the machine, or is 2.11? -Aaron From: VG Date: Friday, July 22, 2016 at 1:49 PM To: Sean Owen Cc: User

Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Aaron Ilovici
What version of Spark/Scala are you running? -Aaron

Heavy Stage Concentration - Ends With Failure

2016-07-19 Thread Aaron Jackson
Hi, I have a cluster with 15 nodes of which 5 are HDFS nodes. I kick off a job that creates some 120 stages. Eventually, the active and pending stages reduce down to a small bottleneck and it never fails... the tasks associated with the 10 (or so) running tasks are always allocated to the same e

Re: Possible to broadcast a function?

2016-06-30 Thread Aaron Perrin
> then resume execution. This works, but it ends up costing me a lot of extra > memory (i.e. a few TiB when I have a lot of executors). > > What I'd like to do is use the broadcast mechanism to load the data structure > once, per node. But, I can't serialize the data structure from the driver. > > Any ideas? > > Thanks! > > Aaron >

Re: Possible to broadcast a function?

2016-06-29 Thread Aaron Perrin
5 GiB? Or, do I have to decrease executor memory to ~385 across all executors? (Note: I'm running on Yarn, which may affect this.) Thanks, Aaron On Wed, Jun 29, 2016 at 12:09 PM Sean Owen wrote: > If you have one executor per machine, which is the right default thing > to do, and th

Possible to broadcast a function?

2016-06-29 Thread Aaron Perrin
e a lot of extra memory (i.e. a few TiB when I have a lot of executors). What I'd like to do is use the broadcast mechanism to load the data structure once, per node. But, I can't serialize the data structure from the driver. Any ideas? Thanks! Aaron
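
One common pattern for the "load once per executor" goal when the structure can't be serialized from the driver is a lazily initialized singleton that each executor builds locally on first use; a minimal sketch (the map is a stand-in for the real structure):

    import org.apache.spark.rdd.RDD

    object Holder {
      // Evaluated at most once per executor JVM; never serialized from the driver.
      lazy val data: Map[String, Int] = {
        // stand-in for the expensive, non-serializable load
        Map("a" -> 1, "b" -> 2)
      }
    }

    def lookupAll(rdd: RDD[String]): RDD[Int] =
      rdd.mapPartitions { iter =>
        val d = Holder.data // built lazily on the executor, shared across tasks
        iter.map(k => d.getOrElse(k, 0))
      }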

Re: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-27 Thread Aaron Ilovici
elp, AARON ILOVICI Software Engineer Marketing Engineering WAYFAIR 4 Copley Place Boston, MA 02116 (617) 532-6100 x1231 ailov...@wayfair.com From: Reynold Xin Date: Thursday, May 26, 2016 at 6:11 PM To: Mohammed Guller

JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Aaron Ilovici
7; - is None mapped to the proper NULL type elsewhere? My environment: Spark 1.6, Vertica Driver 7.2.2, Java 1.7 I would be happy to create a Jira and submit a pull request with the VerticaDialect once I figure this out. Thank you for any insight on this, AARON ILOVICI Software Engineer

S3A Creating Task Per Byte (pyspark / 1.6.1)

2016-05-12 Thread Aaron Jackson
I'm using the spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's in s3. I've done this previously with spark 1.5 with no issue. Attempting to load and count a single file as follows: dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv') dataFrame.count() But when it attem

Re: Best way to determine # of workers

2016-03-25 Thread Aaron Jackson
I think the SparkListener is about as close as it gets. That way I can start up the instance (aws, open-stack, vmware, etc) and simply wait until the SparkListener indicates that the executors are online before starting. Thanks for the advice. Aaron On Fri, Mar 25, 2016 at 10:54 AM, Jacek

Re: Best way to determine # of workers

2016-03-24 Thread Aaron Jackson
specific case, I may be growing the cluster size by a hundred nodes and if I fail to wait for that initialization to complete the job will not have enough memory to run my jobs. Aaron On Thu, Mar 24, 2016 at 3:07 AM, Takeshi Yamamuro wrote: > Hi, > > There is no way to get such information

Re: Passing parameters to spark SQL

2015-12-28 Thread Aaron Jackson
I went down the SQL path. The problem is the loss of type and the possibility of SQL injection. No biggie, just means that where parameterized queries are in play, we'll have to write it out in code rather than in SQL. Thanks, Aaron On Sun, Dec 27, 2015 at 8:06 PM, Michael Armbrust wrote
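
A sketch of what "in code rather than in SQL" can look like: the DataFrame API keeps the parameter typed and avoids string interpolation (table and column names are assumed):

    import org.apache.spark.sql.functions.col

    // spark: an existing SparkSession; "events"/"user_id" are assumed names.
    // Instead of interpolating into SQL text (injection-prone, type-less):
    val userId: Long = 42L
    val result = spark.table("events").where(col("user_id") === userId)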

Re: Spark-shell connecting to Mesos stuck at sched.cpp

2015-12-16 Thread Aaron
tname/ip in mesos configuration - see Nikolaos answer > Cheers, Aaron On Mon, Nov 16, 2015 at 9:37 PM, Jo Voordeckers wrote: > I've seen this issue when the mesos cluster couldn't figure out my IP > address correctly, have you tried setting the ENV var with your IP address > wh

Increasing memory usage on batch job (pyspark)

2015-12-01 Thread Aaron Jackson
Greetings, I am processing a "batch" of files and have structured an iterative process around them. Each batch is processed by first loading the data with spark-csv, performing some minor transformations and then writing back out as parquet. Absolutely no caching or shuffle should occur with anyt

Re: Python script runs fine in local mode, errors in other modes

2015-09-28 Thread Aaron
mber 28, 2015 at 1:35 PM To: Aaron Dossett mailto:aaron.doss...@target.com>> Subject: Re: Python script runs fine in local mode, errors in other modes Was there any eventual solution to this that you discovered? If you reply to this email, your messag

Re: Is SPARK-3322 fixed in latest version of Spark?

2015-08-04 Thread Aaron Davidson
ConnectionManager has been deprecated and is no longer used by default (NettyBlockTransferService is the replacement). Hopefully you would no longer see these messages unless you have explicitly flipped it back on. On Tue, Aug 4, 2015 at 6:14 PM, Jim Green wrote: > And also https://issues.apache

Spark 1.4.1,MySQL and DataFrameReader.read.jdbc fun

2015-07-20 Thread Aaron
h ? If I don't use this cmd line option, I get an error just attempting to do the sqlContext.read.jdbc() assignment..not trying to perform an operation on the RDD. Cheers, Aaron - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: S3 vs HDFS

2015-07-11 Thread Aaron Davidson
Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use a fixed part size aligned with Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned. On Sat, Jul 11, 2015 at 11:14 AM, Steve

Re: All master are unreponsive issue

2015-07-05 Thread Aaron Davidson
Are you seeing this after the app has already been running for some time, or just at the beginning? Generally, registration should only occur once initially, and a timeout would be due to the master not being accessible. Try telneting to the master IP/port from the machine on which the driver will

Re: s3 bucket access/read file

2015-07-01 Thread Aaron Davidson
I think 2.6 failed to abruptly close streams that weren't fully read, which we observed as a huge performance hit. We had to backport the 2.7 improvements before being able to use it.

Re: s3 bucket access/read file

2015-06-30 Thread Aaron Davidson
Should be able to use s3a (on new hadoop versions), I believe that will try or at least has a setting for v4 On Tue, Jun 30, 2015 at 8:31 PM, Exie wrote: > Not sure if this helps, but the options I set are slightly different: > > val hadoopConf=sc.hadoopConfiguration > hadoopConf.set("fs.s3n.aws

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-25 Thread Aaron
Yep! That was it. Using the 1.6.0rc3 that comes with spark, rather than using the 1.5.0-cdh5.4.2 version. Thanks for the help! Cheers, Aaron On Thu, Jun 25, 2015 at 8:24 AM, Sean Owen wrote: > Hm that looks like a Parquet version mismatch then. I think Spark 1.4 > uses 1.6? You

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-25 Thread Aaron
ote: > You didn't provide any error? > > You're compiling vs Hive 1.1 here and that is the problem. It is nothing > to do with CDH. > > On Wed, Jun 24, 2015, 10:15 PM Aaron wrote: > >> I was curious if any one was able to get CDH 5.4.1 or 5.4.2 compiling >&g

Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-24 Thread Aaron
n't lead me anywhere... thoughts? help? URLs to read? Thanks in advance. Cheers, Aaron

RE: ReduceByKey with a byte array as the key

2015-06-11 Thread Aaron Davidson
Be careful shoving arbitrary binary data into a string, invalid utf characters can cause significant computational overhead in my experience. On Jun 11, 2015 10:09 AM, "Mark Tse" wrote: > Makes sense – I suspect what you suggested should work. > > > > However, I think the overhead between this a
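
A hedged alternative to stringifying binary keys: wrap the bytes in something with value-based equals/hashCode, such as an immutable Seq, before the shuffle (values here are placeholders):

    // sc: an existing SparkContext. Array[Byte] compares by reference, so it
    // cannot serve as a shuffle key directly; an immutable Seq[Byte] can.
    val pairs = sc.parallelize(Seq(
      (Array[Byte](1, 2), 1L),
      (Array[Byte](1, 2), 2L)))
    val summed = pairs
      .map { case (k, v) => (k.toSeq, v) } // Seq has value-based equals/hashCode
      .reduceByKey(_ + _)                  // the two records now merge under one key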

Re: RDD resiliency -- does it keep state?

2015-03-28 Thread Aaron Davidson
Note that speculation is off by default to avoid these kinds of unexpected issues. On Sat, Mar 28, 2015 at 6:21 AM, Steve Loughran wrote: > > It's worth adding that there's no guaranteed that re-evaluated work would > be on the same host as before, and in the case of node failure, it is not > gu

Re: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-17 Thread Aaron Davidson
Actually, this is the more relevant JIRA (which is resolved): https://issues.apache.org/jira/browse/SPARK-3595 6352 is about saveAsParquetFile, which is not in use here. Here is a DirectOutputCommitter implementation: https://gist.github.com/aarondav/c513916e72101bbe14ec and it can be configured

Re: Having lots of FetchFailedException in join

2015-03-05 Thread Aaron Davidson
> > > > > > Jianshi > > > > > > > > On Thu, Mar 5, 2015 at 2:44 PM, Shao, Saisai > wrote: > > Hi Jianshi, > > > > From my understanding, it may not be the problem of NIO or Netty, looking > at your stack trace, the OOM is occ

Re: Which OutputCommitter to use for S3?

2015-03-05 Thread Aaron Davidson
one runs into the same problem I had. > >> > >> By setting --hadoop-major-version=2 when using the ec2 scripts, > >> everything worked fine. > >> > >> Darin. > >> > >> > >> - Original Message - > >> From: Darin McBeath &

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Aaron Davidson
ChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Aaron Davidson
"Failed to connect" implies that the executor at that host died, please check its logs as well. On Tue, Mar 3, 2015 at 11:03 AM, Jianshi Huang wrote: > Sorry that I forgot the subject. > > And in the driver, I got many FetchFailedException. The error messages are > > 15/03/03 10:34:32 WARN TaskS

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Aaron Davidson
All stated symptoms are consistent with GC pressure (other nodes timeout trying to connect because of a long stop-the-world), quite possibly due to groupByKey. groupByKey is a very expensive operation as it may bring all the data for a particular partition into memory (in particular, it cannot spil

Re: Worker and Nodes

2015-02-21 Thread Aaron Davidson
Note that the parallelism (i.e., number of partitions) is just an upper bound on how much of the work can be done in parallel. If you have 200 partitions, then the work can be divided among anywhere from 1 to 200 cores and all resources will remain utilized. If you have more than 200 cores, though, then
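
A small sketch of that upper bound, with assumed numbers:

    // sc: an existing SparkContext. 200 partitions cap the parallelism at
    // 200 tasks, no matter how many cores the cluster has.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 200)
    // On, say, 400 cores, widening the RDD keeps every core busy:
    val wider = rdd.repartition(400)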

Re: Which OutputCommitter to use for S3?

2015-02-21 Thread Aaron Davidson
Here is the class: https://gist.github.com/aarondav/c513916e72101bbe14ec You can use it by setting "mapred.output.committer.class" in the Hadoop configuration (or "spark.hadoop.mapred.output.committer.class" in the Spark configuration). Note that this only works for the old Hadoop APIs, I believe
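
Sketching the Spark-side route mentioned above; the committer's package is assumed from the gist, so treat the class name as an example:

    // The "spark.hadoop." prefix forwards the setting into the Hadoop
    // Configuration. Old ("mapred") API only, per the thread.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.hadoop.mapred.output.committer.class",
           "org.apache.spark.DirectOutputCommitter") // package assumed
    val sc = new org.apache.spark.SparkContext(conf)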

Re: RangePartitioner in Spark 1.2.1

2015-02-17 Thread Aaron Davidson
RangePartitioner does not actually provide a guarantee that all partitions will be equal sized (that is hard), and instead uses sampling to approximate equal buckets. Thus, it is possible that a bucket is left empty. If you want the specified behavior, you should define your own partitioner. It wo

Re: Shuffle write increases in spark 1.2

2015-02-15 Thread Aaron Davidson
I think Xuefeng Wu's suggestion is likely correct. This difference is more likely explained by the compression library changing versions than by sort vs. hash shuffle (which should not affect output size significantly). Others have reported that switching to lz4 fixed their issue. We should document th

Re: Shuffle read/write issue in spark 1.2

2015-02-06 Thread Aaron Davidson
Did the problem go away when you switched to lz4? There was a change in the default compression codec from 1.0 to 1.1, where we went from LZF to Snappy. I don't think there was any such change from 1.1 to 1.2, though. On Fri, Feb 6, 2015 at 12:17 AM, Praveen Garg wrote: > We tried changing the

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-04 Thread Aaron Davidson
The latter would be faster. With S3, you want to maximize number of concurrent readers until you hit your network throughput limits. On Wed, Feb 4, 2015 at 6:20 AM, Peter Rudenko wrote: > Hi if i have a 10GB file on s3 and set 10 partitions, would it be > download whole file on master first and

Re: 2GB limit for partitions?

2015-02-03 Thread Aaron Davidson
To be clear, there is no distinction between partitions and blocks for RDD caching (each RDD partition corresponds to 1 cache block). The distinction is important for shuffling, where by definition N partitions are shuffled into M partitions, creating N*M intermediate blocks. Each of these blocks m

Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Aaron Davidson
Ah, this is in particular an issue due to sort-based shuffle (it was not the case for hash-based shuffle, which would immediately serialize each record rather than holding many in memory at once). The documentation should be updated. On Fri, Jan 30, 2015 at 11:27 AM, Sandy Ryza wrote: > Hi Andre

Re: performance of saveAsTextFile moving files from _temporary

2015-01-28 Thread Aaron Davidson
gt; code logs, but the job sits there as the moving of files happens. > > On Tue, Jan 27, 2015 at 7:24 PM, Aaron Davidson > wrote: > >> This renaming from _temporary to the final location is actually done by >> executors, in parallel, for saveAsTextFile. It should be perfor

Re: performance of saveAsTextFile moving files from _temporary

2015-01-27 Thread Aaron Davidson
This renaming from _temporary to the final location is actually done by executors, in parallel, for saveAsTextFile. It should be performed by each task individually before it returns. I have seen an issue similar to what you mention dealing with Hive code which did the renaming serially on the dri

Re: Lost task - connection closed

2015-01-26 Thread Aaron Davidson
It looks like something weird is going on with your object serialization, perhaps a funny form of self-reference which is not detected by ObjectOutputStream's typical loop avoidance. That, or you have some data structure like a linked list with a parent pointer and you have many thousand elements.

Re: Lost task - connection closed

2015-01-25 Thread Aaron Davidson
Please take a look at the executor logs (on both sides of the IOException) to see if there are other exceptions (e.g., OOM) which precede this one. Generally, the connections should not fail spontaneously. On Sun, Jan 25, 2015 at 10:35 PM, octavian.ganea wrote: > Hi, > > I am running a program t

Re: Spark 1.2 – How to change Default (Random) port ….

2015-01-25 Thread Aaron Davidson
This was a regression caused by Netty Block Transfer Service. The fix for this just barely missed the 1.2 release, and you can see the associated JIRA here: https://issues.apache.org/jira/browse/SPARK-4837 Current master has the fix, and the Spark 1.2.1 release will have it included. If you don't

Re: Spark 1.2 - com/google/common/base/Preconditions java.lang.NoClassDefFoundErro

2015-01-20 Thread Aaron Davidson
Spark's network-common package depends on guava as a "provided" dependency in order to avoid conflicting with other libraries (e.g., Hadoop) that depend on specific versions. com/google/common/base/Preconditions has been present in Guava since version 2, so this is likely a "dependency not found" r

Re: Serializability: for vs. while loops

2015-01-15 Thread Aaron Davidson
Scala for-loops are implemented as closures using anonymous inner classes which are instantiated once and invoked many times. This means, though, that the code inside the loop is actually sitting inside a class, which confuses Spark's Closure Cleaner, whose job is to remove unused references from c

Re: use netty shuffle for network cause high gc time

2015-01-13 Thread Aaron Davidson
What version are you running? I think "spark.shuffle.use.netty" was a valid option only in Spark 1.1, where the Netty stuff was strictly experimental. Spark 1.2 contains an officially supported and much more thoroughly tested version under the property "spark.shuffle.blockTransferService", which is
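
For reference, a sketch of toggling between the two services on Spark 1.2 (property value taken from the thread; "netty" is the supported default there):

    val conf = new org.apache.spark.SparkConf()
      // "nio" falls back to the legacy ConnectionManager-based service
      .set("spark.shuffle.blockTransferService", "nio")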

Re: FileNotFoundException in appcache shuffle files

2015-01-10 Thread Aaron Davidson
As Jerry said, this is not related to shuffle file consolidation. The unique thing about this problem is that it's failing to find a file while trying to _write_ to it, in append mode. The simplest explanation for this would be that the file is deleted in between some check for existence and openi

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2015-01-08 Thread Aaron Davidson
Do note that this problem may be fixed in Spark 1.2, as we changed the default transfer service to use a Netty-based one rather than the ConnectionManager. On Thu, Jan 8, 2015 at 7:05 AM, Spidy wrote: > Hi, > > Can you please explain which settings did you changed? > > > > -- > View this message

Re: Spark Driver "behind" NAT

2015-01-06 Thread Aaron
Found the issue in JIRA: https://issues.apache.org/jira/browse/SPARK-4389?jql=project%20%3D%20SPARK%20AND%20text%20~%20NAT On Tue, Jan 6, 2015 at 10:45 AM, Aaron wrote: > From what I can tell, this isn't a "firewall" issue per se..it's how the > Remoting Service "

Re: Spark Driver "behind" NAT

2015-01-06 Thread Aaron
to tell the workers which IP address to use..WITHOUT, binding to it maybe? Maybe allow the Remoting Service to bind to the internal IP..but, advertise it differently? On Mon, Jan 5, 2015 at 9:02 AM, Aaron wrote: > Thanks for the link! However, from reviewing the thread, it appears you > c

Re: Spark Driver "behind" NAT

2015-01-05 Thread Aaron
, Aaron On Mon, Jan 5, 2015 at 8:28 AM, Akhil Das wrote: > You can have a look at this discussion > http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html > > Thanks > Best Regards > > On Mon, J

Spark Driver "behind" NAT

2015-01-05 Thread Aaron
ious --conf spark.driver.host parameters...but it still gets "angry." Any thoughts/suggestions? Currently our workaround is a VPNC connection from inside the vagrant VMs or Openstack instances...but, that doesn't seem like a long term plan. Thanks in advance! Cheers, Aaron

Re: RDDs being cleaned too fast

2014-12-10 Thread Aaron Davidson
The ContextCleaner uncaches RDDs that have gone out of scope on the driver. So it's possible that the given RDD is no longer reachable in your program's control flow, or else it'd be a bug in the ContextCleaner. On Wed, Dec 10, 2014 at 5:34 PM, ankits wrote: > I'm using spark 1.1.0 and am seeing

Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Aaron Davidson
You can actually submit multiple jobs to a single SparkContext in different threads. In the case you mentioned with 2 stages having a common parent, both will wait for the parent stage to complete and then the two will execute in parallel, sharing the cluster resources. Solutions that submit multi
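
A minimal sketch of submitting two jobs to one SparkContext from separate threads, as described (the RDD contents are placeholders):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // sc: an existing SparkContext shared by both threads. With cache(),
    // the common parent is typically computed once and reused by the
    // second job rather than recomputed.
    val parent = sc.parallelize(1 to 1000000).map(_ * 2).cache()
    val f1 = Future { parent.filter(_ % 3 == 0).count() }
    val f2 = Future { parent.filter(_ % 5 == 0).count() }
    val counts = (Await.result(f1, Duration.Inf), Await.result(f2, Duration.Inf))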

Re: Announcing Spark 1.1.1!

2014-12-03 Thread Aaron Davidson
Because this was a maintenance release, we should not have introduced any binary backwards or forwards incompatibilities. Therefore, applications that were written and compiled against 1.1.0 should still work against a 1.1.1 cluster, and vice versa. On Wed, Dec 3, 2014 at 1:30 PM, Andrew Or wrote

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread Aaron Davidson
new s3a filesystem in Hadoop 2.6.0 [1]. > > 1. > https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400 > On Nov 26, 2014 12:24 PM, "Aaron Davidson" wrote: > >> Spark has a known problem where it will do a pass of metadata on a large >> num

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Aaron Davidson
Spark has a known problem where it will do a pass of metadata on a large number of small files serially, in order to find the partition information prior to starting the job. This will probably not be repaired by switching the FS impl. However, you can change the FS being used like so (prior to th
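
The snippet is cut off, but the pattern being described presumably resembles the following; the property and implementation class here are examples of the mechanism, not the thread's exact code:

    // sc: an existing SparkContext. Must run before the filesystem is
    // first touched by the job.
    sc.hadoopConfiguration.set("fs.s3.impl",
      "org.apache.hadoop.fs.s3native.NativeS3FileSystem") // example impl class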

Re: Bug in Accumulators...

2014-11-23 Thread Aaron Davidson
As Mohit said, making Main extend Serializable should fix this example. In general, it's not a bad idea to mark the fields you don't want to serialize (e.g., sc and conf in this case) as @transient as well, though this is not the issue in this case. Note that this problem would not have arisen in

Re: Given multiple .filter()'s, is there a way to set the order?

2014-11-14 Thread Aaron Davidson
In the situation you show, Spark will pipeline each filter together, and will apply each filter one at a time to each row, effectively constructing an "&&" statement. You would only see a performance difference if the filter code itself is somewhat expensive, then you would want to only execute it
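
A small sketch of the point about ordering, with a stand-in for the expensive predicate:

    // sc: an existing SparkContext; expensiveCheck mimics a costly predicate.
    def expensiveCheck(x: Int): Boolean = { Thread.sleep(1); x > 10 }
    val nums = sc.parallelize(1 to 100)
    // Pipelined per row, effectively (x % 2 == 0) && expensiveCheck(x);
    // the cheap, selective filter first keeps the costly one off most rows.
    val kept = nums.filter(_ % 2 == 0).filter(expensiveCheck)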

Re: data locality, task distribution

2014-11-13 Thread Aaron Davidson
The 8-minute slowdown seems to be solely > attributable to the data locality issue, as far as I can tell. There was > some further confusion though in the times I mentioned - the list I gave > (3.1 min, 2 seconds, ... 8 min) were not different runs with different > cache %s, they were it

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
n what I meant. > I didn't mean it went down within a run, with the same instance. > > I meant I'd run the whole app, and one time, it would cache 100%, and the > next run, it might cache only 83% > > Within a run, it doesn't change. > > On Wed, Nov 12,

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
The fact that the caching percentage went down is highly suspicious. It should generally not decrease unless other cached data took its place, or unless executors were dying. Do you know if either of these was the case? On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld < nkronenf...@oculusinf

Re: Does spark works on multicore systems?

2014-11-08 Thread Aaron Davidson
oops, meant to cc userlist too On Sat, Nov 8, 2014 at 3:13 PM, Aaron Davidson wrote: > The default local master is "local[*]", which should use all cores on your > system. So you should be able to just do "./bin/pyspark" and > "sc.parallelize(range(1000)).co

Re: Bug in Accumulators...

2014-11-07 Thread Aaron Davidson
This may be due in part to Scala allocating an anonymous inner class in order to execute the for loop. I would expect if you change it to a while loop like var i = 0 while (i < 10) { sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) i += 1 } then the problem may go away. I am not sup

Re: Spark speed performance

2014-11-01 Thread Aaron Davidson
coalesce() is a streaming operation if used without the second parameter, it does not put all the data in RAM. If used with the second parameter (shuffle = true), then it performs a shuffle, but still does not put all the data in RAM. On Sat, Nov 1, 2014 at 12:09 PM, wrote: > Now I am getting to

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-22 Thread Aaron Davidson
Another wild guess, if your data is stored in S3, you might be running into an issue where the default jets3t properties limits the number of parallel S3 connections to 4. Consider increasing the max-thread-counts from here: http://www.jets3t.org/toolkit/configuration.html. On Tue, Oct 21, 2014 at

Re: Shuffle issues in the current master

2014-10-22 Thread Aaron Davidson
You may be running into this issue: https://issues.apache.org/jira/browse/SPARK-4019 You could check by having 2000 or fewer reduce partitions. On Wed, Oct 22, 2014 at 1:48 PM, DB Tsai wrote: > PS, sorry for spamming the mailing list. Based my knowledge, both > spark.shuffle.spill.compress and

Re: input split size

2014-10-18 Thread Aaron Davidson
The "minPartitions" argument of textFile/hadoopFile cannot decrease the number of splits past the physical number of blocks/files. So if you have 3 HDFS blocks, asking for 2 minPartitions will still give you 3 partitions (hence the "min"). It can, however, convert a file with fewer HDFS blocks into
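
Illustrating the "min" semantics described above (paths are placeholders):

    // sc: an existing SparkContext.
    // A file with 3 HDFS blocks still yields 3 partitions here ("min"):
    val three = sc.textFile("hdfs:///data/file-with-3-blocks", minPartitions = 2)
    // Asking for more than the block count does split further:
    val twelve = sc.textFile("hdfs:///data/file-with-3-blocks", minPartitions = 12)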

Re: Getting the type of an RDD in spark AND pyspark

2014-09-06 Thread Aaron Davidson
Pretty easy to do in Scala: rdd.elementClassTag.runtimeClass You can access this method from Python as well by using the internal _jrdd. It would look something like this (warning, I have not tested it): rdd._jrdd.classTag().runtimeClass() (The method name is "classTag" for JavaRDDLike, and "ele

Re: question on replicate() in blockManager.scala

2014-09-06 Thread Aaron Davidson
Looks like that's BlockManagerWorker.syncPutBlock(), which is in an if check, perhaps obscuring its existence. On Fri, Sep 5, 2014 at 2:19 AM, rapelly kartheek wrote: > Hi, > > var cachedPeers: Seq[BlockManagerId] = null > private def replicate(blockId: String, data: ByteBuffer, level: > Stor

Re: error: type mismatch while Union

2014-09-06 Thread Aaron Davidson
Are you doing this from the spark-shell? You're probably running into https://issues.apache.org/jira/browse/SPARK-1199 which should be fixed in 1.1. On Sat, Sep 6, 2014 at 3:03 AM, Dhimant wrote: > I am using Spark version 1.0.2 > > > > > -- > View this message in context: > http://apache-spark

Re: How to change the values in Array of Bytes

2014-09-06 Thread Aaron Davidson
More of a Scala question than Spark, but "apply" here can be written with just parentheses like this: val array = Array.fill[Byte](10)(0) if (array(index) == 0) { array(index) = 1 } The second instance of "array(index) = 1" is actually not calling apply, but "update". It's a scala-ism that's us

Re: disable log4j for spark-shell

2014-08-26 Thread Aaron
If someone doesn't have the access to do that, is there an easy way to specify a different properties file to be used? Patrick Wendell wrote > If you want to customize the logging behavior - the simplest way is to > copy > conf/log4j.properties.template to conf/log4j.properties. Then you can go > and

Re: Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Aaron Davidson
This is likely due to a bug in shuffle file consolidation (which you have enabled) which was hopefully fixed in 1.1 with this patch: https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd Until 1.0.3 or 1.1 are released, the simplest solution is to disable spark.shuffle.co

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Aaron
These three lines of python code cause the error for me: sc = SparkContext(appName="foo") input = sc.textFile("hdfs://[valid hdfs path]") mappedToLines = input.map(lambda myline: myline.split(",")) The file I'm loading is a simple CSV. -- View this message in context: http://apach

Re: Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Aaron
Sure thing, this is the stacktrace from pyspark. It's repeated a few times, but I think this is the unique stuff. Traceback (most recent call last): File "", line 1, in File "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/python/pyspark/rdd.py", line 583, in collect bytesInJa

Python script runs fine in local mode, errors in other modes

2014-08-19 Thread Aaron
g in my own sandbox in purely local mode before. Any help would be appreciated, thanks! -Aaron -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-script-runs-fine-in-local-mode-errors-in-other-modes-tp12390.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark: why need a masterLock when sending heartbeat to master

2014-08-17 Thread Aaron Davidson
Yes, good point, I believe the masterLock is now unnecessary altogether. The reason for its initial existence was that "changeMaster()" originally could be called out-of-band of the actor, and so we needed to make sure the master reference did not change out from under us. Now it appears that all m

Re: s3:// sequence file startup time

2014-08-17 Thread Aaron Davidson
The driver must initially compute the partitions and their preferred locations for each part of the file, which results in a serial getFileBlockLocations() on each part. However, I would expect this to take several seconds, not minutes, to perform on 1000 parts. Is your driver inside or outside of

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
Ah, that's unfortunate, that definitely should be added. Using a pyspark-internal method, you could try something like javaIterator = rdd._jrdd.toLocalIterator() it = rdd._collect_iterator_through_file(javaIterator) On Fri, Aug 1, 2014 at 3:04 PM, Andrei wrote: > Thanks, Aaron, it s

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
rdd.toLocalIterator will do almost what you want, but requires that each individual partition fits in memory (rather than each individual line). Hopefully that's sufficient, though. On Fri, Aug 1, 2014 at 1:38 AM, Andrei wrote: > Is there a way to get iterator from RDD? Something like rdd.colle
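
The Scala-side call being referenced, sketched; partitions are fetched to the driver one at a time, so only a single partition must fit in driver memory:

    // sc: an existing SparkContext. Advancing the iterator triggers one
    // job per partition as needed.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
    rdd.toLocalIterator.take(5).foreach(println)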

Re: spark.shuffle.consolidateFiles seems not working

2014-07-31 Thread Aaron Davidson
Make sure to set it before you start your SparkContext -- it cannot be changed afterwards. Be warned that there are some known issues with shuffle file consolidation, which should be fixed in 1.1. On Thu, Jul 31, 2014 at 12:40 PM, Jianshi Huang wrote: > I got the number from the Hadoop admin. I
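
That is, set it on the SparkConf used to construct the context, not afterwards; a sketch:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("consolidation-example")
      .set("spark.shuffle.consolidateFiles", "true") // must precede the context
    val sc = new org.apache.spark.SparkContext(conf)
    // Changing the property after startup would have no effect.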
