SPARK-3039 ("Spark assembly for new hadoop API (hadoop 2) contains
avro-mapred for hadoop 1 API") was marked resolved with the Spark 1.2.0
release. However, when I download the pre-built Spark distro for Hadoop 2.4
and later (spark-1.2.0-bin-hadoop2.4.tgz) and run it against Avro code
compiled against
It's already fixed in the master branch. Sorry that we forgot to update
this before releasing 1.2.0 and caused you trouble...
Cheng
On 2/2/15 2:03 PM, ankits wrote:
Great, thank you very much. I was confused because this is in the docs:
Actually `SchemaRDD.cache()` behaves exactly the same as `cacheTable`
since Spark 1.2.0. The reason why your web UI didn't show you the cached
table is that both `cacheTable` and `sql("SELECT ...")` are lazy :-)
Simply add a `.collect()` after the `sql(...)` call.
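A plain-Scala analogy for the laziness described above (this is not the Spark API, just an illustrative sketch): a lazy view does no work until something forces it, in the same way that `sql(...)` and `cacheTable` do nothing until an action such as `collect()` runs.

```scala
// Illustrative sketch of lazy evaluation; names here are made up.
var evaluated = 0
val pipeline = (1 to 5).view.map { i => evaluated += 1; i * 2 }

// Nothing has been computed yet, just like sql(...) before an action:
assert(evaluated == 0)

// Forcing the view plays the role of collect():
val result = pipeline.toList
assert(evaluated == 5)
assert(result == List(2, 4, 6, 8, 10))
```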
Cheng
On 2/2/15 12:23 PM,
Great, thank you very much. I was confused because this is in the docs:
https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the
branch-1.2 branch,
https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md
Note that if you call schemaRDD.cache() rather than
Is there a recommended performance test for sort-based shuffle? Something
similar to TeraSort on Hadoop. I couldn't find one in the spark-perf
codebase.
https://github.com/databricks/spark-perf
--
Kannan
It's my fault, I'm sending a hot fix now.
On Mon, Feb 2, 2015 at 1:44 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/
Is this a known issue? It seems to have
Hey Spark developers,
Is there a good reason for JsonRDD being a Scala object as opposed to a
class? Most other RDDs seem to be classes, and can be extended.
The reason I'm asking is that there is a problem with Hive interoperability
with JSON DataFrames where jsonFile generates case sensitive
I'm asking from an experimental standpoint; this is not happening anytime
soon.
Of course, if the experiment turns out very well, Pants would replace both
sbt and Maven (like it has at Twitter, for example). Pants also works with
IDEs http://pantsbuild.github.io/index.html#using-pants-with.
On
There is a significant investment in sbt and maven - and they are not at
all likely to be going away. A third build tool? Note that there is also
the perspective of building within an IDE - which actually works presently
for sbt and with a little bit of tweaking with maven as well.
2015-02-02
It's bad naming - JsonRDD is actually not an RDD. It is just a set of util
methods.
The case sensitivity issues seem orthogonal, and would be great to be able
to control that with a flag.
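To illustrate the object-vs-class distinction under discussion (the names below are made up, not Spark's actual types): a Scala `object` is a singleton and cannot be subclassed, which is why a bag of util methods held in an object cannot be extended the way an RDD class can.

```scala
// Illustrative only; these are not Spark's real APIs.
object JsonUtil {                       // a singleton object: cannot be extended
  def inferType(v: Any): String = v match {
    case _: Int | _: Long => "number"
    case _: String        => "string"
    case _                => "unknown"
  }
}

class Parser {                          // a class: open for extension
  def parse(s: String): String = s.trim
}

class LowerCaseParser extends Parser {  // subclassing overrides behavior
  override def parse(s: String): String = super.parse(s).toLowerCase
}
```

Here `new LowerCaseParser().parse("  Hello ")` yields `"hello"`, whereas `extends JsonUtil` would not even compile.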
On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov daniil.osi...@shazam.com
wrote:
Hey Spark developers,
Is
To reiterate, I'm asking from an experimental perspective. I'm not
proposing we change Spark to build with Pants or anything like that.
I'm interested in trying Pants out and I'm wondering if anyone else shares
my interest or already has experience with Pants that they can share.
On Mon Feb 02
Does anyone here have experience with Pants
http://pantsbuild.github.io/index.html or interest in trying to build
Spark with it?
Pants has an interesting story. It was born at Twitter to help them build
their Scala, Java, and Python projects as several independent components in
one monolithic
Hey All,
I made a change to the Jenkins configuration (attempting to enable a new
plugin) that caused most builds to fail; I've reverted the change,
effective about 10 minutes ago.
If you've seen recent build failures like below, this was caused by
that change. Sorry about that.
ERROR:
+1 (non-binding, of course)
1. Compiled on OS X 10.10 (Yosemite), OK. Total time: 11:13 min
mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark and mllib: ran them and compared results with 1.1.x and
1.2.0
2.1.
This is cancelled in favor of RC2.
On Mon, Feb 2, 2015 at 8:50 PM, Patrick Wendell pwend...@gmail.com wrote:
The windows issue reported only affects actually running Spark on
Windows (not job submission). However, I agree it's worth cutting a
new RC. I'm going to cancel this vote and propose RC3 with a single
additional patch. Let's try to vote that through so we can ship Spark
1.2.1.
- Patrick
On Sat,
Please vote on releasing the following candidate as Apache Spark version 1.2.1!
The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97
The release files, including signatures, digests, etc.
Hi all,
I am trying out the ML pipeline for text classification now.
Recently I succeeded in executing the pipeline processing in the ml
package, which consists of an original Japanese tokenizer, hashingTF, and
logisticRegression.
Then I failed to execute the pipeline with the idf in the mllib package directly.
Hi Kannan,
I have a branch here:
https://github.com/ehiggs/spark/tree/terasort
The code is in the examples. I don't do any fancy partitioning so it
could be made quicker, I'm sure. But it should be a good baseline.
I have a WIP PR for spark-perf but I'm having trouble building it
there[1].
Thanks for your response. So AFAICT calling
parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count()
will allow me to see the size of the SchemaRDD in memory,
and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will
show me the size of a regular RDD.
But
In Hadoop MR, there is an option, *mapred.reduce.slowstart.completed.maps*,
which can be used to start the reduce stage when X% of the mappers have
completed. By doing this, the data shuffling process can run in parallel
with the map phase.
In a large multi-tenancy cluster, this option is usually tuned
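For context, the option described above is set in mapred-site.xml; the value below is only illustrative (in Hadoop 2 the property also appears under the newer name mapreduce.job.reduce.slowstart.completedmaps):

```xml
<!-- Illustrative mapred-site.xml fragment: launch reducers once 80% of
     map tasks have finished, so shuffle overlaps the tail of the map phase. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
```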
Hi all,
I have some questions about the future development of Spark's standalone
resource scheduler. We've heard that some users require multi-tenant
support in standalone mode, such as multi-user management, resource
management and isolation, and a whitelist of users. Seems current
Hey Jerry,
I think standalone mode will still add more features over time, but
the goal isn't really for it to become equivalent to what Mesos/YARN
are today. Or at least, I doubt Spark Standalone will ever attempt to
manage _other_ frameworks outside of Spark and become a general
purpose
Hi Patrick,
Thanks a lot for your detailed explanation. For now we have these
requirements: whitelisting the application submitter, per-user resource
(CPU, memory) quotas, and resource allocation in Spark Standalone mode.
These are quite specific requirements for production use; generally these
problem