i will admit that it does seem like a bad idea to poke jenkins on
friday the 13th, but there's a release that fixes a lot of security
issues:
https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-11-11
i'll set jenkins to stop kicking off any new builds around 5am PST,
and
Relevant link:
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
On Wed, Nov 11, 2015 at 7:31 PM, Reynold Xin wrote:
> Thanks for the email. Can you explain what the difference is between this
> and existing formats such as Parquet/ORC?
>
>
> On
The place of the RDD API in 2.0 is also something I've been wondering
about. I think it may be going too far to deprecate it, but changing
emphasis is something that we might consider. The RDD API came well before
DataFrames and DataSets, so programming guides, introductory how-to
articles and
I know we want to keep breaking changes to a minimum, but I'm hoping that
with Spark 2.0 we can also look at better classpath isolation with user
programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
setting it to true by default, and not allowing any Spark transitive dependencies
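A sketch of what that could look like at submit time; the two properties named above exist today (default false), while the application class and jar names here are hypothetical:

```shell
# Illustrative spark-submit invocation giving user classes precedence over
# Spark's own classpath on both driver and executors.
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.MyApp \
  my-app-assembly.jar
```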
I am not sure what the best practice for this specific problem is, but it’s really
worth thinking about in 2.0, as it is a painful issue for lots of users.
By the way, is this also an opportunity to deprecate the RDD API (or the internal API
only)? Lots of its functionality overlaps with
I didn't notice that I can pass comma-separated paths in the existing API
(SparkContext#textFile), so there's no need for a new API. Thanks all.
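For reference, the behavior mentioned above can be sketched like this; `joinPaths` is a hypothetical helper, the HDFS paths are illustrative, and the `sc.textFile` call (commented out) requires a live SparkContext:

```scala
// SparkContext#textFile accepts a single string of comma-separated paths,
// so multiple inputs can be joined before the call.
def joinPaths(paths: Seq[String]): String = paths.mkString(",")

val inputs = joinPaths(Seq("hdfs:///logs/2015-11-11", "hdfs:///logs/2015-11-12"))

// With a live SparkContext `sc` (requires a Spark runtime):
// val lines = sc.textFile(inputs)
```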
On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang wrote:
> Hi Pradeep
>
> >>> Looks like what I was suggesting doesn't work. :/
> I guess you
Agreed, more features/APIs/optimizations need to be added in DF/DS.
I mean, we need to think about what kinds of RDD APIs we have to provide to
developers; maybe the fundamental APIs are enough, like ShuffledRDD etc.
But PairRDDFunctions is probably not in this category, as we can do the same
Does anyone have ideas about machine learning? Spark is missing some features
for machine learning, for example a parameter server.
> On Nov 12, 2015, at 05:32, Matei Zaharia wrote:
>
> I like the idea of popping out Tachyon to an optional component too to reduce
> the number of
Seems it is back.
On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai wrote:
> Hi Guys,
>
> Seems Jenkins is down or very slow? Does anyone else experience it or just
> me?
>
> Thanks,
>
> Yin
>
I was able to access the following where response was fast:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45806/
Cheers
On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai wrote:
> Hi
My understanding is that RDDs presently have more support for
complete control of partitioning, which is a key consideration at scale.
While partitioning control is still piecemeal in DF/DS, it would seem
premature to make RDDs a second-tier approach to Spark dev.
An example is the use of
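The kind of partitioning control being described can be sketched as follows; the `hashPartition` helper below mirrors what Spark's HashPartitioner does (a non-negative modulo of the key's hashCode), and the commented `partitionBy` call shows how it would be applied to a pair RDD in practice:

```scala
// Mirrors Spark's HashPartitioner: map a key to a partition index in
// [0, numPartitions) using a non-negative modulo of its hashCode.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// On a pair RDD (requires a Spark runtime):
// val partitioned = pairRdd.partitionBy(new org.apache.spark.HashPartitioner(8))
```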
Hmmm... to me, that seems like precisely the kind of thing that argues for
retaining the RDD API, but not as the first thing presented to new Spark
developers: "Here's how to use groupBy with DataFrames. Until the
optimizer is more fully developed, that won't always get you the best
performance
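A sketch of the quoted advice; the DataFrame call (commented out, requires a Spark runtime, column names illustrative) expresses the aggregation in a form the optimizer can plan, and the plain-Scala analogue below shows the same grouping logic on a local collection:

```scala
// DataFrame version (requires a Spark runtime):
// df.groupBy("k").agg(sum("v")).show()

// Same grouping-and-sum logic on a local collection, for illustration only.
val rows = Seq(("a", 1), ("b", 2), ("a", 3))
val grouped: Map[String, Int] =
  rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
```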
Hi Guys,
Seems Jenkins is down or very slow? Does anyone else experience it or just
me?
Thanks,
Yin
I can access it directly from China
> On Nov 13, 2015, at 10:28 AM, Ted Yu wrote:
>
> I was able to access the following where response was fast:
>
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN
>
Yes, I agree with Nan Zhu. I recommend these projects:
https://github.com/dmlc/ps-lite (Apache License 2)
https://github.com/Microsoft/multiverso (MIT License)
Alexander, you may also be interested in the demo (graph on Parameter Server)
Being specific to Parameter Server, I think the current agreement is that PS
shall exist as a third-party library instead of a component of the core code
base, isn’t it?
Best,
--
Nan Zhu
http://codingcat.me
On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
> Who has the idea
Parameter Server is a new feature and thus does not match the goal of 2.0, which is
“to fix things that are broken in the current API and remove certain deprecated
APIs”. At the same time, I would be happy to have that feature.
With regards to machine learning, it would be great to move useful features
from MLlib to ML and deprecate the former. The current structure of two
separate machine learning packages seems to be somewhat confusing.
With regards to GraphX, it would be great to deprecate the use of RDD in
GraphX and
Hi Xiao,
Performance-wise, without the manual tuning the query cannot finish, and
with the tuning it finishes in minutes on TPC-H 100 GB data.
I have created https://issues.apache.org/jira/browse/SPARK-11704 and
https://issues.apache.org/jira/browse/SPARK-11705 for these two
Sorry, apparently I only replied to Reynold; I meant to copy the list as well,
so I'm self-replying and taking the opportunity to illustrate with an
example.
Basically I want to conceptually do this:
val bigDf = sqlContext.sparkContext.parallelize(1 to 100)
  .map(i => (i, 1))
  .toDF("k", "v")
val