[build system] short jenkins downtime tomorrow morning, 11-13-2015 @ 7am PST

2015-11-12 Thread shane knapp
i will admit that it does seem like a bad idea to poke jenkins on friday the 13th, but there's a release that fixes a lot of security issues: https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-11-11 i'll set jenkins to stop kicking off any new builds around 5am PST, and

Re: Support for local disk columnar storage for DataFrames

2015-11-12 Thread Andrew Duffy
Relevant link: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files On Wed, Nov 11, 2015 at 7:31 PM, Reynold Xin wrote: > Thanks for the email. Can you explain what the difference is between this > and existing formats such as Parquet/ORC? > > > On
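For context, the Parquet support linked above already gives DataFrames columnar storage on local disk via `df.write.parquet(path)` / `sqlContext.read.parquet(path)`. A toy pure-Scala sketch (no Spark dependency; `Record`, `rows`, and the column names are hypothetical) of the row-vs-column layout distinction the thread is about:

```scala
// Toy sketch of row-oriented vs column-oriented (Parquet-style) layout.
// In Spark the actual on-disk step is simply df.write.parquet(path).
case class Record(k: Int, v: String)
val rows = Seq(Record(1, "a"), Record(2, "b"), Record(3, "c"))

// A columnar format stores each field contiguously, which is what makes
// column pruning and per-column compression cheap:
val colK: Seq[Int]    = rows.map(_.k)
val colV: Seq[String] = rows.map(_.v)
```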

Re: A proposal for Spark 2.0

2015-11-12 Thread Mark Hamstra
The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and

Re: A proposal for Spark 2.0

2015-11-12 Thread Kostas Sakellis
I know we want to keep breaking changes to a minimum, but I'm hoping that with Spark 2.0 we can also look at better classpath isolation for user programs. I propose we build on spark.{driver|executor}.userClassPathFirst, setting it to true by default, and not allowing any Spark transitive dependencies
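The settings being proposed as defaults already exist as opt-in flags in Spark 1.x; a sketch of the current opt-in form (the application class and jar name are hypothetical):

```shell
# Today's opt-in form; the proposal would flip these defaults to true in 2.0.
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.Main \
  app.jar
```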

RE: A proposal for Spark 2.0

2015-11-12 Thread Cheng, Hao
I am not sure what the best practice is for this specific problem, but it's really worth thinking about in 2.0, as it is a painful issue for lots of users. By the way, is it also an opportunity to deprecate the RDD API (or make it internal-only?)? As lots of its functionality overlaps with

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-12 Thread Jeff Zhang
Didn't notice that I can pass comma-separated paths to the existing API (SparkContext#textFile). So no need for a new API. Thanks all. On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang wrote: > Hi Pradeep > > >>> Looks like what I was suggesting doesn't work. :/ > I guess you
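To spell out the point: SparkContext#textFile accepts a comma-separated list of paths in a single string, so multiple inputs can be joined before the call. A minimal sketch (the paths in `parts` are hypothetical):

```scala
// SparkContext#textFile takes a comma-separated path string, so several
// inputs can be combined without any new API:
val parts = Seq("/data/2015/11/11/*.log", "/data/2015/11/12/*.log")
val combined = parts.mkString(",")
// In Spark: sc.textFile(combined) reads all of them as one RDD.
```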

RE: A proposal for Spark 2.0

2015-11-12 Thread Cheng, Hao
Agreed, more features/APIs/optimizations need to be added to DF/DS. I mean, we need to think about what kinds of RDD APIs we have to provide to developers; maybe the fundamental APIs are enough, like ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as we can do the same
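The overlap being claimed can be sketched without Spark: the aggregation PairRDDFunctions.reduceByKey performs is equally expressible as a DataFrame groupBy (`rdd.reduceByKey(_ + _)` roughly corresponds to `df.groupBy("k").sum("v")`). A pure-Scala model of that computation, with hypothetical sample data:

```scala
// Pure-Scala model of what reduceByKey computes on key-value pairs:
val pairs = Seq(("a", 1), ("b", 2), ("a", 3))
val reduced: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
// The same result Spark would give via rdd.reduceByKey(_ + _)
// or, on a DataFrame with columns k and v, df.groupBy("k").sum("v").
```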

Re: A proposal for Spark 2.0

2015-11-12 Thread witgo
Who has ideas about machine learning? Spark is missing some features for machine learning, for example the parameter server. > On Nov 12, 2015, at 05:32, Matei Zaharia wrote: > > I like the idea of popping out Tachyon to an optional component too to reduce > the number of

Re: Seems jenkins is down (or very slow)?

2015-11-12 Thread Yin Huai
Seems it is back. On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai wrote: > Hi Guys, > > Seems Jenkins is down or very slow? Does anyone else experience it or just > me? > > Thanks, > > Yin >

Re: Seems jenkins is down (or very slow)?

2015-11-12 Thread Ted Yu
I was able to access the following where response was fast: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45806/ Cheers On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai wrote: > Hi

Re: A proposal for Spark 2.0

2015-11-12 Thread Stephen Boesch
My understanding is that RDDs presently have more support for complete control of partitioning, which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS, it would seem premature to make RDDs a second-tier approach to Spark development. An example is the use of
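The partitioning control in question is exposed through `rdd.partitionBy(new HashPartitioner(n))` and custom Partitioner subclasses. A sketch of the key-to-partition mapping Spark's HashPartitioner applies (non-negative modulo of hashCode; `partitionFor` is a hypothetical name for illustration):

```scala
// What HashPartitioner.getPartition does, in essence: hash the key and
// take a non-negative modulo so every key lands in [0, numPartitions).
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}
```

RDD partitionBy lets the programmer pin this mapping (e.g. to co-partition two datasets before a join), which DF/DS in 1.x did not yet expose with the same precision.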

Re: A proposal for Spark 2.0

2015-11-12 Thread Mark Hamstra
Hmmm... to me, that seems like precisely the kind of thing that argues for retaining the RDD API, but not as the first thing presented to new Spark developers: "Here's how to use groupBy with DataFrames. Until the optimizer is more fully developed, that won't always get you the best performance

Seems jenkins is down (or very slow)?

2015-11-12 Thread Yin Huai
Hi Guys, It seems Jenkins is down or very slow? Does anyone else experience this, or is it just me? Thanks, Yin

Re: Seems jenkins is down (or very slow)?

2015-11-12 Thread Fengdong Yu
I can access it directly from China > On Nov 13, 2015, at 10:28 AM, Ted Yu wrote: > > I was able to access the following where response was fast: > > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN >

Re: RE: A proposal for Spark 2.0

2015-11-12 Thread Guoqiang Li
Yes, I agree with Nan Zhu. I recommend these projects: https://github.com/dmlc/ps-lite (Apache License 2) https://github.com/Microsoft/multiverso (MIT License) Alexander, you may also be interested in the demo (graph on Parameter Server)

Re: A proposal for Spark 2.0

2015-11-12 Thread Nan Zhu
Specifically regarding the Parameter Server, I think the current agreement is that PS should exist as a third-party library instead of a component of the core code base, isn't it? Best, -- Nan Zhu http://codingcat.me On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote: > Who has the idea

RE: A proposal for Spark 2.0

2015-11-12 Thread Ulanov, Alexander
The Parameter Server is a new feature and thus does not match the goal of 2.0, which is "to fix things that are broken in the current API and remove certain deprecated APIs". At the same time, I would be happy to have that feature. With regard to machine learning, it would be great to move useful features

Re: A proposal for Spark 2.0

2015-11-12 Thread Nicholas Chammas
With regard to machine learning, it would be great to move useful features from MLlib to ML and deprecate the former; the current structure of two separate machine learning packages seems somewhat confusing. With regard to GraphX, it would be great to deprecate the use of RDD in GraphX and

Re: Proposal for SQL join optimization

2015-11-12 Thread Zhan Zhang
Hi Xiao, Performance-wise, without manual tuning the query cannot finish, while with tuning it finishes in minutes on TPC-H 100GB data. I have created https://issues.apache.org/jira/browse/SPARK-11704 and https://issues.apache.org/jira/browse/SPARK-11705 for these two

Re: Support for local disk columnar storage for DataFrames

2015-11-12 Thread Cristian O
Sorry, apparently only replied to Reynold, meant to copy the list as well, so I'm self replying and taking the opportunity to illustrate with an example. Basically I want to conceptually do this: val bigDf = sqlContext.sparkContext.parallelize((1 to 100)).map(i => (i, 1)).toDF("k", "v") val