Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Koert Kuipers
dependencySet, but provided will mark the entire dependency tree as excluded. It is also possible to exclude jar by jar, but this is pretty error prone and messy. On Tue, Feb 25, 2014 at 2:45 PM, Koert Kuipers ko...@tresata.com wrote: yes in sbt assembly you can exclude jars (although i never had a need
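
A minimal build.sbt sketch of the two approaches described above, assuming the sbt-assembly plugin (artifact name and version are placeholders; key names vary slightly across sbt-assembly versions):

    // Marking a dependency as "provided" keeps it, and its whole dependency tree,
    // out of the assembly jar.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"

    // Excluding jar by jar is also possible, but error prone and messy.
    assemblyExcludedJars in assembly := {
      val cp = (fullClasspath in assembly).value
      cp.filter(_.data.getName.startsWith("unwanted-lib"))
    }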

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Koert Kuipers
associated with manual translation of dependency specs from one system to another, while still maintaining the things which are hard to translate (plugins). On Wed, Feb 26, 2014 at 7:17 AM, Koert Kuipers ko...@tresata.com wrote: We maintain in house spark build using sbt. We have

Re: [IMPORTANT] Github/jenkins migration

2014-02-26 Thread Koert Kuipers
github is not aware of the new repo being a base-fork, so its not easy to re-point pull requests. i am guessing it didnt get cloned from the incubator spark one? On Wed, Feb 26, 2014 at 5:56 PM, Patrick Wendell pwend...@gmail.com wrote: Sorry if this wasn't clear - If you are in the middle of

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-03-11 Thread Koert Kuipers
we have a maven corporate repository inhouse and of course we also use maven central. sbt can handle retrieving from and publishing to maven repositories just fine. we have maven, ant/ivy and sbt projects depending on each others artifacts. not sure i see the issue there. On Tue, Mar 11, 2014 at
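
A build.sbt sketch of that setup; the repository URL and credentials path are placeholders:

    // resolve in-house artifacts from the corporate Maven repository
    resolvers += "corp-releases" at "https://repo.example.com/maven/releases"

    // publish sbt-built artifacts back to the same Maven repository
    publishMavenStyle := true
    publishTo := Some("corp-releases" at "https://repo.example.com/maven/releases")
    credentials += Credentials(Path.userHome / ".sbt" / ".credentials")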

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-03-11 Thread Koert Kuipers
Asm is such a mess. And their suggested solution, that everyone should shade it, sounds pretty awful to me (it is not uncommon to end up with asm shaded 15 times in a single project). But I guess you are right that shading is the only way to deal with it at this point... On Mar 11, 2014 5:35 PM, Kevin Markey

Re: [re-cont] map and flatMap

2014-03-17 Thread Koert Kuipers
/ - GitHub: https://github.com/andypetrella - Masterbranch: https://masterbranch.com/andy.petrella On Sat, Mar 15, 2014 at 7:06 PM, Koert Kuipers ko...@tresata.com wrote: just going head first without any thinking, it changed flatMap to flatMapData and added a flatMap. for FlatMappedRDD my

Re: Making RDDs Covariant

2014-03-22 Thread Koert Kuipers
i believe kryo serialization uses runtime class, not declared class we have no issues serializing covariant scala lists On Sat, Mar 22, 2014 at 11:59 AM, Pascal Voitot Dev pascal.voitot@gmail.com wrote: On Sat, Mar 22, 2014 at 3:45 PM, Michael Armbrust mich...@databricks.com wrote:

Re: Master compilation

2014-04-06 Thread Koert Kuipers
classes compiled with java7 run fine on java6 if you specified -target 1.6. however if thats the case you should generally also be able to compile it with java 6 just fine. something compiled with java7 with -target 1.7 will not run on java 6 On Sat, Apr 5, 2014 at 9:10 PM, Debasish
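
For reference, a build.sbt sketch of pinning the bytecode target so classes built on a JDK 7 machine still run on Java 6:

    // compile Java sources for a 1.6 runtime even when building with JDK 7
    javacOptions ++= Seq("-source", "1.6", "-target", "1.6")

    // ask scalac for JVM 1.6 bytecode as well
    scalacOptions += "-target:jvm-1.6"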

Re: Master compilation

2014-04-06 Thread Koert Kuipers
patrick, this has happened before, that a commit introduced java 7 code/dependencies and your build didnt fail, i think it was when reynold upgraded to jetty 9. must be that your entire build infrastructure runs java 7... On Sat, Apr 5, 2014 at 6:06 PM, Patrick Wendell pwend...@gmail.com wrote:

Re: Master compilation

2014-04-06 Thread Koert Kuipers
: java.lang.ClassNotFoundException: scala.None$ fun stuff! On Sun, Apr 6, 2014 at 12:13 PM, Koert Kuipers ko...@tresata.com wrote: patrick, this has happened before, that a commit introduced java 7 code/dependencies and your build didnt fail, i think it was when reynold upgraded to jetty 9. must be that your

Re: Master compilation

2014-04-06 Thread Koert Kuipers
i suggest we stick to 2.10.3, since otherwise it seems that (surprisingly) you force everyone to upgrade On Sun, Apr 6, 2014 at 1:46 PM, Koert Kuipers ko...@tresata.com wrote: also, i thought scala 2.10 was binary compatible, but that does not seem to be the case. the spark artifacts for scala

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Koert Kuipers
it all depends on what kind of traversing. if its point traversing then a random access based something would be great. if its more scan-like traversal then spark will fit On Tue, Apr 8, 2014 at 4:56 PM, Evan Chan e...@ooyala.com wrote: I doubt Titan would be able to give you traversal of

Re: Spark on Scala 2.11

2014-05-11 Thread Koert Kuipers
i believe matei has said before that he would like to crossbuild for 2.10 and 2.11, given that the difference is not as big as between 2.9 and 2.10. but dont know when this would happen... On Sat, May 10, 2014 at 11:02 PM, Gary Malouf malouf.g...@gmail.com wrote: Considering the team just
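
A build.sbt sketch of such a cross-build; the exact patch versions are placeholders:

    scalaVersion := "2.10.4"
    crossScalaVersions := Seq("2.10.4", "2.11.2")

    // "sbt +package" (or "+publish") then builds one artifact per listed Scala
    // version, each carrying the _2.10 / _2.11 suffix in its name.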

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-21 Thread Koert Kuipers
db tsai, i do not think userClassPathFirst is working, unless the classes you load dont reference any classes already loaded by the parent classloader (a mostly hypothetical situation)... i filed a jira for this here: https://issues.apache.org/jira/browse/SPARK-1863 On Tue, May 20, 2014 at 1:04
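
For context, a sketch of the switches being referred to. The property names changed across releases (spark.files.userClassPathFirst in early 1.x, per-driver/executor settings later), so treat these as illustrative rather than exact for any one version:

    import org.apache.spark.SparkConf

    // prefer classes from the user's jars over Spark's own copies
    val conf = new SparkConf()
      .set("spark.executor.userClassPathFirst", "true")
      .set("spark.driver.userClassPathFirst", "true")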

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Koert Kuipers
i suspect there are more cdh4 than cdh5 clusters. most people plan to move to cdh5 within say 6 months. On Fri, Aug 29, 2014 at 3:57 AM, Andrew Ash and...@andrewash.com wrote: FWIW we use CDH4 extensively and would very much appreciate having a prebuilt version of Spark for it. We're doing

Re: Dependency hell in Spark applications

2014-09-04 Thread Koert Kuipers
custom spark builds should not be the answer. at least not if spark ever wants to have a vibrant community for spark apps. spark does support a user-classpath-first option, which would deal with some of these issues, but I don't think it works. On Sep 4, 2014 9:01 AM, Felix Garcia Borrego

Re: Raise Java dependency from 6 to 7

2014-10-18 Thread Koert Kuipers
my experience is that there are still a lot of java 6 clusters out there. also distros that bundle spark still support java 6 On Oct 17, 2014 8:01 PM, Andrew Ash and...@andrewash.com wrote: Hi Spark devs, I've heard a few times that keeping support for Java 6 is a priority for Apache Spark.

scalastyle annoys me a little bit

2014-10-23 Thread Koert Kuipers
100 max width seems very restrictive to me. even in the most restrictive environment i have for development (ssh with emacs) i get a lot more characters to work with than that. personally i find the code harder to read, not easier. like i kept wondering why there are weird newlines in the middle of

Re: scalastyle annoys me a little bit

2014-10-23 Thread Koert Kuipers
cases where the current limit is useful (e.g. if you have many windows open in a large screen). - Patrick On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers ko...@tresata.com wrote: 100 max width seems very restrictive to me. even the most restrictive environment i have for development (ssh

Re: scalastyle annoys me a little bit

2014-10-23 Thread Koert Kuipers
${scalastyle.failonviolation}</failOnViolation> <includeTestSourceDirectory>false</includeTestSourceDirectory> <failOnWarning>false</failOnWarning> <sourceDirectory>${basedir}/src/main/scala</sourceDirectory> On Thu, Oct 23, 2014 at 12:07 PM, Koert Kuipers ko...@tresata.com wrote: Hey Ted

Re: scalastyle annoys me a little bit

2014-10-24 Thread Koert Kuipers
SKIPPED in this case i dont care about Hive, but i would have liked to see REPL run, and Kafka. On Thu, Oct 23, 2014 at 4:44 PM, Ted Yu yuzhih...@gmail.com wrote: Created SPARK-4066 and attached patch there. On Thu, Oct 23, 2014 at 1:07 PM, Koert Kuipers ko...@tresata.com

Re: scalastyle annoys me a little bit

2014-10-24 Thread Koert Kuipers
oh i found some stuff about tests and how to continue them, gonna try that now (-fae switch). should have googled before asking... On Fri, Oct 24, 2014 at 3:59 PM, Koert Kuipers ko...@tresata.com wrote: thanks ted. apologies for complaining about maven here again, but this is the first time

Re: scalastyle annoys me a little bit

2014-10-24 Thread Koert Kuipers
separated) list you provide to -pl. Also before using -pl you should do a mvn compile package install on all modules. Use the -pl after those steps are done - and then it is very effective. 2014-10-24 13:08 GMT-07:00 Sean Owen so...@cloudera.com: On Fri, Oct 24, 2014 at 8:59 PM, Koert

Re: best IDE for scala + spark development?

2014-10-27 Thread Koert Kuipers
editor of your choice + sbt console + grep works great. if only folks stopped using wildcard imports (they have little benefit in terms of coding yet require an IDE with 1G+ of ram to track them down). On Mon, Oct 27, 2014 at 9:17 AM, andy petrella andy.petre...@gmail.com wrote: I second the

spark kafka batch integration

2014-12-14 Thread Koert Kuipers
hello all, we at tresata wrote a library to provide for batch integration between spark and kafka (distributed write of rdd to kafka, distributed read of rdd from kafka). our main use cases are (in lambda architecture jargon): * periodic appends to the immutable master dataset on hdfs from kafka

Re: Which committers care about Kafka?

2014-12-19 Thread Koert Kuipers
yup, we at tresata do the idempotent store the same way. very simple approach. On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger c...@koeninger.org wrote: That KafkaRDD code is dead simple. Given a user specified map (topic1, partition0) - (startingOffset, endingOffset) (topic1, partition1)

Re: Contribution in java

2014-12-20 Thread Koert Kuipers
yes it does. although the core of spark is written in scala it also maintains java and python apis, and there is plenty of work for those to contribute to. On Sat, Dec 20, 2014 at 7:30 AM, sreenivas putta putta.sreeni...@gmail.com wrote: Hi, I want to contribute for spark in java. Does it

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Koert Kuipers
interfaces are both outside catalyst package and in org.apache.spark.sql. On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers ko...@tresata.com wrote: hey matei, i think that stuff such as SchemaRDD, columar storage and perhaps also query planning can be re-used by many systems that do analysis

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Koert Kuipers
The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. i agree. this to me also implies it belongs in spark core, not sql On Mon, Jan 26, 2015 at 6:11 PM,

Re: renaming SchemaRDD -> DataFrame

2015-02-10 Thread Koert Kuipers
useless. On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers ko...@tresata.com wrote: so i understand the success of spark.sql. besides the fact that anything with the word SQL in its name will have thousands of developers running towards it because of the familiarity, there is also a genuine

Re: renaming SchemaRDD -> DataFrame

2015-02-10 Thread Koert Kuipers
in an efficient columnar format. And you can also easily persist it on disk using Parquet, which is also columnar. Cheng On 1/29/15 1:24 PM, Koert Kuipers wrote: to me the word DataFrame does come with certain expectations. one of them is that the data is stored columnar. in R
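
A minimal sketch of both points, using the DataFrame writer API present from Spark 1.4 on (earlier releases used saveAsParquetFile); the output path is a placeholder:

    df.cache()                              // in-memory columnar storage
    df.write.parquet("/data/out.parquet")   // columnar on-disk format (Parquet)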

Re: hadoop input/output format advanced control

2015-03-24 Thread Koert Kuipers
thread, Koert) On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com wrote: see email below. reynold suggested i send it to dev instead of user -- Forwarded message -- From: Koert Kuipers ko...@tresata.com Date: Mon, Mar 23, 2015 at 4:36 PM

Re: hadoop input/output format advanced control

2015-03-25 Thread Koert Kuipers
://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894 It seems fine to have the same option for the loading functions, if it's easy to just pass this config into the input format. On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers ko...@tresata.com wrote
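
A sketch of what per-input configuration can look like on the read side: newAPIHadoopFile accepts a Hadoop Configuration, so settings can be scoped to one input instead of mutating sc.hadoopConfiguration globally. The property and path below are just examples:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // copy the global config and override only for this one input
    val jobConf = new Configuration(sc.hadoopConfiguration)
    jobConf.set("mapreduce.input.fileinputformat.split.maxsize", "134217728")

    val lines = sc.newAPIHadoopFile(
      "/path/to/input",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      jobConf)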

Fwd: hadoop input/output format advanced control

2015-03-23 Thread Koert Kuipers
see email below. reynold suggested i send it to dev instead of user -- Forwarded message -- From: Koert Kuipers ko...@tresata.com Date: Mon, Mar 23, 2015 at 4:36 PM Subject: hadoop input/output format advanced control To: u...@spark.apache.org u...@spark.apache.org currently its

Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Koert Kuipers
this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library. Matei On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com wrote: The context is that SchemaRDD is becoming a common

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Koert Kuipers
i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Koert Kuipers
, Reynold Xin r...@databricks.com wrote: Guys thanks for chiming in, but please focus on Java here. Python is an entirely separate issue. On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote: i am not sure eol means much if it is still actively used. we have a lot of clients

Re: [discuss] ending support for Java 6?

2015-05-01 Thread Koert Kuipers
it seems spark is happy to upgrade scala, drop older java versions, upgrade incompatible library versions (akka), and all of this within spark 1.x does the 1.x mean anything in terms of compatibility of dependencies? or is that limited to its own api? what are the rules? On May 1, 2015 9:04 AM,

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Koert Kuipers
i think i might be misunderstanding, but shouldnt java 6 currently be used in jenkins? On Sat, May 2, 2015 at 11:53 PM, shane knapp skn...@berkeley.edu wrote: that's kinda what we're doing right now, java 7 is the default/standard on our jenkins. or, i vote we buy a butler's outfit for

Re: Change for submitting to yarn in 1.3.1

2015-05-21 Thread Koert Kuipers
we also launch jobs programmatically, both on standalone mode and yarn-client mode. in standalone mode it always worked, in yarn-client mode we ran into some issues and were forced to use spark-submit, but i still have on my todo list to move back to a normal java launch without spark-submit at

Re: FrequentItems in spark-sql-execution-stat

2015-07-31 Thread Koert Kuipers
this looks like a mistake in FrequentItems to me. if the map is full (map.size==size) then it should still add the new item (after removing items from the map and decrementing counts). if its not a mistake then at least it looks to me like the algo is different than described in the paper. is
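
To make the point concrete, a toy sketch of the update step as described above (add the new item first, then decrement everything when the map overflows); this is an illustration, not Spark's actual FrequentItems code:

    import scala.collection.mutable

    class FreqItems[T](size: Int) {
      private val counts = mutable.Map.empty[T, Long]

      def add(item: T): Unit = {
        counts(item) = counts.getOrElse(item, 0L) + 1   // new item is always added
        if (counts.size > size) {
          // map overflowed: decrement every count and evict the ones that hit zero
          counts.keys.toList.foreach { k =>
            counts(k) -= 1
            if (counts(k) <= 0L) counts.remove(k)
          }
        }
      }

      def items: Set[T] = counts.keySet.toSet
    }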

Re: A proposal for Spark 2.0

2015-11-11 Thread Koert Kuipers
i would drop scala 2.10, but definitely keep java 7. a cross build for scala 2.12 is great, but i dont know how that works with the java 8 requirement. dont want to make java 8 mandatory. and probably stating the obvious, but a lot of apis got polluted due to the binary compatibility requirement. cleaning

Re: A proposal for Spark 2.0

2015-11-11 Thread Koert Kuipers
good point about dropping <2.2 for hadoop. you dont want to deal with protobuf 2.4 for example On Wed, Nov 11, 2015 at 4:58 AM, Sean Owen wrote: > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote: > > to the Spark community. A major release should

Re: Ready to talk about Spark 2.0?

2015-11-08 Thread Koert Kuipers
romi, unless am i misunderstanding your suggestion you might be interested in projects like the new mahout where they try to abstract out the engine with bindings, so that they can support multiple engines within a single platform. I guess cascading is heading in a similar direction (although no

Re: State of the Build

2015-11-05 Thread Koert Kuipers
People who do upstream builds of spark (think bigtop and hadoop distros) are used to legacy systems like maven, so maven is the default build. I don't think it will change. Any improvements for the sbt build are of course welcome (it is still used by many developers), but i would not do anything

Re: Master build fails ?

2015-11-06 Thread Koert Kuipers
if there is no strong preference for one dependencies policy over another, but consistency between the 2 systems is desired, then i believe maven can be made to behave like ivy pretty easily with a setting in the pom On Fri, Nov 6, 2015 at 5:21 AM, Steve Loughran wrote:

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Koert Kuipers
if DataFrame aspires to be more than a vehicle for SQL then i think it would be mistake to allow multiple column names. it is very confusing. pandas indeed allows this and it has led to many bugs. R does not allow it for data.frame (it renames the name dupes). i would consider a csv with
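
A small example of the confusion, assuming a DataFrame df that has columns x and y:

    import org.apache.spark.sql.functions.col

    val dupes = df.select(col("x").as("a"), col("y").as("a"))
    // dupes now has two columns named "a"; a later dupes.select("a") has no
    // unambiguous meaning and typically fails with an "ambiguous reference" error.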

Re: Pyspark dataframe read

2015-10-06 Thread Koert Kuipers
i ran into the same thing in scala api. we depend heavily on comma separated paths, and it no longer works. On Tue, Oct 6, 2015 at 3:02 AM, Blaž Šnuderl wrote: > Hello everyone. > > It seems pyspark dataframe read is broken for reading multiple files. > > sql.read.json(
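
For illustration, the two styles involved; whether the single comma-separated string is split into separate paths is exactly the behavior that changed, while the vararg form (available on DataFrameReader in later versions) side-steps it. Paths are placeholders:

    // one string containing comma separated paths
    val a = sqlContext.read.json("/data/day1.json,/data/day2.json")

    // separate path arguments
    val b = sqlContext.read.json("/data/day1.json", "/data/day2.json")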

Re: Pyspark dataframe read

2015-10-06 Thread Koert Kuipers
; >> Could someone please file a JIRA to track this? >> https://issues.apache.org/jira/browse/SPARK >> >> On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers <ko...@tresata.com> wrote: >> >>> i ran into the same thing in scala api. we depend heavily on comma >&g

Re: A proposal for Spark 2.0

2015-12-03 Thread Koert Kuipers
spark 1.x has been supporting scala 2.11 for 3 or 4 releases now. seems to me you already provide a clear upgrade path: get on scala 2.11 before upgrading to spark 2.x. from the scala team when scala 2.10.6 came out: We strongly encourage you to upgrade to the latest stable version of Scala 2.11.x, as

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
rhel/centos 6 ships with python 2.6, doesnt it? if so, i still know plenty of large companies where python 2.6 is the only option. asking them for python 2.7 is not going to work so i think its a bad idea On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland wrote: > I

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
access). Does this address the Python versioning concerns for RHEL users? > > On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> yeah, the practical concern is that we have no control over java or >> python version on large company clusters. our curr

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
>> >> I've been in a couple of projects using Spark (banking industry) where >> CentOS + Python 2.6 is the toolbox available. >> >> That said, I believe it should not be a concern for Spark. Python 2.6 is >> old and busted, which is totally opposite to the Spark ph

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
e, Jan 5, 2016 at 3:05 PM, Nicholas Chammas < >> nicholas.cham...@gmail.com> wrote: >> >>> I think all the slaves need the same (or a compatible) version of Python >>> installed since they run Python code in PySpark jobs natively. >>> >>> On Tue, Jan

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
d > version without making your changes open source. The GPL-compatible > licenses make it possible to combine Python with other software that is > released under the GPL; the others don’t. > > Nick > ​ > > On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
if python 2.7 only has to be present on the node that launches the app (does it?) then that could be important indeed. On Tue, Jan 5, 2016 at 6:02 PM, Koert Kuipers <ko...@tresata.com> wrote: > interesting i didnt know that! > > On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas

Re: A proposal for Spark 2.0

2015-11-26 Thread Koert Kuipers
I also thought the idea was to drop 2.10. Do we want to cross build for 3 scala versions? On Nov 25, 2015 3:54 AM, "Sandy Ryza" wrote: > I see. My concern is / was that cluster operators will be reluctant to > upgrade to 2.0, meaning that developers using those clusters

Re: Subtract implementation using broadcast

2015-11-28 Thread Koert Kuipers
if i wanted to pimp DataFrame to add subtract and intersect myself with a physical operator, without needing to modify spark directly, is that currently possible/intended? or will i run into the private[spark] issue? On Fri, Nov 27, 2015 at 7:36 PM, Reynold Xin wrote: > We
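
The enrichment itself is plain Scala and needs no changes to Spark; a sketch using only public operators is below (wiring in a custom physical operator is where the private[spark]/private[sql] walls come in):

    import org.apache.spark.sql.DataFrame

    object DataFrameOps {
      // "pimp my library": extra methods on DataFrame, delegating to public operators
      implicit class RichDataFrame(df: DataFrame) {
        def subtractRows(other: DataFrame): DataFrame = df.except(other)
        def intersectRows(other: DataFrame): DataFrame = df.intersect(other)
      }
    }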

Re: NegativeArraySizeException / segfault

2016-06-08 Thread Koert Kuipers
y > Analyzer it looks very much like a UTF8String is very corrupt. > > Cheers, > > > On Fri, 27 May 2016 at 21:00 Koert Kuipers <ko...@tresata.com> wrote: > >> hello all, >> after getting our unit tests to pass on spark 2.0.0-SNAPSHOT we are now >&

Re: feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
>> Cheng >> >> On 5/25/16 12:30 PM, Reynold Xin wrote: >> >> Based on this discussion I'm thinking we should deprecate the two explode >> functions. >> >> On Wednesday, May 25, 2016, Koert Kuipers <ko...@tresata.com>

NegativeArraySizeException / segfault

2016-05-27 Thread Koert Kuipers
hello all, after getting our unit tests to pass on spark 2.0.0-SNAPSHOT we are now trying to run some algorithms at scale on our cluster. unfortunately this means that when i see errors i am having a harder time boiling it down to a small reproducible example. today we are running an iterative

changed behavior for csv datasource and quoting in spark 2.0.0-SNAPSHOT

2016-05-26 Thread Koert Kuipers
in spark 1.6.1 we used: sqlContext.read .format("com.databricks.spark.csv") .delimiter("~") .option("quote", null) this effectively turned off quoting, which is a necessity for certain data formats where quoting is not supported and "\"" is a valid character itself in the data.
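
Roughly what the two versions look like side by side. The 2.x value is the commonly suggested workaround (point the quote option at a character that never occurs in the data); verify it against the CSV options of the exact version in use. Paths are placeholders:

    // Spark 1.6 + spark-csv: quoting switched off entirely
    val df16 = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "~")
      .option("quote", null)
      .load("/path/to/data")

    // Spark 2.x built-in csv source: no explicit "off" switch, so use an
    // unused character such as \u0000 as the quote character
    val df20 = spark.read
      .option("delimiter", "~")
      .option("quote", "\u0000")
      .csv("/path/to/data")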

Re: changed behavior for csv datasource and quoting in spark 2.0.0-SNAPSHOT

2016-05-26 Thread Koert Kuipers
n API), but that's probably OK > given they shouldn't change all the time. > > Ticket https://issues.apache.org/jira/browse/SPARK-15585 > > > > > On Thu, May 26, 2016 at 3:35 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> in spark 1.6.1 we us

Re: NegativeArraySizeException / segfault

2016-05-27 Thread Koert Kuipers
cannot just send it over. i will try to create a small test program to reproduce it. On Fri, May 27, 2016 at 4:25 PM, Reynold Xin <r...@databricks.com> wrote: > They should get printed if you turn on debug level logging. > > On Fri, May 27, 2016 at 1:00 PM, Koert Kuipers <ko...@

SPARK-15982 breaks external DataSources

2016-06-27 Thread Koert Kuipers
hey, since SPARK-15982 was fixed (https://github.com/apache/spark/pull/13727) i believe all external DataSources that rely on using .load(path) without being a FileFormat themselves are broken. i noticed this because our unit tests for the elasticsearch datasource broke. i commented on the

Re: Spark 2.0.0 release plan

2016-01-26 Thread Koert Kuipers
y or so instead informally in > conversation. Does anyone have a particularly strong opinion on that? > That's basically an extra 3 month period. > > https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage > > On Tue, Jan 26, 2016 at 10:00 PM, Koert Kuipers <ko...@tresata.com>

some joins stopped working with spark 2.0.0 SNAPSHOT

2016-02-26 Thread Koert Kuipers
dataframe df1: schema: StructType(StructField(x,IntegerType,true)) explain: == Physical Plan == MapPartitions , obj#135: object, [if (input[0, object].isNullAt) null else input[0, object].get AS x#128] +- MapPartitions , createexternalrow(if (isnull(x#9)) null else x#9), [input[0, object] AS

Re: some joins stopped working with spark 2.0.0 SNAPSHOT

2016-02-27 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-13531 On Sat, Feb 27, 2016 at 3:49 AM, Reynold Xin <r...@databricks.com> wrote: > Can you file a JIRA ticket? > > > On Friday, February 26, 2016, Koert Kuipers <ko...@tresata.com> wrote: > >> dataframe df1: >&

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Koert Kuipers
since a type alias is purely a convenience thing for the scala compiler, does option 1 mean that the concept of DataFrame ceases to exist from a java perspective, and they will have to refer to Dataset? On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin wrote: > When we first

Dataset in spark 2.0.0-SNAPSHOT missing columns

2016-02-15 Thread Koert Kuipers
i noticed some things stopped working on datasets in spark 2.0.0-SNAPSHOT, and with a confusing error message (cannot resolve some column with input columns []). for example in 1.6.0-SNAPSHOT: scala> val ds = sc.parallelize(1 to 10).toDS ds: org.apache.spark.sql.Dataset[Int] = [value: int]

Re: Dataset in spark 2.0.0-SNAPSHOT missing columns

2016-02-15 Thread Koert Kuipers
com> wrote: > Looks like a bug. I'm also not sure whether we support Option yet. (If > not, we should definitely support that in 2.0.) > > Can you file a JIRA ticket? > > > On Mon, Feb 15, 2016 at 7:12 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> i notic

Re: spark 2.0 logging binary incompatibility

2016-03-15 Thread Koert Kuipers
outside of Spark isn't supposed to use > it. Mixing Spark library versions is also not recommended, not just > because of this reason. > > There have been other binary changes in the Logging class in the past too. > > On Tue, Mar 15, 2016 at 7:49 AM, Koert Kuipers <ko...@tresata.

Re: spark 2.0 logging binary incompatibility

2016-03-15 Thread Koert Kuipers
oh i just noticed the big warning in spark 1.x Logging * NOTE: DO NOT USE this class outside of Spark. It is intended as an internal utility. * This will likely be changed or removed in future releases. On Tue, Mar 15, 2016 at 3:29 PM, Koert Kuipers <ko...@tresata.com> wrote: &

SparkConf constructor now private

2016-03-15 Thread Koert Kuipers
in this commit 8301fadd8d269da11e72870b7a889596e3337839 Author: Marcelo Vanzin Date: Mon Mar 14 14:27:33 2016 -0700 [SPARK-13626][CORE] Avoid duplicate config deprecation warnings. the following change was made -class SparkConf(loadDefaults: Boolean) extends Cloneable

spark 2.0 logging binary incompatibility

2016-03-15 Thread Koert Kuipers
i have been using spark 2.0 snapshots with some libraries build for spark 1.0 so far (simply because it worked). in last few days i noticed this new error: [error] Uncaught exception when running com.tresata.spark.sql.fieldsapi.FieldsApiSpec: java.lang.AbstractMethodError sbt.ForkMain$ForkError:

question about catalyst and TreeNode

2016-03-15 Thread Koert Kuipers
i am trying to understand some parts of the catalyst optimizer. but i struggle with one bigger picture issue: LogicalPlan extends TreeNode, which makes sense since the optimizations rely on tree transformations like transformUp and transformDown. but how can a LogicalPlan be a tree? isnt it
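
A toy version of the tree idea, to make the question concrete: each plan operator holds its child operators, so the operator structure of a query is itself the tree that transformUp/transformDown walk. This only illustrates the shape of the API, not Catalyst's actual classes:

    sealed trait Plan {
      def children: Seq[Plan]
      def mapChildren(f: Plan => Plan): Plan

      // rewrite bottom-up: transform the children first, then this node
      def transformUp(rule: PartialFunction[Plan, Plan]): Plan = {
        val rewritten = mapChildren(_.transformUp(rule))
        rule.applyOrElse(rewritten, identity[Plan])
      }
    }

    case class Scan(table: String) extends Plan {
      def children = Nil
      def mapChildren(f: Plan => Plan) = this
    }

    case class Filter(condition: String, child: Plan) extends Plan {
      def children = Seq(child)
      def mapChildren(f: Plan => Plan) = copy(child = f(child))
    }

    // example rule in the spirit of Catalyst's CombineFilters: collapse adjacent filters
    val plan = Filter("x > 0", Filter("y > 0", Scan("t")))
    val combined = plan.transformUp {
      case Filter(c1, Filter(c2, child)) => Filter(s"($c1) AND ($c2)", child)
    }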

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-29 Thread Koert Kuipers
if scala prior to 2.10.4 didn't support java 8, does that mean that 3rd party scala libraries compiled with a scala version < 2.10.4 might not work on java 8? On Mon, Mar 28, 2016 at 7:06 PM, Kostas Sakellis wrote: > Also, +1 on dropping jdk7 in Spark 2.0. > > Kostas >

spark 2.0 snapshot change in RowEncoder behavior

2016-03-23 Thread Koert Kuipers
one of our unit tests broke with changes in spark 2.0 snapshot in the last few days (or maybe i simply missed it for longer). i think it boils down to this: val df1 = sc.makeRDD(1 to 3).toDF val df2 = df1.map(row => Row(row(0).asInstanceOf[Int] + 1))(RowEncoder(df1.schema)) println(s"schema before

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Koert Kuipers
i think the arguments are convincing, but it also makes me wonder if i live in some kind of alternate universe... we deploy on customers' clusters, where the OS, python version, java version and hadoop distro are not chosen by us. so think centos 6, cdh5 or hdp 2.3, java 7 and python 2.6. we simply

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Koert Kuipers
side. > > On Thu, Mar 24, 2016 at 4:27 PM, Koert Kuipers <ko...@tresata.com> wrote: > > i think the arguments are convincing, but it also makes me wonder if i > live > > in some kind of alternate universe... we deploy on customers clusters, > where > > the OS, pytho

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Koert Kuipers
i guess what i am saying is that in a yarn world the only hard restrictions left are the containers you run in, which means the hadoop version, java version and python version (if you use python). On Thu, Mar 24, 2016 at 12:39 PM, Koert Kuipers <ko...@tresata.com> wrote: >

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-25 Thread Koert Kuipers
mpatibility wrt Java) >> Was there a proposal which did not go through ? Not sure if I missed it. >> >> Regards >> Mridul >> >> >> On Thursday, March 24, 2016, Koert Kuipers <ko...@tresata.com> wrote: >> >>> i think that logic is reasonab

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Koert Kuipers
the good news is that, from a shared infrastructure perspective, most places have zero scala, so the upgrade is actually very easy. i can see how it would be different for say twitter On Thu, Mar 24, 2016 at 7:50 PM, Reynold Xin wrote: > If you want to go down that

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Koert Kuipers
> On Thursday, March 24, 2016, Koert Kuipers <ko...@tresata.com> wrote: > >> i guess what i am saying is that in a yarn world the only hard >> restrictions left are the the containers you run in, which means the hadoop >> version, java version and python version (if you u

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Koert Kuipers
i think that logic is reasonable, but then the same should also apply to scala 2.10, which is also unmaintained/unsupported at this point (basically has been since march 2015 except for one hotfix due to a license incompatibility) who wants to support scala 2.10 three years after they did the

Re: Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Koert Kuipers
we are not, but it seems reasonable to me that a user has the ability to implement their own serializer. can you refactor and break compatibility, but not make it private? On Mon, Mar 7, 2016 at 9:57 PM, Josh Rosen wrote: > Does anyone implement Spark's serializer

Re: Build changes after SPARK-13579

2016-04-04 Thread Koert Kuipers
do i need to run sbt package before doing tests? On Mon, Apr 4, 2016 at 11:00 PM, Marcelo Vanzin wrote: > Hey all, > > We merged SPARK-13579 today, and if you're like me and have your > hands automatically type "sbt assembly" anytime you're building Spark, > that won't

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Koert Kuipers
rectly propagated to all nodes? Are they identical? > Yes; these files are stored on a shared memory directory accessible to > all nodes. > > Koert Kuipers: > > we ran into similar issues and it seems related to the new memory > > management. can you try: > > spa

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Koert Kuipers
about that pro, i think it's more the opposite: many libraries have stopped maintaining scala 2.10 versions. bugs will no longer be fixed for scala 2.10 and new libraries will not be available for scala 2.10 at all, making them unusable in spark. take for example akka, a distributed messaging

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Koert Kuipers
Spark still runs on akka. So if you want the benefits of the latest akka (not saying we do, was just an example) then you need to drop scala 2.10 On Mar 30, 2016 10:44 AM, "Cody Koeninger" wrote: > I agree with Mark in that I don't see how supporting scala 2.10 for > spark

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Koert Kuipers
Wed, Mar 30, 2016 at 9:10 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> Spark still runs on akka. So if you want the benefits of the latest akka >> (not saying we do, was just an example) then you need to drop scala 2.10 >> On Mar 30, 2016 10:44 AM, "Cody Koe

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-01 Thread Koert Kuipers
tayed on an >> old Scala version for multiple years because switching it, or mixing >> versions, would affect the company's entire codebase. >> >> Matei >> >> On Mar 30, 2016, at 12:08 PM, Koert Kuipers <ko...@tresata.com> wrote: >> >> oh

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Koert Kuipers
we ran into similar issues and it seems related to the new memory management. can you try: spark.memory.useLegacyMode = true On Mon, Apr 4, 2016 at 9:12 AM, Mike Hynes <91m...@gmail.com> wrote: > [ CC'ing dev list since nearly identical questions have occurred in > user list recently w/o
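
For completeness, the same setting expressed in code (it can equally be passed as --conf on spark-submit); spark.memory.useLegacyMode exists as of Spark 1.6 and falls back to the pre-1.6 static memory manager:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.useLegacyMode", "true")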

Re: right outer joins on Datasets

2016-05-24 Thread Koert Kuipers
got it, but i assume thats an internal implementation detail, and it should show null not -1? On Tue, May 24, 2016 at 3:10 AM, Zhan Zhang wrote: > The reason for "-1" is that the default value for Integer is -1 if the > value > is null > > def defaultValue(jt: String):

ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-05-24 Thread Koert Kuipers
hello, as we continue to test spark 2.0 SNAPSHOT in-house we ran into the following trying to port an existing application from spark 1.6.1 to spark 2.0.0-SNAPSHOT. given this code: case class Test(a: Int, b: String) val rdd = sc.parallelize(List(Row(List(Test(5, "ha"), Test(6, "ba") val

Re: ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-05-24 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-15507 On Tue, May 24, 2016 at 12:21 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Please log a JIRA. > > Thanks > > On Tue, May 24, 2016 at 8:33 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> hello, >> as we co

CompileException for spark-sql generated code in 2.0.0-SNAPSHOT

2016-05-17 Thread Koert Kuipers
hello all, we are slowly expanding our test coverage for spark 2.0.0-SNAPSHOT to more in-house projects. today i ran into this issue... this runs fine: val df = sc.parallelize(List(("1", "2"), ("3", "4"))).toDF("a", "b") df .map(row => row)(RowEncoder(df.schema)) .select("a", "b") .show

Re: CompileException for spark-sql generated code in 2.0.0-SNAPSHOT

2016-05-18 Thread Koert Kuipers
databricks.com> wrote: > >> It seems like the problem here is that we are not using unique names >> for mapelements_isNull? >> >> >> >> On Tue, May 17, 2016 at 3:29 PM, Koert Kuipers <ko...@tresata.com> wrote: >> >>> hello all, we are slowl

SQLContext and "stable identifier required"

2016-05-03 Thread Koert Kuipers
with the introduction of SparkSession SQLContext changed from being a lazy val to a def. however this is troublesome if you want to do: import someDataset.sqlContext.implicits._ because it is no longer a stable identifier, i think? i get: stable identifier required, but
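
A sketch of the usual workaround: bind the context to a val first, so the import has a stable identifier to hang off (someDataset stands in for any existing Dataset):

    // a val (unlike a def) is a stable identifier, so the import compiles
    val stableSqlContext = someDataset.sqlContext
    import stableSqlContext.implicits._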

Re: spark 2 segfault

2016-05-02 Thread Koert Kuipers
>> org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1861) >> at >> org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1860) >> at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2438) >> at org.apache.spark.sql.Dataset.head(Dataset.scala:1860) >>
