Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread James Baker
…by plenty I'm sure :) (and would make my implementation more straightforward - the state management is painful atm). James On Wed, 30 Aug 2017 at 14:56 Reynold Xin <r...@databricks.com> wrote: Sure that's good to do (and as discussed earlier a good comp…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread James Baker
…personal slant is that it's more important to improve support for other datastores than it is to lower the barrier to entry - this is why I've been pushing here. James On Wed, 30 Aug 2017 at 09:37 Ryan Blue <rb...@netflix.com> wrote: -1 (non-binding) Someti…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-29 Thread James Baker
…ch out something here if that'd be useful? James On Tue, 29 Aug 2017 at 18:59 Wenchen Fan <cloud0...@gmail.com> wrote: Hi James, Thanks for your feedback! I think your concerns are all valid, but we need to make a tradeoff here. > Explicitly h…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-29 Thread James Baker
…ava class structure works, but otherwise I can just throw). James On Tue, 29 Aug 2017 at 02:56 Reynold Xin <r...@databricks.com> wrote: James, Thanks for the comment. I think you just pointed out a trade-off between expressiveness and API simplicity…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread James Baker
…supported pushdown stuff, and then the user can transform and return it. I think this ends up being a more elegant API for consumers, and also far more intuitive. James On Mon, 28 Aug 2017 at 18:00 蒋星博 <jiangxb1...@gmail.com> wrote: +1 (Non-bindi…

RE: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-15 Thread james
-1 This bug, SPARK-16515, in Spark 2.0 breaks our cases, which run on 1.6. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC4-tp18317p18341.html

How Spark SQL correctly connect hive metastore database with Spark 2.0 ?

2016-05-12 Thread james
Hi Spark guys, I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e), but I ran across an issue: it always connects to a local Derby database and can't connect to my existing Hive metastore database. Could you help me to check what's the…

Re: dataframe udf functioin will be executed twice when filter on new column created by withColumn

2016-05-11 Thread James Hammerton
This may be related to: https://issues.apache.org/jira/browse/SPARK-13773 Regards, James On 11 May 2016 at 15:49, Ted Yu <yuzhih...@gmail.com> wrote: > In master branch, behavior is the same. > > Suggest opening a JIRA if you haven't done so. > > On Wed, May 11, 2016…
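A minimal sketch of the reported pattern (the column and function names are illustrative, not from the thread), runnable in spark-shell of that era, where sqlContext is predefined. A side-effecting UDF makes it visible when the filter on a withColumn-derived column re-evaluates the UDF:
```
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

// A side-effecting UDF makes any double evaluation visible.
val plusOne = udf((x: Long) => { println(s"udf($x)"); x + 1 })

val df = sqlContext.range(0, 3).toDF("x")
val out = df.withColumn("y", plusOne($"x")).filter($"y" > 1)
out.count() // on affected versions, "udf(...)" prints twice per row
```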

Re: java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-03-22 Thread james
I guess a different workload causes the different result? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-OutOfMemoryError-Unable-to-acquire-bytes-of-memory-tp16773p16789.html

Re: java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-03-22 Thread james
Hi, I also found the 'Unable to acquire memory' issue using Spark 1.6.1 with dynamic allocation on YARN. My case happened when setting spark.sql.shuffle.partitions larger than 200. From the error stack, it differs from the issue reported by Nezih, and I'm not sure whether they share the same root cause. Thanks James
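For context, a sketch of the configuration described above (values are illustrative; this reproduces the setup, it is not a fix):
```
import org.apache.spark.SparkConf

// Dynamic allocation on YARN with spark.sql.shuffle.partitions raised
// above its 200 default, per the report.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // required by dynamic allocation
  .set("spark.sql.shuffle.partitions", "400")   // > 200 reportedly triggers the error
```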

Re: ORC file writing hangs in pyspark

2016-02-24 Thread James Barney
…Thank you again for the suggestions. On Tue, Feb 23, 2016 at 9:28 PM, Zhan Zhang <zzh...@hortonworks.com> wrote: > Hi James, > > You can try to write with another format, e.g., Parquet, to see whether it is > an ORC-specific issue or a more generic one. > > Thanks. > > Zhan Z…

ORC file writing hangs in pyspark

2016-02-23 Thread James Barney
I'm trying to write an ORC file after running the FPGrowth algorithm on a dataset of just around 2 GB in size. The algorithm performs well and can display results if I take(n) the freqItemsets() of the result after converting it to a DF. I'm using Spark 1.5.2 on HDP 2.3.4 and Python 3.4.2 on…
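The original runs under PySpark; a rough Scala sketch of the same flow (paths and parameters are hypothetical), assuming MLlib's FPGrowth and a Hive-enabled build for ORC:
```
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.sql.hive.HiveContext

// Assumes an existing SparkContext `sc`; each input line is one transaction.
val transactions = sc.textFile("hdfs:///data/transactions.txt").map(_.split(" "))

val model = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
  .run(transactions)

// Convert the frequent itemsets to a DataFrame and write it as ORC.
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val freqDF = model.freqItemsets
  .map(is => (is.items.mkString(","), is.freq))
  .toDF("items", "freq")
freqDF.write.format("orc").save("hdfs:///out/freq_itemsets")
```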

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-28 Thread james
+1 1) Build binary instruction: ./make-distribution.sh --tgz --skip-java-test -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests 2) Run Spark SQL with YARN client mode. This 1.5.1 RC1 package has better test results than the previous 1.5.0, except for…

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-07 Thread james
Adding a critical bug: https://issues.apache.org/jira/browse/SPARK-10474 (Aggregation failed with unable to acquire memory) -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13987.html

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-06 Thread james
I saw a new "spark.shuffle.manager=tungsten-sort" implemented in https://issues.apache.org/jira/browse/SPARK-7081, but I can't find its corresponding description in http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/configuration.html (currently there are only 'sort' and…

Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-08-03 Thread james
Based on the latest Spark code (commit 608353c8e8e50461fafff91a2c885dca8af3aaa8), I used the same Spark SQL query to test two groups of combined configurations; from the results below, it currently doesn't work well with the tungsten-sort shuffle manager: *Test 1# (PASSED)*…

Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-08-02 Thread james
Thank you for your reply! Do you mean that currently, if I want to use this Tungsten feature, I have to set the sort shuffle manager (spark.shuffle.manager=sort), right? However, I saw a slide, Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal, published at Spark Summit 2015, and it…

Came across Spark SQL hang issue with Spark 1.5 Tungsten feature

2015-07-31 Thread james
I tried to enable Tungsten with Spark SQL and set the 3 parameters below, but I found that Spark SQL always hangs at the point below. So could you please point me to the potential cause? I'd appreciate any input. spark.shuffle.manager=tungsten-sort spark.sql.codegen=true spark.sql.unsafe.enabled=true
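For reference, the same three settings expressed through SparkConf rather than spark-defaults.conf (a sketch of the reported setup, not a recommendation):
```
import org.apache.spark.{SparkConf, SparkContext}

// The three parameters from the report, set programmatically.
val conf = new SparkConf()
  .setAppName("tungsten-test")
  .set("spark.shuffle.manager", "tungsten-sort")
  .set("spark.sql.codegen", "true")
  .set("spark.sql.unsafe.enabled", "true")
val sc = new SparkContext(conf)
```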

Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-07-31 Thread james
Another error: 15/07/31 16:15:28 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to bignode1:40443 15/07/31 16:15:28 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 3 is 583 bytes 15/07/31 16:15:28 INFO…

graph.mapVertices() function obtain edge triplets with null attribute

2015-02-26 Thread James
My code:
```
// Initialize the graph: assign each vertex a HyperLogLog counter seeded with its own id
var anfGraph = graph.mapVertices { case (vid, _) =>
  val counter = new HyperLogLog(5)
  counter.offer(vid)
  counter
}
val nullVertex = anfGraph.triplets.filter(edge => edge.srcAttr == null).first
```

Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-13 Thread James
…) // - NullPointerException ``` I found that some vertex attributes in some triplets are null, but not all. Alcaid 2015-02-13 14:50 GMT+08:00 Reynold Xin <r...@databricks.com>: Then maybe you actually had a null in your vertex attribute? On Thu, Feb 12, 2015 at 10:47 PM, James <alcaid1...@gmail.com> wrote…

Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread James
…? On Thu, Feb 12, 2015 at 10:47 PM, James <alcaid1...@gmail.com> wrote: I changed the mapReduceTriplets() func to aggregateMessages(), but it still failed. 2015-02-13 6:52 GMT+08:00 Reynold Xin <r...@databricks.com>: Can you use the new aggregateNeighbors method? I suspect the null is coming from…

Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread James
…need the src or dst vertex data. Occasionally it can fail to detect that. In the new aggregateNeighbors API, the caller needs to explicitly specify that, making it more robust. On Thu, Feb 12, 2015 at 6:26 AM, James <alcaid1...@gmail.com> wrote: Hello, When I am running the code on a much bigger…

Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread James
…is appreciated. Alcaid 2015-02-11 19:30 GMT+08:00 James <alcaid1...@gmail.com>: Hello, Recently I am trying to estimate the average distance of a big graph using Spark with the help of [HyperAnf](http://dl.acm.org/citation.cfm?id=1963493). It works like the Connected Components algorithm, while…

[GraphX] Estimating Average distance of a big graph using GraphX

2015-02-11 Thread James
Hello, Recently I am trying to estimate the average distance of a big graph using Spark with the help of [HyperAnf](http://dl.acm.org/citation.cfm?id=1963493). It works like the Connected Components algorithm, while the attribute of a vertex is a HyperLogLog counter such that at the k-th iteration it…
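A hedged sketch of what one such iteration could look like with GraphX's aggregateMessages, assuming stream-lib's HyperLogLog for the counters (this illustrates the described approach, not the poster's code):
```
import com.clearspring.analytics.stream.cardinality.HyperLogLog
import org.apache.spark.graphx._

// One HyperANF-style iteration: every vertex absorbs its neighbors' counters,
// so after k iterations a vertex's counter estimates its k-neighborhood size.
def iterate(g: Graph[HyperLogLog, Int]): Graph[HyperLogLog, Int] = {
  val msgs: VertexRDD[HyperLogLog] = g.aggregateMessages[HyperLogLog](
    ctx => { ctx.sendToDst(ctx.srcAttr); ctx.sendToSrc(ctx.dstAttr) },
    (a, b) => { val m = new HyperLogLog(5); m.addAll(a); m.addAll(b); m }
  )
  g.joinVertices(msgs) { (_, counter, msg) =>
    val merged = new HyperLogLog(5) // log2m = 5, matching the thread's constructor
    merged.addAll(counter); merged.addAll(msg)
    merged
  }
}
```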

not found: type LocalSparkContext

2015-01-20 Thread James
Hi all, when I was trying to write a test for my Spark application I met:
```
Error:(14, 43) not found: type LocalSparkContext
class HyperANFSuite extends FunSuite with LocalSparkContext {
```
In the source code of spark-core I could not find LocalSparkContext, so I wonder how to write a test…

Re: not found: type LocalSparkContext

2015-01-20 Thread James
…LocalSparkContext, but since the test classes aren't included in the Spark packages, you'll also need to package them up in order to use them in your application (viz., outside of Spark). best, wb - Original Message - From: James <alcaid1...@gmail.com> To: dev@spark.apache.org Sent…
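Since those test classes aren't published, a minimal self-contained stand-in (assuming ScalaTest) looks roughly like this:
```
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterEach, Suite}

// A minimal substitute for Spark's internal LocalSparkContext test trait:
// each test gets a fresh local SparkContext, stopped again afterwards.
trait LocalSparkContext extends BeforeAndAfterEach { self: Suite =>
  @transient var sc: SparkContext = _

  override def beforeEach(): Unit = {
    super.beforeEach()
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName(suiteName))
  }

  override def afterEach(): Unit = {
    try {
      if (sc != null) sc.stop()
      sc = null
    } finally super.afterEach()
  }
}
```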

Using graphx to calculate average distance of a big graph

2015-01-04 Thread James
Recently we have wanted to use Spark to calculate the average shortest-path distance between each reachable pair of nodes in a very big graph. Has anyone ever tried this? We hope to discuss the problem.

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread James Yu
Performance-wise, will foreign data formats be supported the same as native ones? Thanks, James On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian <lian.cs@gmail.com> wrote: The foreign data source API PR also matters here: https://www.github.com/apache/spark/pull/2475 Foreign data sources like ORC can…

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread James Yu
…these APIs will be the same as that for data sources included in the core Spark SQL library. Michael On Thu, Oct 9, 2014 at 2:18 PM, James Yu <jym2...@gmail.com> wrote: Performance-wise, will foreign data formats be supported the same as native ones? Thanks, James On Wed, Oct 8, 2014 at 11:03 PM…

will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread James Yu
I didn't see anyone ask this question before, but I was wondering if anyone knows whether Spark/SparkSQL will support the ORCFile format soon? ORCFile is getting more and more popular in the Hive world. Thanks, James