Re: [GRAPHX] Graph Algorithms and Spark

2016-04-21 Thread Denny Lee
BTW, we recently had a webinar on GraphFrames at http://go.databricks.com/graphframes-dataframe-based-graphs-for-apache-spark On Thu, Apr 21, 2016 at 14:30 Dimitris Kouzis - Loukas wrote: > This thread is good. Maybe it should make it to doc or the users group > > On Thu, Apr

Re: [GRAPHX] Graph Algorithms and Spark

2016-04-21 Thread Dimitris Kouzis - Loukas
This thread is good. Maybe it should make it to doc or the users group On Thu, Apr 21, 2016 at 9:25 PM, Zhan Zhang wrote: > > You can take a look at this blog from Databricks about GraphFrames > > https://databricks.com/blog/2016/03/03/introducing-graphframes.html > >

Re: Improving system design logging in spark

2016-04-21 Thread Ali Tootoonchian
Hi, My point for #2 is to distinguish between how long each task takes to read data from disk and how long it takes to transfer that data over the network to the target node. As far as I know (correct me if I'm wrong), the block fetch time includes both the remote node reading the data and transferring it to the requested

Re: RFC: Remote "HBaseTest" from examples?

2016-04-21 Thread Ted Yu
Zhan: I have mentioned the JIRA numbers in the thread starting with (note the typo in subject of this thread): RFC: Remove ... On Thu, Apr 21, 2016 at 1:28 PM, Zhan Zhang wrote: > FYI: There are several pending patches for DataFrame support on top of > HBase. > >

Re: [Spark-SQL] Reduce Shuffle Data by pushing filter toward storage

2016-04-21 Thread atootoonchian
I created an issue in the Spark project: SPARK-14820

Re: [GRAPHX] Graph Algorithms and Spark

2016-04-21 Thread Zhan Zhang
You can take a look at this blog from Databricks about GraphFrames https://databricks.com/blog/2016/03/03/introducing-graphframes.html Thanks. Zhan Zhang On Apr 21, 2016, at 12:53 PM, Robin East > wrote: Hi Aside from LDA, which is

Re: RFC: Remote "HBaseTest" from examples?

2016-04-21 Thread Zhan Zhang
FYI: There are several pending patches for DataFrame support on top of HBase. Thanks. Zhan Zhang On Apr 20, 2016, at 2:43 AM, Saisai Shao > wrote: +1, HBaseTest in Spark Example is quite old and obsolete, the HBase connector in HBase repo

Re: [GRAPHX] Graph Algorithms and Spark

2016-04-21 Thread Robin East
Hi, Aside from LDA, which is implemented in MLlib, GraphX has the following built-in algorithms: PageRank/Personalised PageRank, Connected Components, Strongly Connected Components, Triangle Count, Shortest Paths, Label Propagation. It also implements a version of the Pregel framework, a form of
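As a rough illustration of one of the algorithms listed above, here is a minimal plain-Python sketch of connected components by repeated label propagation. It is not GraphX code and does not use any Spark API; the function name `connected_components` and the sample graph are hypothetical. Each vertex ends up labeled with the lowest vertex ID in its component.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex ID in its component.

    Hypothetical helper for illustration only; this is a single-machine
    sketch, not the GraphX implementation.
    """
    # Every vertex starts out as its own component.
    label = {v: v for v in vertices}

    changed = True
    while changed:
        changed = False
        # Treat edges as undirected: both endpoints adopt the lower label.
        for a, b in edges:
            low = min(label[a], label[b])
            if label[a] != low:
                label[a] = low
                changed = True
            if label[b] != low:
                label[b] = low
                changed = True
    return label


# Sample graph (assumed data): two components plus one isolated vertex.
sample_vertices = [1, 2, 3, 4, 5, 6]
sample_edges = [(1, 2), (2, 3), (4, 5)]
sample_labels = connected_components(sample_vertices, sample_edges)
```

The loop stops once a full pass over the edges changes no label, which is the same fixed-point idea that iterative vertex-centric frameworks rely on.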

Re: [GRAPHX] Graph Algorithms and Spark

2016-04-21 Thread Krishna Sankar
Hi, 1. Yep, GraphX is stable and would be a good choice for you to implement algorithms. For a quick intro you can refer to our Strata MLlib tutorial GraphX slides http://goo.gl/Ffq2Az 2. GraphX has implemented algorithms like PageRank & ConnectedComponents[1] 3. It also has

[GRAPHX] Graph Algorithms and Spark

2016-04-21 Thread tgensol
Hi there, I am working in a group at the University of Michigan, and we are trying to build (and first find) some distributed graph algorithms. I know Spark, and I found GraphX. I read the docs, but I only found Latent Dirichlet Allocation algorithms working with GraphX, so I was wondering why?

Re: [Spark-SQL] Reduce Shuffle Data by pushing filter toward storage

2016-04-21 Thread Ted Yu
Interesting analysis. Can you log a JIRA? > On Apr 21, 2016, at 11:07 AM, atootoonchian wrote: > > SQL query planner can have intelligence to push down filter commands towards > the storage layer. If we optimize the query planner such that the IO to the > storage is reduced

Re: [Spark-SQL] Reduce Shuffle Data by pushing filter toward storage

2016-04-21 Thread atootoonchian
Hi Marcin, I attached a PDF version of the issue: Reduce_Shuffle_Data_by_pushing_filter_toward_storage.pdf

Re: [Spark-SQL] Reduce Shuffle Data by pushing filter toward storage

2016-04-21 Thread Marcin Tustin
I think that's an important result. Could you format your email to split out your parts a little more? It all runs together for me in gmail, so it's hard to follow, and I very much would like to. On Thu, Apr 21, 2016 at 2:07 PM, atootoonchian wrote: > SQL query planner can have

[Spark-SQL] Reduce Shuffle Data by pushing filter toward storage

2016-04-21 Thread atootoonchian
The SQL query planner can be made smart enough to push filter operations down towards the storage layer. If we optimize the query planner so that IO to storage is reduced at the cost of running multiple filters (i.e., more compute), this should be desirable when the system is IO bound. An example to
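A toy illustration of the trade-off described in this message, written as a plain-Python sketch rather than Spark code. The names `scan`, `query_without_pushdown`, and `query_with_pushdown` are hypothetical, the data is made up, and the number of rows crossing the simulated storage boundary stands in for IO volume.

```python
def scan(rows, pushed_filter=None):
    """Simulate a storage scan, optionally applying a filter at the source."""
    if pushed_filter is None:
        return list(rows)
    return [r for r in rows if pushed_filter(r)]


def query_without_pushdown(rows, predicate):
    # All rows leave storage; the filter runs afterwards in the compute layer.
    transferred = scan(rows)
    result = [r for r in transferred if predicate(r)]
    return result, len(transferred)


def query_with_pushdown(rows, predicate):
    # The filter runs inside the scan, so only matching rows leave storage.
    transferred = scan(rows, pushed_filter=predicate)
    return transferred, len(transferred)


# Assumed sample data: 1000 rows, one in ten matches the predicate.
rows = [{"id": i, "value": i % 10} for i in range(1000)]


def predicate(row):
    return row["value"] == 3


plain_result, plain_io = query_without_pushdown(rows, predicate)
pushed_result, pushed_io = query_with_pushdown(rows, predicate)
```

Both plans return the same rows, but the pushed-down plan moves only the matching rows out of storage, which is the saving that matters when the system is IO bound.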

Re: Re: Re: Spark SQL and Hive produce different results with the same SQL

2016-04-21 Thread FangFang Chen
I may have found the root cause in the Spark docs: "Unlimited precision decimal columns are no longer supported, instead Spark SQL enforces a maximum precision of 38. When inferring schema from BigDecimal objects, a precision of (38, 18) is now used. When no precision is specified in DDL then the
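To see why a fixed precision of (38, 18) can change a result, here is a small sketch using Python's standard `decimal` module. It is not Spark or Hive code; the helper name `to_decimal_38_18` is hypothetical, and the half-up rounding mode is an assumption made for illustration, not a statement about what either engine does.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# At most 38 significant digits in total. Rounding mode is an assumption.
CTX = Context(prec=38, rounding=ROUND_HALF_UP)

# Exponent of -18, i.e. exactly 18 digits after the decimal point.
SCALE = Decimal("1E-18")


def to_decimal_38_18(text):
    """Parse text and force it to 18 fractional digits (hypothetical helper)."""
    return Decimal(text).quantize(SCALE, context=CTX)


# A value with 25 fractional digits loses its last 7 digits.
original = Decimal("0.1234567890123456789012345")
fitted = to_decimal_38_18("0.1234567890123456789012345")
lost = original - fitted
```

Here `fitted` keeps only 18 fractional digits, so `lost` is non-zero. Two engines that carry different scales through the same query can therefore disagree in the trailing digits.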