[jira] [Commented] (SPARK-3482) Allow symlinking to scripts (spark-shell, spark-submit, ...)

2014-09-14 Thread Radim Kolar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133204#comment-14133204
 ] 

Radim Kolar commented on SPARK-3482:


https://github.com/apache/spark/pull/2386

 Allow symlinking to scripts (spark-shell, spark-submit, ...)
 

 Key: SPARK-3482
 URL: https://issues.apache.org/jira/browse/SPARK-3482
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.0.2
 Environment: Unix system (FreeBSD 10.1 with bash)
Reporter: Radim Kolar
Priority: Trivial

 if you have an install scenario where Spark is installed in /usr/local/share/spark 
 and you want to symlink its scripts from /usr/local/bin, e.g. 
 /usr/local/bin/spark-shell -> /usr/local/share/spark/bin/spark-shell, then 
 the scripts fail to locate the Spark install directory correctly. The FWDIR 
 variable needs to be changed to:
 {noformat}
 ## Global script variables
 FWDIR=$(cd $(dirname $(readlink -f $0))/..; pwd)
 {noformat}
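For illustration, what `readlink -f` contributes here can be sketched in Python: without resolving the symlink, the computed "install directory" lands next to the symlink rather than next to the real script (the layout below is made up to mirror the report).

```python
import os
import tempfile

# Build a layout like the one in the report:
#   <tmp>/share/spark/bin/spark-shell   (the real script)
#   <tmp>/bin/spark-shell               (a symlink to it)
root = tempfile.mkdtemp()
real_bin = os.path.join(root, "share", "spark", "bin")
os.makedirs(real_bin)
real_script = os.path.join(real_bin, "spark-shell")
open(real_script, "w").close()
link_dir = os.path.join(root, "bin")
os.makedirs(link_dir)
link = os.path.join(link_dir, "spark-shell")
os.symlink(real_script, link)

# Without symlink resolution (the buggy behaviour): FWDIR is computed
# relative to the symlink's own location, i.e. /usr/local in the report.
fwdir_broken = os.path.abspath(os.path.join(os.path.dirname(link), ".."))

# With resolution, as `readlink -f` does: FWDIR points at the real install.
fwdir_fixed = os.path.abspath(
    os.path.join(os.path.dirname(os.path.realpath(link)), ".."))
```

Here `os.path.realpath` plays the role of `readlink -f`, canonicalizing the whole symlink chain before the `dirname`/`..` walk.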



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3520) java version check in spark-class fails with openjdk

2014-09-14 Thread Radim Kolar (JIRA)
Radim Kolar created SPARK-3520:
--

 Summary: java version check in spark-class fails with openjdk
 Key: SPARK-3520
 URL: https://issues.apache.org/jira/browse/SPARK-3520
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
 Environment: Freebsd 10.1, Openjdk 7
Reporter: Radim Kolar
Priority: Minor


tested on current git master:

(hsn@sanatana:pts/4):spark/bin% ./spark-shell
/home/hsn/live/spark/bin/spark-class: line 111: [: openjdk version "1.7.0_65": 
integer expression expected
(hsn@sanatana:pts/4):spark/bin% java -version
openjdk version "1.7.0_65"
OpenJDK Runtime Environment (build 1.7.0_65-b17)
OpenJDK Server VM (build 24.65-b04, mixed mode)






[jira] [Commented] (SPARK-3520) java version check in spark-class fails with openjdk

2014-09-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133226#comment-14133226
 ] 

Sean Owen commented on SPARK-3520:
--

Duplicated / subsumed by https://issues.apache.org/jira/browse/SPARK-3425







[jira] [Commented] (SPARK-3520) java version check in spark-class fails with openjdk

2014-09-14 Thread Radim Kolar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133243#comment-14133243
 ] 

Radim Kolar commented on SPARK-3520:


{quote}
JAVA_VERSION=$($RUNNER -version 2>&1 | sed 's/openjdk/java/' | sed 's/java version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q')
{quote}

fixes the problem; there is a pull request on SPARK-3425 too
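For clarity, the transformation that sed pipeline performs (assuming the archive-mangled {{21}} was originally {{2>&1}}, since `java -version` prints to stderr, and that the quotes in the pattern were stripped in transit) can be sketched in Python:

```python
import re

def parse_major_minor(version_banner):
    """Mimic the sed pipeline: normalize the 'openjdk' prefix to 'java',
    then join the first two version components into the integer that the
    later comparison in spark-class expects (e.g. "1.7.0_65" -> 17)."""
    line = version_banner.splitlines()[0].replace("openjdk", "java")
    m = re.match(r'java version "(\d+)\.(\d+)\..*"', line)
    if m is None:
        raise ValueError("unrecognized banner: %r" % line)
    return int(m.group(1) + m.group(2))

# OpenJDK 7 prints this banner, which the unpatched check failed on:
print(parse_major_minor('openjdk version "1.7.0_65"'))  # 17
```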







[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-09-14 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133250#comment-14133250
 ] 

Ted Yu commented on SPARK-1297:
---

Please note: spark-1297-v5.txt is a level 0 patch.

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Minor
 Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, 
 spark-1297-v5.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133276#comment-14133276
 ] 

Apache Spark commented on SPARK-1405:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2388

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Xusen Yin
  Labels: features
   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms 
 in MLlib, which rely on optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmenter (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Commented] (SPARK-1449) Please delete old releases from mirroring system

2014-09-14 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1412#comment-1412
 ] 

Matei Zaharia commented on SPARK-1449:
--

Hey folks, sorry for the delay -- will look into this soon.

 Please delete old releases from mirroring system
 

 Key: SPARK-1449
 URL: https://issues.apache.org/jira/browse/SPARK-1449
 Project: Spark
  Issue Type: Task
Affects Versions: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1, 1.0.2
Reporter: Sebb

 To reduce the load on the ASF mirrors, projects are required to delete old 
 releases [1].
 Please can you remove all non-current releases?
 Thanks!
 [Note that older releases are always available from the ASF archive server]
 Any links to older releases on download pages should first be adjusted to 
 point to the archive server.
 [1] http://www.apache.org/dev/release.html#when-to-archive






[jira] [Updated] (SPARK-1449) Please delete old releases from mirroring system

2014-09-14 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1449:
-
Assignee: Patrick Wendell







[jira] [Created] (SPARK-3521) Missing modules in 1.1.0 source distribution - can't be built with Maven

2014-09-14 Thread Radim Kolar (JIRA)
Radim Kolar created SPARK-3521:
--

 Summary: Missing modules in 1.1.0 source distribution - can't be 
built with Maven
 Key: SPARK-3521
 URL: https://issues.apache.org/jira/browse/SPARK-3521
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Radim Kolar
Priority: Minor


modules {{bagel}}, {{mllib}}, {{flume-sink}} and {{flume}} are missing from 
the source distribution, so Spark can't be built with Maven. It can't be built 
with {{sbt/sbt}} either, due to another bug (_java.lang.IllegalStateException: 
impossible to get artifacts when data has not been loaded. IvyNode = 
org.slf4j#slf4j-api;1.6.1_)

(hsn@sanatana:pts/6):work/spark-1.1.0% mvn -Pyarn -Phadoop-2.4 
-Dhadoop.version=2.4.1 -DskipTests clean package
[INFO] Scanning for projects...
[ERROR] The build could not read 1 project - [Help 1]
[ERROR]   
[ERROR]   The project org.apache.spark:spark-parent:1.1.0 
(/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml) has 4 errors
[ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/bagel of 
/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
[ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/mllib of 
/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
[ERROR] Child module 
/home/hsn/myports/spark11/work/spark-1.1.0/external/flume of 
/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
[ERROR] Child module 
/home/hsn/myports/spark11/work/spark-1.1.0/external/flume-sink/pom.xml of 
/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
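A quick way to check whether a local copy of the source tarball is truncated or corrupted is to list it and look for the module directories the root pom.xml references; a sketch (the tarball path and module names are taken from the error output above):

```python
import tarfile

def missing_modules(tarball_path, modules):
    """Return the modules whose directories are absent from the archive."""
    with tarfile.open(tarball_path, "r:gz") as tf:
        names = tf.getnames()
    missing = []
    for mod in modules:
        # Match either an interior directory component or a trailing one.
        if not any("/%s/" % mod in n or n.endswith("/%s" % mod) for n in names):
            missing.append(mod)
    return missing

# Usage against a real download:
#   missing_modules("spark-1.1.0.tgz",
#                   ["bagel", "mllib", "external/flume", "external/flume-sink"])
```

An empty result means the archive at least contains every expected module directory; a non-empty result points at a bad download rather than a build problem.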






[jira] [Commented] (SPARK-3521) Missing modules in 1.1.0 source distribution - can't be built with Maven

2014-09-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133413#comment-14133413
 ] 

Sean Owen commented on SPARK-3521:
--

https://dist.apache.org/repos/dist/release/spark/spark-1.1.0/spark-1.1.0.tgz
All of that source code is plainly in the distribution. It compiles with Maven 
for me, and this was verified by several people during the release. It sounds 
like your copy is corrupted.







[jira] [Updated] (SPARK-1449) Please delete old releases from mirroring system

2014-09-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1449:
---
Affects Version/s: (was: 1.0.2)







[jira] [Updated] (SPARK-1449) Please delete old releases from mirroring system

2014-09-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1449:
---
Fix Version/s: 0.8.1
   0.9.1
   1.0.0
   1.0.1
   0.9.2







[jira] [Resolved] (SPARK-1449) Please delete old releases from mirroring system

2014-09-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1449.

Resolution: Fixed

I've left 1.0.2 and 1.1.0, since 1.1.0 is an unstable release.







[jira] [Created] (SPARK-3522) Make spark-ec2 verbosity configurable

2014-09-14 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-3522:
---

 Summary: Make spark-ec2 verbosity configurable
 Key: SPARK-3522
 URL: https://issues.apache.org/jira/browse/SPARK-3522
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor


When launching a cluster, {{spark-ec2}} spits out a lot of stuff that feels 
like debug output. It would be better for the user if {{spark-ec2}} did the 
following:
* default to info output level
* allow option to increase verbosity and include debug output

This will require converting most of the {{print}} statements in the script to 
use Python's {{logging}} module.
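A minimal sketch of that conversion (the flag handling, function names, and messages are illustrative, not spark-ec2's actual interface):

```python
import logging

logger = logging.getLogger("spark_ec2")

def configure_logging(verbose=False):
    """Default to INFO; emit DEBUG output only when the user opts in."""
    level = logging.DEBUG if verbose else logging.INFO
    logging.basicConfig(format="%(levelname)s: %(message)s", level=level)
    logger.setLevel(level)

def launch_cluster():
    logger.info("Launching cluster...")        # replaces a bare print
    logger.debug("Raw EC2 API response: ...")  # hidden unless verbose

configure_logging(verbose=False)
launch_cluster()
```

With this shape, a `--verbose` command-line option only needs to flip the boolean passed to `configure_logging`; every converted `print` chooses its own level instead.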






[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...

2014-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133444#comment-14133444
 ] 

Apache Spark commented on SPARK-2594:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/2390

 Add CACHE TABLE name AS SELECT ...
 

 Key: SPARK-2594
 URL: https://issues.apache.org/jira/browse/SPARK-2594
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical








[jira] [Created] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)
Larry Xiao created SPARK-3523:
-

 Summary: GraphX graph partitioning strategy
 Key: SPARK-3523
 URL: https://issues.apache.org/jira/browse/SPARK-3523
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.2
Reporter: Larry Xiao


We implemented and evaluated several partitioning algorithms on GraphX and 
found that the partitioning has room for improvement. We are seeking opinions 
and advice.

h5. Motivation
* Real-world graphs follow a power law; e.g., on Twitter, 1% of the vertices 
are adjacent to nearly half of the edges.
* A high-degree vertex concentrates vast resources, so the workload of the 
few high-degree vertices should be decomposed across all machines.
* The computation on a low-degree vertex is quite small, so the locality of 
that computation should be exploited.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png!
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png!
* Greedy BiCut
  * a heuristic algorithm for bipartite graphs

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png!

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation provided by 
PartitionStrategy.scala, we need to think of a way to give partitioning 
strategies access to the graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning
- PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
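As a rough illustration of the degree-threshold idea behind HybridCut (a sketch inferred from the motivation above, not the actual GraphX code; the function name and threshold are made up): edges into low-degree vertices are grouped by destination to keep their cheap computation local, while edges into high-degree vertices are spread across partitions by source.

```python
def hybrid_cut_partition(edges, in_degree, num_parts, threshold):
    """Assign each directed edge (src, dst) to a partition id.

    Edges into low-degree vertices are grouped by destination (locality
    for cheap computation); edges into high-degree vertices are spread
    by source (decomposing the hot vertex across machines).
    """
    assignment = {}
    for src, dst in edges:
        if in_degree[dst] <= threshold:
            assignment[(src, dst)] = hash(dst) % num_parts
        else:
            assignment[(src, dst)] = hash(src) % num_parts
    return assignment
```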






[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png!
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png!

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs



[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

H5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

H5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png
* Greedy BiCut
  * a heuristic algorithm for bipartite

H5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png

H5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

H5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs



[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented and evaluated several graph partitioning algorithms on GraphX, and found that the current partitioning leaves room for improvement. We are seeking opinions and advice.

h5. Motivation
* Graphs in the real world follow a power law. E.g., on Twitter, 1% of the vertices are adjacent to nearly half of the edges.
* A high-degree vertex concentrates vast resources on one machine, so the workload of the few high-degree vertices should be spread across all machines.
* The computation on a low-degree vertex is quite small, so the locality of computation on low-degree vertices should be exploited.

h5. Algorithm Description
* HybridCut
  * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png!
* HybridCutPlus
  * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png!
* Greedy BiCut
  * a heuristic algorithm for bipartite graphs

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|scale=50%!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|scale=50%!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|scale=50%!

h5. Code
* https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation provided by PartitionStrategy.scala, we need to find a way to give partitioning strategies access to the graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning
- PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
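For readers skimming the digest, the core idea behind HybridCut (treating high- and low-degree vertices differently, as in the PowerLyra paper cited above) can be sketched in a few lines. This is only an illustrative sketch, not the GraphX implementation; the `threshold` default and the function name are our own assumptions.

```python
def hybrid_cut(edges, in_degree, num_parts, threshold=100):
    """Assign each edge (src, dst) to a partition.

    Low in-degree dst: hash by dst, so all in-edges of a low-degree
    vertex land in one partition (locality).
    High in-degree dst: hash by src, so the many in-edges of a hub
    are spread over all partitions (load balance).
    """
    assignment = {}
    for (src, dst) in edges:
        if in_degree[dst] <= threshold:
            assignment[(src, dst)] = hash(dst) % num_parts
        else:
            assignment[(src, dst)] = hash(src) % num_parts
    return assignment
```

Lowering `threshold` trades locality for balance; the plots in the Result section compare exactly this kind of trade-off.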



 GraphX graph partitioning strategy
 --

 Key: SPARK-3523
 URL: https://issues.apache.org/jira/browse/SPARK-3523
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.2
Reporter: Larry Xiao

 We implemented and evaluated several graph partitioning algorithms on GraphX, and found that the current partitioning leaves room for improvement. We are seeking opinions and advice.
 h5. Motivation
 * Graphs in the real world follow a power law. E.g., on Twitter, 1% of the vertices are adjacent to nearly half of the edges.
 * A high-degree vertex concentrates vast resources on one machine, so the workload of the few high-degree vertices should be spread across all machines.
 * The computation on a low-degree vertex is quite small, so the locality of computation on low-degree vertices should be exploited.
 h5. Algorithm Description
 * HybridCut
   * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png!
 * HybridCutPlus
   * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png!
 * Greedy BiCut
   * a heuristic algorithm for bipartite graphs
 h5. Result
 !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|scale=50%!
 !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|scale=50%!
 

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented and evaluated several graph partitioning algorithms on GraphX, and found that the current partitioning leaves room for improvement. We are seeking opinions and advice.

h5. Motivation
* Graphs in the real world follow a power law. E.g., on Twitter, 1% of the vertices are adjacent to nearly half of the edges.
* A high-degree vertex concentrates vast resources on one machine, so the workload of the few high-degree vertices should be spread across all machines.
* The computation on a low-degree vertex is quite small, so the locality of computation on low-degree vertices should be exploited.

h5. Algorithm Description
* HybridCut
  * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus
  * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite graphs

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!

h5. Code
* https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation provided by PartitionStrategy.scala, we need to find a way to give partitioning strategies access to the graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning
- PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
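The issue does not spell out Greedy BiCut beyond "a heuristic algorithm for bipartite graphs". Purely as a hypothetical illustration of what a greedy edge-placement heuristic of this family looks like (in the spirit of PowerGraph's greedy vertex-cut, not necessarily the author's algorithm):

```python
from collections import defaultdict

def greedy_edge_partition(edges, num_parts):
    """Greedy vertex-cut: place each edge on the least-loaded partition
    among those already holding one of its endpoints; if neither endpoint
    has been seen, fall back to the globally least-loaded partition."""
    load = [0] * num_parts
    seen = defaultdict(set)              # vertex -> partitions it touches
    assignment = {}
    for (src, dst) in edges:
        candidates = seen[src] | seen[dst]
        if candidates:
            part = min(candidates, key=lambda p: load[p])
        else:
            part = min(range(num_parts), key=lambda p: load[p])
        assignment[(src, dst)] = part
        load[part] += 1
        seen[src].add(part)
        seen[dst].add(part)
    return assignment
```

Each edge prefers a partition that already holds one of its endpoints, which keeps the replication factor down, while the load counter keeps partitions balanced.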




[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented and evaluated several graph partitioning algorithms on GraphX, and found that the current partitioning leaves room for improvement. We are seeking opinions and advice.

h5. Motivation
* Graphs in the real world follow a power law. E.g., on Twitter, 1% of the vertices are adjacent to nearly half of the edges.
* A high-degree vertex concentrates vast resources on one machine, so the workload of the few high-degree vertices should be spread across all machines.
* The computation on a low-degree vertex is quite small, so the locality of computation on low-degree vertices should be exploited.

h5. Algorithm Description
* HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite graphs

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!

h5. Code
* https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation provided by PartitionStrategy.scala, we need to find a way to give partitioning strategies access to the graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning
- PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs




[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented and evaluated several graph partitioning algorithms on GraphX, and found that the current partitioning leaves room for improvement. We are seeking opinions and advice.

h5. Motivation
* Graphs in the real world follow a power law. E.g., on Twitter, 1% of the vertices are adjacent to nearly half of the edges.
* A high-degree vertex concentrates vast resources on one machine, so the workload of the few high-degree vertices should be spread across all machines.
* The computation on a low-degree vertex is quite small, so the locality of computation on low-degree vertices should be exploited.

h5. Algorithm Description
* HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite graphs

h5. Result
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
The left Y axis is the replication factor; the right axis is the balance (measured using CV, the coefficient of variation) of vertices or edges across all partitions. Edge balance indicates computation balance, and vertex balance indicates communication balance.
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
This is an example of a balanced partitioning's saving on communication.
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
* in-2.0-1m is a generated power-law graph with alpha = 2.0

h5. Code
* https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation provided by PartitionStrategy.scala, we need to find a way to give partitioning strategies access to the graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning
- PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
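The two metrics plotted above can be stated precisely: the replication factor is the average number of partitions each vertex appears in, and CV is the standard deviation divided by the mean of the per-partition counts (0 means perfectly balanced). A small self-contained sketch, with function names of our own choosing:

```python
from collections import defaultdict
from statistics import mean, pstdev

def replication_factor(edge_partitions):
    """edge_partitions: dict (src, dst) -> partition id."""
    replicas = defaultdict(set)          # vertex -> partitions holding a copy
    for (src, dst), part in edge_partitions.items():
        replicas[src].add(part)
        replicas[dst].add(part)
    return mean(len(parts) for parts in replicas.values())

def balance_cv(edge_partitions, num_parts):
    """Coefficient of variation of edges per partition."""
    counts = [0] * num_parts
    for part in edge_partitions.values():
        counts[part] += 1
    return pstdev(counts) / mean(counts)
```

The same `balance_cv` applied to per-partition vertex counts gives the vertex-balance curve from the plot.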




[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented and evaluated several graph partitioning algorithms on GraphX, and found that the current partitioning leaves room for improvement. We are seeking opinions and advice.

h5. Motivation
* Graphs in the real world follow a power law. E.g., on Twitter, 1% of the vertices are adjacent to nearly half of the edges.
* A high-degree vertex concentrates vast resources on one machine, so the workload of the few high-degree vertices should be spread across all machines.
* The computation on a low-degree vertex is quite small, so the locality of computation on low-degree vertices should be exploited.

h5. Algorithm Description
* HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite graphs

h5. Result
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png!
  * The left Y axis is the replication factor; the right axis is the balance (measured using CV, the coefficient of variation) of vertices or edges across all partitions. Edge balance indicates computation balance, and vertex balance indicates communication balance.
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
  * This is an example of a balanced partitioning's saving on communication.
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power-law graph with alpha = 2.0

h5. Code
* https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation provided by PartitionStrategy.scala, we need to find a way to give partitioning strategies access to the graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning
- PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
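The in-2.0-1m input is described only as a generated power-law graph with alpha = 2.0. One common way to build such a test graph, offered here purely as a hypothetical reconstruction of the generator, is to sample each vertex's in-degree from a discrete distribution with P(d) proportional to d^(-alpha):

```python
import random

def sample_power_law_degree(alpha, max_degree, rng):
    """Sample a degree d in [1, max_degree] with P(d) proportional to d^(-alpha)."""
    degrees = range(1, max_degree + 1)
    weights = [d ** (-alpha) for d in degrees]
    return rng.choices(degrees, weights=weights, k=1)[0]

def power_law_graph(num_vertices, alpha, max_degree=1000, seed=42):
    """Return an edge list whose in-degrees follow a power law:
    each vertex draws an in-degree, then receives that many edges
    from uniformly random source vertices."""
    rng = random.Random(seed)
    edges = []
    for dst in range(num_vertices):
        for _ in range(sample_power_law_degree(alpha, max_degree, rng)):
            edges.append((rng.randrange(num_vertices), dst))
    return edges
```

With alpha = 2.0 most vertices draw an in-degree of 1 while a few draw large degrees, reproducing the skew that motivates the hybrid strategies.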




[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented and evaluated several graph partitioning algorithms on GraphX, and found that the current partitioning leaves room for improvement. We are seeking opinions and advice.

h5. Motivation
* Graphs in the real world follow a power law. E.g., on Twitter, 1% of the vertices are adjacent to nearly half of the edges.
* A high-degree vertex concentrates vast resources on one machine, so the workload of the few high-degree vertices should be spread across all machines.
* The computation on a low-degree vertex is quite small, so the locality of computation on low-degree vertices should be exploited.

h5. Algorithm Description
* HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite graphs

h5. Result
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=600!
  * The left Y axis is the replication factor; the right axis is the balance (measured using CV, the coefficient of variation) of vertices or edges across all partitions. Edge balance indicates computation balance, and vertex balance indicates communication balance.
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
  * This is an example of a balanced partitioning's saving on communication.
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power-law graph with alpha = 2.0

h5. Code
* https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation provided by PartitionStrategy.scala, we need to find a way to give partitioning strategies access to the graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning
- PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs




[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented and evaluated several graph partitioning algorithms on GraphX, and found that the current partitioning leaves room for improvement. We are seeking opinions and advice.

h5. Motivation
* Graphs in the real world follow a power law. E.g., on Twitter, 1% of the vertices are adjacent to nearly half of the edges.
* A high-degree vertex concentrates vast resources on one machine, so the workload of the few high-degree vertices should be spread across all machines.
* The computation on a low-degree vertex is quite small, so the locality of computation on low-degree vertices should be exploited.

h5. Algorithm Description
* HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite graphs

h5. Result
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
  * The left Y axis is the replication factor; the right axis is the balance (measured using CV, the coefficient of variation) of vertices or edges across all partitions. Edge balance indicates computation balance, and vertex balance indicates communication balance.
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
  * This is an example of a balanced partitioning's saving on communication.
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power-law graph with alpha = 2.0

h5. Code
* https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation provided by PartitionStrategy.scala, we need to find a way to give partitioning strategies access to the graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning
- PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
  * The left Y axis is replication factor, the right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges across all 
partitions. The balance of edges indicates computation balance, and the balance 
of vertices indicates communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 
  * This is an example of the communication savings from a balanced partitioning.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power-law graph with alpha = 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* The implementation breaks the current separation from 
PartitionStrategy.scala, so we need to find a way to give partition strategies access to the graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


 GraphX graph partitioning strategy
 --

 Key: SPARK-3523
 URL: https://issues.apache.org/jira/browse/SPARK-3523
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.2
Reporter: Larry Xiao

 We implemented and evaluated several partitioning algorithms on GraphX, and 
 found that the partitioning has room for improvement. We are seeking opinions and advice.
 h5. Motivation
 * Graph in real 

[jira] [Created] (SPARK-3524) remove workaround to pickle array of float for Pyrolite

2014-09-14 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3524:
-

 Summary: remove workaround to pickle array of float for Pyrolite
 Key: SPARK-3524
 URL: https://issues.apache.org/jira/browse/SPARK-3524
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


After Pyrolite release a new version with PR 
https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround 
introduced in PR https://github.com/apache/spark/pull/2365






[jira] [Updated] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-09-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3039:
---
Assignee: Bertrand Bossy

 Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 
 1 API
 --

 Key: SPARK-3039
 URL: https://issues.apache.org/jira/browse/SPARK-3039
 Project: Spark
  Issue Type: Bug
  Components: Build, Input/Output, Spark Core
Affects Versions: 0.9.1, 1.0.0, 1.1.0
 Environment: hadoop2, hadoop-2.4.0, HDP-2.1
Reporter: Bertrand Bossy
Assignee: Bertrand Bossy

 The spark assembly contains the artifact org.apache.avro:avro-mapred as a 
 dependency of org.spark-project.hive:hive-serde.
 The avro-mapred package provides a hadoop FileInputFormat to read and write 
 avro files. There are two versions of this package, distinguished by a 
 classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. 
 avro-mapred for the old Hadoop API uses no classifier.
 E.g. when reading avro files using 
 {code}
 sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
 {code}
 The following error occurs:
 {code}
 java.lang.IncompatibleClassChangeError: Found interface 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at 
 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.&lt;init&gt;(NewHadoopRDD.scala:111)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 This error usually is a hint that there was a mix up of the old and the new 
 Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to 
 appear before the version that is bundled with Spark, reading avro files 
 works fine. 
 Also, if Spark is built using avro-mapred for hadoop2, it works fine as well.
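The work-around described above can be expressed as an explicit dependency override in the consuming project's POM. A minimal sketch; the avro-mapred version (1.7.6 here) is an assumption and should match the Avro version Spark was built against:

```xml
<!-- Declare avro-mapred with the hadoop2 classifier BEFORE the Spark
     dependency, so it takes precedence over the hadoop1 variant that
     is pulled in transitively via hive-serde. -->
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.6</version>
  <classifier>hadoop2</classifier>
</dependency>
```

Maven resolves version conflicts by nearest definition, so a direct dependency like this wins over the transitive hadoop1 artifact on the classpath.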






[jira] [Resolved] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-09-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3039.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Target Version/s: 1.1.1, 1.2.0

Resolved by:
https://github.com/apache/spark/pull/1945

 Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 
 1 API
 --

 Key: SPARK-3039
 URL: https://issues.apache.org/jira/browse/SPARK-3039
 Project: Spark
  Issue Type: Bug
  Components: Build, Input/Output, Spark Core
Affects Versions: 0.9.1, 1.0.0, 1.1.0
 Environment: hadoop2, hadoop-2.4.0, HDP-2.1
Reporter: Bertrand Bossy
Assignee: Bertrand Bossy
 Fix For: 1.1.1, 1.2.0


 The spark assembly contains the artifact org.apache.avro:avro-mapred as a 
 dependency of org.spark-project.hive:hive-serde.
 The avro-mapred package provides a hadoop FileInputFormat to read and write 
 avro files. There are two versions of this package, distinguished by a 
 classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. 
 avro-mapred for the old Hadoop API uses no classifier.
 E.g. when reading avro files using 
 {code}
 sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
 {code}
 The following error occurs:
 {code}
 java.lang.IncompatibleClassChangeError: Found interface 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at 
 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.&lt;init&gt;(NewHadoopRDD.scala:111)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 This error usually is a hint that there was a mix up of the old and the new 
 Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to 
 appear before the version that is bundled with Spark, reading avro files 
 works fine. 
 Also, if Spark is built using avro-mapred for hadoop2, it works fine as well.






[jira] [Resolved] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on

2014-09-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3452.

Resolution: Fixed

Fixed by:
https://github.com/apache/spark/pull/2329

 Maven build should skip publishing artifacts people shouldn't depend on
 ---

 Key: SPARK-3452
 URL: https://issues.apache.org/jira/browse/SPARK-3452
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0, 1.1.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical

 I think it's easy to do this by just adding a skip configuration somewhere. 
 We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.
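One way to implement such a skip, assuming the standard Maven deploy plugin handles publishing, is to disable deployment in each module that should not be published (a sketch, not the actual patch):

```xml
<!-- In the POM of a module that should not be published
     (e.g. repl, yarn, assembly, tools, examples): -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-deploy-plugin</artifactId>
      <configuration>
        <!-- mvn deploy becomes a no-op for this module -->
        <skip>true</skip>
      </configuration>
    </plugin>
  </plugins>
</build>
```

The module still builds and installs locally; only the deploy (publish) step is skipped.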


