[jira] [Commented] (SPARK-3482) Allow symlinking to scripts (spark-shell, spark-submit, ...)
[ https://issues.apache.org/jira/browse/SPARK-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133204#comment-14133204 ] Radim Kolar commented on SPARK-3482: https://github.com/apache/spark/pull/2386 Allow symlinking to scripts (spark-shell, spark-submit, ...) Key: SPARK-3482 URL: https://issues.apache.org/jira/browse/SPARK-3482 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.2 Environment: Unix system (FreeBSD 10.1 with bash) Reporter: Radim Kolar Priority: Trivial If you have an install scenario where Spark is installed in /usr/local/share/spark and you want to link its scripts from /usr/local/bin, e.g. /usr/local/bin/spark-shell -> /usr/local/share/spark/bin/spark-shell, then the scripts fail to locate the Spark install directory correctly. The FWDIR variable needs to be changed to: {noformat} ## Global script variables FWDIR=$(cd $(dirname $(readlink -f $0))/..; pwd) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
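The proposed fix relies on `readlink -f`, which is available on GNU coreutils and FreeBSD but not on all platforms (notably older macOS). A portable sketch of the same idea, following chained symlinks manually, might look like this (`resolve_script_dir` is a hypothetical helper, not part of Spark's scripts):

```shell
#!/usr/bin/env bash
# Sketch: resolve the real directory of a possibly-symlinked script,
# so FWDIR can be derived from the install location rather than the
# symlink location. Assumes bash; works without `readlink -f`.
resolve_script_dir() {
  local src="$1"
  while [ -h "$src" ]; do                       # follow chained symlinks
    local dir
    dir=$(cd -P "$(dirname "$src")" && pwd)
    src=$(readlink "$src")
    # a relative symlink target is resolved against the link's directory
    [[ $src != /* ]] && src="$dir/$src"
  done
  cd -P "$(dirname "$src")" && pwd
}

# FWDIR would then be the parent of the resolved bin/ directory:
# FWDIR=$(cd "$(resolve_script_dir "$0")/.."; pwd)
```

On systems where `readlink -f` exists (as on the reporter's FreeBSD 10.1), the one-liner in the issue is equivalent and simpler.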
[jira] [Created] (SPARK-3520) java version check in spark-class fails with openjdk
Radim Kolar created SPARK-3520: -- Summary: java version check in spark-class fails with openjdk Key: SPARK-3520 URL: https://issues.apache.org/jira/browse/SPARK-3520 Project: Spark Issue Type: Bug Components: Spark Shell Environment: Freebsd 10.1, Openjdk 7 Reporter: Radim Kolar Priority: Minor tested on current git master: (hsn@sanatana:pts/4):spark/bin% ./spark-shell /home/hsn/live/spark/bin/spark-class: line 111: [: openjdk version 1.7.0_65: integer expression expected (hsn@sanatana:pts/4):spark/bin% java -version openjdk version 1.7.0_65 OpenJDK Runtime Environment (build 1.7.0_65-b17) OpenJDK Server VM (build 24.65-b04, mixed mode)
[jira] [Commented] (SPARK-3520) java version check in spark-class fails with openjdk
[ https://issues.apache.org/jira/browse/SPARK-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133226#comment-14133226 ] Sean Owen commented on SPARK-3520: -- Duplicated / subsumed by https://issues.apache.org/jira/browse/SPARK-3425 java version check in spark-class fails with openjdk Key: SPARK-3520 URL: https://issues.apache.org/jira/browse/SPARK-3520 Project: Spark Issue Type: Bug Components: Spark Shell Environment: Freebsd 10.1, Openjdk 7 Reporter: Radim Kolar Priority: Minor tested on current git master: (hsn@sanatana:pts/4):spark/bin% ./spark-shell /home/hsn/live/spark/bin/spark-class: line 111: [: openjdk version 1.7.0_65: integer expression expected (hsn@sanatana:pts/4):spark/bin% java -version openjdk version 1.7.0_65 OpenJDK Runtime Environment (build 1.7.0_65-b17) OpenJDK Server VM (build 24.65-b04, mixed mode)
[jira] [Commented] (SPARK-3520) java version check in spark-class fails with openjdk
[ https://issues.apache.org/jira/browse/SPARK-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133243#comment-14133243 ] Radim Kolar commented on SPARK-3520: {quote} JAVA_VERSION=$($RUNNER -version 2>&1 | sed 's/openjdk/java/' | sed 's/java version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q') {quote} fixes the problem; there is a pull request in SPARK-3425 too java version check in spark-class fails with openjdk Key: SPARK-3520 URL: https://issues.apache.org/jira/browse/SPARK-3520 Project: Spark Issue Type: Bug Components: Spark Shell Environment: Freebsd 10.1, Openjdk 7 Reporter: Radim Kolar Priority: Minor tested on current git master: (hsn@sanatana:pts/4):spark/bin% ./spark-shell /home/hsn/live/spark/bin/spark-class: line 111: [: openjdk version 1.7.0_65: integer expression expected (hsn@sanatana:pts/4):spark/bin% java -version openjdk version 1.7.0_65 OpenJDK Runtime Environment (build 1.7.0_65-b17) OpenJDK Server VM (build 24.65-b04, mixed mode)
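The quoted pipeline can be exercised without invoking a JVM by feeding it sample `-version` banners. In this sketch the sample strings stand in for `"$RUNNER" -version 2>&1`, and `parse_java_version` is a hypothetical wrapper around the two sed commands from the comment:

```shell
#!/usr/bin/env bash
# Sketch: the fix normalizes "openjdk version" lines to "java version"
# before extracting the major.minor digits, so both vendors parse to
# the same integer that spark-class compares against.
parse_java_version() {
  sed 's/openjdk/java/' | sed 's/java version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q'
}

echo 'java version "1.7.0_65"'    | parse_java_version   # -> 17
echo 'openjdk version "1.7.0_65"' | parse_java_version   # -> 17
```

Without the `s/openjdk/java/` rewrite, the OpenJDK banner never matches the extraction pattern, and the unparsed string reaches the `[` integer comparison, producing the "integer expression expected" error above.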
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133250#comment-14133250 ] Ted Yu commented on SPARK-1297: --- Please note: spark-1297-v5.txt is a level 0 patch. Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, spark-1297-v5.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133276#comment-14133276 ] Apache Spark commented on SPARK-1405: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/2388 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Xusen Yin Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from the current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core.
[jira] [Commented] (SPARK-1449) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1412#comment-1412 ] Matei Zaharia commented on SPARK-1449: -- Hey folks, sorry for the delay -- will look into this soon. Please delete old releases from mirroring system Key: SPARK-1449 URL: https://issues.apache.org/jira/browse/SPARK-1449 Project: Spark Issue Type: Task Affects Versions: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1, 1.0.2 Reporter: Sebb To reduce the load on the ASF mirrors, projects are required to delete old releases [1] Please can you remove all non-current releases? Thanks! [Note that older releases are always available from the ASF archive server] Any links to older releases on download pages should first be adjusted to point to the archive server. [1] http://www.apache.org/dev/release.html#when-to-archive
[jira] [Updated] (SPARK-1449) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1449: - Assignee: Patrick Wendell Please delete old releases from mirroring system Key: SPARK-1449 URL: https://issues.apache.org/jira/browse/SPARK-1449 Project: Spark Issue Type: Task Affects Versions: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1, 1.0.2 Reporter: Sebb Assignee: Patrick Wendell To reduce the load on the ASF mirrors, projects are required to delete old releases [1] Please can you remove all non-current releases? Thanks! [Note that older releases are always available from the ASF archive server] Any links to older releases on download pages should first be adjusted to point to the archive server. [1] http://www.apache.org/dev/release.html#when-to-archive
[jira] [Created] (SPARK-3521) Missing modules in 1.1.0 source distribution - cant be build with maven
Radim Kolar created SPARK-3521: -- Summary: Missing modules in 1.1.0 source distribution - cant be build with maven Key: SPARK-3521 URL: https://issues.apache.org/jira/browse/SPARK-3521 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Radim Kolar Priority: Minor The modules {{bagel}}, {{mllib}}, {{flume-sink}} and {{flume}} are missing from the source code distro, so Spark can't be built with Maven. It can't be built by {{sbt/sbt}} either, due to another bug (_java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded. IvyNode = org.slf4j#slf4j-api;1.6.1_) (hsn@sanatana:pts/6):work/spark-1.1.0% mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package [INFO] Scanning for projects... [ERROR] The build could not read 1 project - [Help 1] [ERROR] [ERROR] The project org.apache.spark:spark-parent:1.1.0 (/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml) has 4 errors [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/bagel of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/mllib of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/external/flume of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/external/flume-sink/pom.xml of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
[jira] [Commented] (SPARK-3521) Missing modules in 1.1.0 source distribution - cant be build with maven
[ https://issues.apache.org/jira/browse/SPARK-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133413#comment-14133413 ] Sean Owen commented on SPARK-3521: -- https://dist.apache.org/repos/dist/release/spark/spark-1.1.0/spark-1.1.0.tgz All of that source code is plainly in the distribution. It compiles with Maven for me, and this was verified by several people during the release. It sounds like something is quite corrupted about your copy. Missing modules in 1.1.0 source distribution - cant be build with maven --- Key: SPARK-3521 URL: https://issues.apache.org/jira/browse/SPARK-3521 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Radim Kolar Priority: Minor The modules {{bagel}}, {{mllib}}, {{flume-sink}} and {{flume}} are missing from the source code distro, so Spark can't be built with Maven. It can't be built by {{sbt/sbt}} either, due to another bug (_java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded. IvyNode = org.slf4j#slf4j-api;1.6.1_) (hsn@sanatana:pts/6):work/spark-1.1.0% mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package [INFO] Scanning for projects... 
[ERROR] The build could not read 1 project - [Help 1] [ERROR] [ERROR] The project org.apache.spark:spark-parent:1.1.0 (/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml) has 4 errors [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/bagel of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/mllib of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/external/flume of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/external/flume-sink/pom.xml of /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
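The Maven errors above name the exact child-module directories the parent pom expects. Before filing a build bug, a quick check that those directories exist distinguishes a corrupted or incomplete source tree from a real packaging problem. A minimal sketch (`check_modules` is a hypothetical helper, not part of Spark):

```shell
#!/usr/bin/env bash
# Sketch: verify that the child modules listed in the parent pom are
# present on disk before running `mvn ... clean package`. A missing
# directory here means the unpacked source tree is incomplete.
check_modules() {
  local base="$1"; shift
  local missing=0
  for m in "$@"; do
    if [ ! -d "$base/$m" ]; then
      echo "missing module directory: $base/$m" >&2
      missing=1
    fi
  done
  return $missing
}

# Example (paths from the report above):
# check_modules /home/hsn/myports/spark11/work/spark-1.1.0 \
#   bagel mllib external/flume external/flume-sink \
#   || echo "source tree incomplete; re-download and verify the tarball"
```

If any directory is missing, re-downloading the release tarball and verifying its published checksum is the likely fix, which matches the conclusion in the comment above.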
[jira] [Updated] (SPARK-1449) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1449: --- Affects Version/s: (was: 1.0.2) Please delete old releases from mirroring system Key: SPARK-1449 URL: https://issues.apache.org/jira/browse/SPARK-1449 Project: Spark Issue Type: Task Affects Versions: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1 Reporter: Sebb Assignee: Patrick Wendell To reduce the load on the ASF mirrors, projects are required to delete old releases [1] Please can you remove all non-current releases? Thanks! [Note that older releases are always available from the ASF archive server] Any links to older releases on download pages should first be adjusted to point to the archive server. [1] http://www.apache.org/dev/release.html#when-to-archive
[jira] [Updated] (SPARK-1449) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1449: --- Fix Version/s: 0.8.1 0.9.1 1.0.0 1.0.1 0.9.2 Please delete old releases from mirroring system Key: SPARK-1449 URL: https://issues.apache.org/jira/browse/SPARK-1449 Project: Spark Issue Type: Task Affects Versions: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1 Reporter: Sebb Assignee: Patrick Wendell Fix For: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1 To reduce the load on the ASF mirrors, projects are required to delete old releases [1] Please can you remove all non-current releases? Thanks! [Note that older releases are always available from the ASF archive server] Any links to older releases on download pages should first be adjusted to point to the archive server. [1] http://www.apache.org/dev/release.html#when-to-archive
[jira] [Resolved] (SPARK-1449) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1449. Resolution: Fixed I've left 1.0.2 and 1.1.0, since 1.1.0 is an unstable release. Please delete old releases from mirroring system Key: SPARK-1449 URL: https://issues.apache.org/jira/browse/SPARK-1449 Project: Spark Issue Type: Task Affects Versions: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1 Reporter: Sebb Assignee: Patrick Wendell To reduce the load on the ASF mirrors, projects are required to delete old releases [1] Please can you remove all non-current releases? Thanks! [Note that older releases are always available from the ASF archive server] Any links to older releases on download pages should first be adjusted to point to the archive server. [1] http://www.apache.org/dev/release.html#when-to-archive
[jira] [Created] (SPARK-3522) Make spark-ec2 verbosity configurable
Nicholas Chammas created SPARK-3522: --- Summary: Make spark-ec2 verbosity configurable Key: SPARK-3522 URL: https://issues.apache.org/jira/browse/SPARK-3522 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor When launching a cluster, {{spark-ec2}} spits out a lot of stuff that feels like debug output. It would be better for the user if {{spark-ec2}} did the following: * default to info output level * allow option to increase verbosity and include debug output This will require converting most of the {{print}} statements in the script to use Python's {{logging}} module.
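The actual change would use Python's {{logging}} module inside the spark-ec2 script; the proposed behavior (info by default, debug only on request) can be illustrated with a small shell sketch, where `VERBOSE` stands in for a hypothetical `--verbose` option:

```shell
#!/usr/bin/env bash
# Illustration only: default to info-level output, print debug output
# only when the user asks for more verbosity (VERBOSE=1).
VERBOSE=${VERBOSE:-0}

log_info()  { echo "INFO: $*"; }
log_debug() { [ "$VERBOSE" -ge 1 ] && echo "DEBUG: $*"; return 0; }

log_info "Launching cluster..."        # always shown
log_debug "raw API response: ..."      # shown only when VERBOSE=1
```

The same two-level split maps directly onto `logging.INFO` and `logging.DEBUG` once the `print` statements are converted.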
[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...
[ https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133444#comment-14133444 ] Apache Spark commented on SPARK-2594: - User 'ravipesala' has created a pull request for this issue: https://github.com/apache/spark/pull/2390 Add CACHE TABLE name AS SELECT ... Key: SPARK-2594 URL: https://issues.apache.org/jira/browse/SPARK-2594 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical
[jira] [Created] (SPARK-3523) GraphX graph partitioning strategy
Larry Xiao created SPARK-3523: - Summary: GraphX graph partitioning strategy Key: SPARK-3523 URL: https://issues.apache.org/jira/browse/SPARK-3523 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.2 Reporter: Larry Xiao We implemented several partitioning algorithms on GraphX and evaluated them, and found that the partitioning has room for improvement. We are seeking opinions and advice. H5. Motivation * Graphs in the real world follow a power law, e.g. on Twitter 1% of the vertices are adjacent to nearly half of the edges. * For a high-degree vertex, one vertex concentrates vast resources, so the workload of the few high-degree vertices should be decomposed across all machines. * For a low-degree vertex, the computation on one vertex is quite small, so the locality of computation on low-degree vertices should be exploited. H5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png * Greedy BiCut * a heuristic algorithm for bipartite graphs H5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png H5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, we need to think of a way to support access to the graph. H5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png! * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png! h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. 
On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs GraphX graph partitioning strategy -- Key: SPARK-3523 URL: https://issues.apache.org/jira/browse/SPARK-3523 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.2 Reporter: Larry Xiao We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. 
So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png! * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png! h5. Code *
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. H5. Motivation * Graph in real world follow power law. Eg. 
On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. H5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png * Greedy BiCut * a heuristic algorithm for bipartite H5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png H5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. H5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs GraphX graph partitioning strategy -- Key: SPARK-3523 URL: https://issues.apache.org/jira/browse/SPARK-3523 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.2 Reporter: Larry Xiao We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. 
So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png h5. Code *
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description:
We implemented several partitioning algorithms for GraphX and evaluated them. We found that the partitioning leaves room for improvement, and we are seeking opinions and advice.
h5. Motivation
* Real-world graphs follow a power law; e.g. on Twitter, 1% of the vertices are adjacent to nearly half of the edges.
* A high-degree vertex concentrates vast resources on one machine, so the workload of the few high-degree vertices should be decomposed across all machines.
* The computation on a single low-degree vertex is quite small, so the locality of computation on low-degree vertices should be exploited.
h5. Algorithm Description
* HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut: a heuristic algorithm for bipartite graphs
h5. Result
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=600! The left Y axis is the replication factor; the right axis is the balance (measured by CV, the coefficient of variation) of either vertices or edges across all partitions. Edge balance reflects computation balance; vertex balance reflects communication balance.
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! An example of the communication saved by a balanced partitioning.
* !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! A simple partitioning result of BiCut. in-2.0-1m is a generated power-law graph with alpha equal to 2.0.
h5. Code
* https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation provided by PartitionStrategy.scala, we need to think of a way to give partitioning strategies access to the graph.
h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning
- PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs

GraphX graph partitioning strategy
--
Key: SPARK-3523 URL: https://issues.apache.org/jira/browse/SPARK-3523 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.2 Reporter: Larry Xiao
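The differentiated treatment of high- and low-degree vertices that HybridCut describes can be sketched as follows. This is a minimal, illustrative Python sketch, not the GraphX implementation; the degree threshold and hash-based placement are assumptions about how such a scheme is typically realized (cf. the hybrid-cut idea in PowerLyra):

```python
from collections import Counter

def hybrid_cut(edges, num_parts, threshold):
    """Assign each directed edge (src, dst) to a partition.

    In-edges of a low-degree destination are co-located by destination
    (exploiting locality); in-edges of a high-degree destination are
    spread across partitions by source (decomposing the workload).
    """
    in_degree = Counter(dst for _, dst in edges)
    assignment = {}
    for src, dst in edges:
        if in_degree[dst] <= threshold:
            # low-degree vertex: keep all its in-edges together
            assignment[(src, dst)] = hash(dst) % num_parts
        else:
            # high-degree vertex: scatter its in-edges by source
            assignment[(src, dst)] = hash(src) % num_parts
    return assignment
```

The single threshold is what makes the cut "hybrid": below it the scheme behaves like an edge grouping by destination, above it like a vertex-cut that replicates the hot vertex across machines.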
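The metrics reported in the SPARK-3523 evaluation, replication factor and balance measured by CV (coefficient of variation, i.e. standard deviation divided by mean), can be computed with a short sketch like the following. This is illustrative code, not from the Spark code base; the partition-to-edge-list dict layout is an assumption:

```python
import statistics

def replication_factor(partition_edges):
    """Average number of partitions each vertex is replicated to.

    partition_edges: dict mapping partition id -> list of (src, dst) edges.
    """
    parts_per_vertex = {}
    for pid, edges in partition_edges.items():
        for src, dst in edges:
            for v in (src, dst):
                parts_per_vertex.setdefault(v, set()).add(pid)
    total = sum(len(ps) for ps in parts_per_vertex.values())
    return total / len(parts_per_vertex)

def cv(counts):
    """Coefficient of variation of per-partition counts (pstdev / mean)."""
    return statistics.pstdev(counts) / statistics.mean(counts)
```

A lower replication factor means less vertex mirroring (less synchronization traffic); a CV near zero over per-partition edge counts means balanced computation, and over per-partition vertex counts, balanced communication.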
[jira] [Created] (SPARK-3524) remove workaround to pickle array of float for Pyrolite
Davies Liu created SPARK-3524: - Summary: remove workaround to pickle array of float for Pyrolite Key: SPARK-3524 URL: https://issues.apache.org/jira/browse/SPARK-3524 Project: Spark Issue Type: Improvement Reporter: Davies Liu Once Pyrolite releases a new version that includes PR https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround introduced in PR https://github.com/apache/spark/pull/2365
[jira] [Updated] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3039: --- Assignee: Bertrand Bossy Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API -- Key: SPARK-3039 URL: https://issues.apache.org/jira/browse/SPARK-3039 Project: Spark Issue Type: Bug Components: Build, Input/Output, Spark Core Affects Versions: 0.9.1, 1.0.0, 1.1.0 Environment: hadoop2, hadoop-2.4.0, HDP-2.1 Reporter: Bertrand Bossy Assignee: Bertrand Bossy The Spark assembly contains the artifact org.apache.avro:avro-mapred as a dependency of org.spark-project.hive:hive-serde. The avro-mapred package provides a Hadoop FileInputFormat for reading and writing Avro files. There are two versions of this package, distinguished by a classifier: avro-mapred for the new Hadoop API uses the classifier hadoop2, while avro-mapred for the old Hadoop API uses no classifier. For example, when reading Avro files using
{code}
sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
{code}
the following error occurs:
{code}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} This error usually is a hint that there was a mix up of the old and the new Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to appear before the version that is bundled with Spark, reading avro files works fine. Also, if Spark is built using avro-mapred for hadoop2, it works fine as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
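The classpath workaround described in the report can be sketched as a Maven dependency declaration that pins the hadoop2 flavour of avro-mapred in the consuming project. The artifact coordinates and classifier are from the issue; the version number and placement shown here are illustrative assumptions, not taken from the report:

{code}
<!-- Hypothetical pom.xml fragment: declare the hadoop2-classified avro-mapred
     explicitly so it wins over the unclassified (old-API) version pulled in
     transitively via hive-serde. The version shown is an assumption. -->
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.6</version>
  <classifier>hadoop2</classifier>
</dependency>
{code}

Because Maven resolves the nearest declaration first, a direct dependency like this takes precedence over the transitive, unclassified one.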
[jira] [Resolved] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-3039.
------------------------------------
Resolution: Fixed
Fix Version/s: 1.2.0, 1.1.1
Target Version/s: 1.1.1, 1.2.0

Resolved by: https://github.com/apache/spark/pull/1945

Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
----------------------------------------------------------------------------------

Key: SPARK-3039
URL: https://issues.apache.org/jira/browse/SPARK-3039
Project: Spark
Issue Type: Bug
Components: Build, Input/Output, Spark Core
Affects Versions: 0.9.1, 1.0.0, 1.1.0
Environment: hadoop2, hadoop-2.4.0, HDP-2.1
Reporter: Bertrand Bossy
Assignee: Bertrand Bossy
Fix For: 1.1.1, 1.2.0
[jira] [Resolved] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-3452.
------------------------------------
Resolution: Fixed

Fixed by: https://github.com/apache/spark/pull/2329

Maven build should skip publishing artifacts people shouldn't depend on
-----------------------------------------------------------------------

Key: SPARK-3452
URL: https://issues.apache.org/jira/browse/SPARK-3452
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.0.0, 1.1.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical

I think this is easy to do by just adding a skip configuration somewhere. We shouldn't be publishing the repl, yarn, assembly, tools, repl-bin, or examples modules.
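A minimal sketch of such a skip configuration, using the standard maven-deploy-plugin's skip flag; the choice of plugin and the example module are assumptions for illustration, not taken from the actual fix in the linked pull request:

{code}
<!-- Hypothetical fragment for the pom.xml of a module that should not be
     published (e.g. repl): maven-deploy-plugin's skip flag suppresses
     deployment of this module's artifacts during mvn deploy. -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-deploy-plugin</artifactId>
      <configuration>
        <skip>true</skip>
      </configuration>
    </plugin>
  </plugins>
</build>
{code}

With this in place the module still builds and installs locally, but mvn deploy skips uploading it to the remote repository.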