[jira] [Created] (SPARK-28446) Document Kafka Headers support

2019-07-19 Thread Lee Dongjin (JIRA)
Lee Dongjin created SPARK-28446:
---

 Summary: Document Kafka Headers support
 Key: SPARK-28446
 URL: https://issues.apache.org/jira/browse/SPARK-28446
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Structured Streaming
Affects Versions: 3.0.0
Reporter: Lee Dongjin


This issue is a follow-up of SPARK-23539.

After completing SPARK-23539, the following information about the headers 
functionality should be documented in the Structured Streaming + Kafka 
Integration Guide (see the sketch below):
 * The requirements for using the headers functionality (i.e., the required Kafka version).
 * How to turn on the headers functionality.
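
For illustration, a minimal sketch of the usage such documentation might show, 
assuming the `includeHeaders` option introduced by SPARK-23539 and an existing 
SparkSession `spark`:

{code}
// Sketch: reading Kafka records together with their headers in Structured
// Streaming. Record headers require a Kafka version that supports them
// (0.11.0.0 or later).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()

// With headers enabled, each row carries a `headers` column: an array of
// (key, value) pairs with string keys and binary values.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
{code}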






[jira] [Commented] (SPARK-11107) spark.ml should support more input column types: umbrella

2018-06-14 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512162#comment-16512162
 ] 

Lee Dongjin commented on SPARK-11107:
-

[~josephkb] Excuse me. Is there any reason this issue is still open?

> spark.ml should support more input column types: umbrella
> -
>
> Key: SPARK-11107
> URL: https://issues.apache.org/jira/browse/SPARK-11107
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for expanding the set of data types which spark.ml 
> Pipeline stages can take.  This should not involve breaking APIs, but merely 
> involve slight changes such as supporting all Numeric types instead of just 
> Double.
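
For illustration, a minimal sketch of the pattern this umbrella implies: accept 
any NumericType input column and cast it internally, instead of requiring 
Double up front. The DataFrame and column name below are assumptions.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast a numeric input column to DoubleType so downstream stages can keep
// operating on Double without breaking the public API.
def withDoubleColumn(df: DataFrame, colName: String): DataFrame =
  df.withColumn(colName, col(colName).cast(DoubleType))
{code}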






[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-14 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512159#comment-16512159
 ] 

Lee Dongjin commented on SPARK-24530:
-

OMG, I am sorry; I misunderstood. The documentation is also broken in my 
environment.

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached 
> screenshots from the Spark 2.3 docs and from master docs generated on my 
> local machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]






[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2018-06-14 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512153#comment-16512153
 ] 

Lee Dongjin commented on SPARK-4591:


[~josephkb] Excuse me. Since SPARK-14376 was resolved recently, I think we 
should resolve this issue as well.

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve critical 
> feature parity for the next release.
> h3. Instructions for 3 subtask types
> *Review tasks*: detailed review of a subpackage to identify feature gaps 
> between spark.mllib and spark.ml.
> * Should be listed as a subtask of this umbrella.
> * Review subtasks cover major algorithm groups.  To pick up a review subtask, 
> please:
> ** Comment that you are working on it.
> ** Compare the public APIs of spark.ml vs. spark.mllib.
> ** Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> ** Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> *Critical tasks*: higher priority missing features which are required for 
> this umbrella JIRA.
> * Should be linked as "requires" links.
> *Other tasks*: lower priority missing features which can be completed after 
> the critical tasks.
> * Should be linked as "contains" links.
> h4. Excluded items
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * Moving linalg to spark.ml: [SPARK-13944]
> * Streaming ML: Requires stabilizing some internal APIs of structured 
> streaming first
> h3. TODO list
> *Critical issues*
> * [SPARK-14501]: Frequent Pattern Mining
> * [SPARK-14709]: linear SVM
> * [SPARK-15784]: Power Iteration Clustering (PIC)
> *Lower priority issues*
> * Missing methods within algorithms (see Issue Links below)
> * evaluation submodule
> * stat submodule (should probably be covered in DataFrames)
> * Developer-facing submodules:
> ** optimization (including [SPARK-17136])
> ** random, rdd
> ** util
> *To be prioritized*
> * single-instance prediction: [SPARK-10413]
> * pmml [SPARK-11171]






[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-12 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510570#comment-16510570
 ] 

Lee Dongjin commented on SPARK-24530:
-

In my case, it works correctly on current master (commit 9786ce6). My 
environment is Ubuntu 18.04, Python 2.7, and Sphinx 1.7.5.

!pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png! 

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached 
> screenshots from the Spark 2.3 docs and from master docs generated on my 
> local machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?






[jira] [Updated] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-12 Thread Lee Dongjin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lee Dongjin updated SPARK-24530:

Attachment: pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached 
> screenshots from the Spark 2.3 docs and from master docs generated on my 
> local machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?






[jira] [Commented] (SPARK-15064) Locale support in StopWordsRemover

2018-06-12 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509799#comment-16509799
 ] 

Lee Dongjin commented on SPARK-15064:
-

[~mengxr] Hello. Please assign this issue to me.

> Locale support in StopWordsRemover
> --
>
> Key: SPARK-15064
> URL: https://issues.apache.org/jira/browse/SPARK-15064
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Major
> Fix For: 2.4.0
>
>
> We support case insensitive filtering (default) in StopWordsRemover. However, 
> case insensitive matching depends on the locale and region, which cannot be 
> explicitly set in StopWordsRemover. We should consider adding this support in 
> MLlib.
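
As a small editor's illustration (not from the issue) of why the locale 
matters for case-insensitive matching: lowercasing is locale-dependent.

{code}
import java.util.Locale

// The classic example: Turkish distinguishes dotted and dotless i.
"I".toLowerCase(Locale.ENGLISH)              // "i"
"I".toLowerCase(Locale.forLanguageTag("tr")) // "ı" (dotless i)
{code}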






[jira] [Commented] (SPARK-14376) spark.ml parity for trees

2018-06-12 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509658#comment-16509658
 ] 

Lee Dongjin commented on SPARK-14376:
-

[~josephkb] Excuse me. Is there any reason this issue is still open? It seems 
like all sub-issues and linked issues are resolved.

> spark.ml parity for trees
> -
>
> Key: SPARK-14376
> URL: https://issues.apache.org/jira/browse/SPARK-14376
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Created] (SPARK-24513) Attribute support in UnaryTransformer

2018-06-10 Thread Lee Dongjin (JIRA)
Lee Dongjin created SPARK-24513:
---

 Summary: Attribute support in UnaryTransformer
 Key: SPARK-24513
 URL: https://issues.apache.org/jira/browse/SPARK-24513
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.0
Reporter: Lee Dongjin


To make HashingTF extend UnaryTransformer, UnaryTransformer should support 
metadata functionality first. (See [~josephkb]'s comment on SPARK-13998.)
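
For context, a minimal sketch of a UnaryTransformer subclass; MyTokenizer is a 
hypothetical example, and the point is that this API currently declares only 
the output DataType, with no hook for attaching attributes/metadata.

{code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Hypothetical transformer: splits a string column into tokens.
class MyTokenizer(override val uid: String)
    extends UnaryTransformer[String, Seq[String], MyTokenizer] {

  def this() = this(Identifiable.randomUID("myTokenizer"))

  override protected def createTransformFunc: String => Seq[String] =
    _.split("\\s+").toSeq

  // Only the output DataType can be declared here; attribute support would
  // need a new hook at this level.
  override protected def outputDataType: DataType =
    ArrayType(StringType, containsNull = false)
}
{code}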






[jira] [Resolved] (SPARK-19958) Support ZStandard Compression

2017-03-15 Thread Lee Dongjin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lee Dongjin resolved SPARK-19958.
-
Resolution: Duplicate

See: https://issues.apache.org/jira/browse/SPARK-19112

> Support ZStandard Compression
> -
>
> Key: SPARK-19958
> URL: https://issues.apache.org/jira/browse/SPARK-19958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Lee Dongjin
>
> Hadoop[^1] & HBase[^2] started to support ZStandard compression in their 
> recent releases. Supporting this compression codec also requires adding a new 
> configuration for the default compression level, for example, 
> 'spark.io.compression.zstandard.level'.
> [^1]: https://issues.apache.org/jira/browse/HADOOP-13578
> [^2]: https://issues.apache.org/jira/browse/HBASE-16710






[jira] [Created] (SPARK-19958) Support ZStandard Compression

2017-03-15 Thread Lee Dongjin (JIRA)
Lee Dongjin created SPARK-19958:
---

 Summary: Support ZStandard Compression
 Key: SPARK-19958
 URL: https://issues.apache.org/jira/browse/SPARK-19958
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Lee Dongjin


Hadoop[^1] & HBase[^2] started to support ZStandard compression in their 
recent releases. Supporting this compression codec also requires adding a new 
configuration for the default compression level, for example, 
'spark.io.compression.zstandard.level'.

[^1]: https://issues.apache.org/jira/browse/HADOOP-13578
[^2]: https://issues.apache.org/jira/browse/HBASE-16710
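
A sketch of how the proposed setting might be used, following the key name 
suggested above; these keys are the proposal's, not a shipped Spark API:

{code}
import org.apache.spark.SparkConf

// Hypothetical usage of the proposed configuration keys.
val conf = new SparkConf()
  .set("spark.io.compression.codec", "zstd")
  .set("spark.io.compression.zstandard.level", "3") // proposed level knob
{code}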







[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-14 Thread Lee Dongjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923770#comment-15923770
 ] 

Lee Dongjin commented on SPARK-19634:
-

[~timhunter] // It seems like I can help. Please assign it to me.

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208. Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#






[jira] [Commented] (SPARK-17248) Add native Scala enum support to Dataset Encoders

2017-03-09 Thread Lee Dongjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903245#comment-15903245
 ] 

Lee Dongjin commented on SPARK-17248:
-

[~pdxleif] // Although this question may be stale, let me answer.

There are two ways of implementing enum types in Scala. For more information, 
see: 
http://alvinalexander.com/scala/how-to-use-scala-enums-enumeration-examples 
The 'enumeration object' in the tuning guide seems to refer to the second way, 
that is, using sealed traits and case objects.
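
A quick sketch of the two approaches:

{code}
// 1. scala.Enumeration:
object Color extends Enumeration {
  type Color = Value
  val Red, Green, Blue = Value
}

// 2. Sealed trait with case objects (the 'enumeration object' reading):
sealed trait Direction
case object North extends Direction
case object South extends Direction
{code}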

However, it seems like the sentence "Consider using numeric IDs or enumeration 
objects instead of strings for keys." in the tuning guide does not apply to 
Dataset in cases like yours. To my knowledge, Dataset supports only case 
classes built from primitive types, not case classes containing other case 
classes or objects, such as the enum objects here.

If you want a workaround, please check out this example: 
https://github.com/dongjinleekr/spark-dataset/blob/master/src/main/scala/com/github/dongjinleekr/spark/dataset/Titanic.scala
 It shows a Dataset built on one of the public datasets. Please pay attention 
to how I matched the Passenger case class to its corresponding schema type 
for the age, pClass, sex, and embarked fields.
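
The gist of that workaround, as a hypothetical sketch (the names below are 
illustrative, not the repository's exact code): keep the primitive id in the 
case class the Dataset sees, and recover the enum value at the edges.

{code}
object Sex extends Enumeration {
  val Male, Female = Value
}

// The Dataset only ever sees primitive fields; Sex(id) looks the value back up.
case class Passenger(age: Double, sex: Int) {
  def sexValue: Sex.Value = Sex(sex)
}
{code}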

[~srowen] // It seems like lots of users are experiencing similar problems. How 
about repurposing this issue to provide more examples and explanations in the 
official documentation? If needed, I would like to take the issue.

> Add native Scala enum support to Dataset Encoders
> -
>
> Key: SPARK-17248
> URL: https://issues.apache.org/jira/browse/SPARK-17248
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Silvio Fiorito
>
> Enable support for Scala enums in Encoders. Ideally, users should be able to 
> use enums as part of case classes automatically.
> Currently, this code...
> {code}
> object MyEnum extends Enumeration {
>   type MyEnum = Value
>   val EnumVal1, EnumVal2 = Value
> }
> case class MyClass(col: MyEnum.MyEnum)
> val data = Seq(MyClass(MyEnum.EnumVal1), MyClass(MyEnum.EnumVal2)).toDS()
> {code}
> ...results in this stacktrace:
> {code}
> java.lang.UnsupportedOperationException: No Encoder found for MyEnum.MyEnum
> - field (class: "scala.Enumeration.Value", name: "col")
> - root class: 
> "line550c9f34c5144aa1a1e76bcac863244717.$read.$iwC.$iwC.$iwC.$iwC.MyClass"
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:598)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
>   at org.apache.spark.sql.Encoders$.product(Encoders.scala:274)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47)
> {code}






[jira] [Commented] (SPARK-15880) PREGEL Based Semi-Clustering Algorithm Implementation using Spark GraphX API

2017-01-09 Thread Lee Dongjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15811819#comment-15811819
 ] 

Lee Dongjin commented on SPARK-15880:
-

[~sowen] Thanks for your comment. However, I expect that the output will be 
similar to that of the ConnectedComponents or StronglyConnectedComponents 
algorithms in GraphX, which are not RDD-based. Given that, isn't it worth 
adding?
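
To make the analogy concrete, a minimal sketch; `graph` is an assumed 
Graph[VD, ED], and the semiClustering operator is hypothetical:

{code}
// Built-in analogue: connectedComponents annotates each vertex with the
// smallest vertex id in its component.
val components = graph.connectedComponents().vertices

// A semi-clustering operator could take the same graph-native shape,
// returning vertices annotated with their (possibly multiple) clusters:
// val clustered = semiClustering(graph, maxClusters = 2).vertices
{code}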

> PREGEL Based Semi-Clustering Algorithm Implementation using Spark GraphX API
> 
>
> Key: SPARK-15880
> URL: https://issues.apache.org/jira/browse/SPARK-15880
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: R J
>Priority: Minor
> Attachments: pregel_paper.pdf
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The main concepts of the Semi-Clustering algorithm on top of social graphs are:
>  - Vertices in a social graph typically represent people, and edges represent 
> connections between them.
>  - Edges may be based on explicit actions (e.g., adding a friend in a social 
> networking site), or may be inferred from people’s behaviour (e.g., email 
> conversations or co-publication).
>  - Edges may have weights, to represent the interactions frequency or 
> strength.
>  - A semi-cluster in a social graph is a group of people who interact 
> frequently with each other and less frequently with others.
> - What distinguishes it from ordinary clustering is that a vertex may 
> belong to more than one semi-cluster.






[jira] [Commented] (SPARK-15880) PREGEL Based Semi-Clustering Algorithm Implementation using Spark GraphX API

2017-01-05 Thread Lee Dongjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803730#comment-15803730
 ] 

Lee Dongjin commented on SPARK-15880:
-

Hello. It seems like this issue has been abandoned. May I take this? I have an 
implementation of a Semi-clustering algorithm written in Spark, so I can 
improve it for SparkML.

> PREGEL Based Semi-Clustering Algorithm Implementation using Spark GraphX API
> 
>
> Key: SPARK-15880
> URL: https://issues.apache.org/jira/browse/SPARK-15880
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: R J
>Priority: Minor
> Attachments: pregel_paper.pdf
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The main concepts of the Semi-Clustering algorithm on top of social graphs are:
>  - Vertices in a social graph typically represent people, and edges represent 
> connections between them.
>  - Edges may be based on explicit actions (e.g., adding a friend in a social 
> networking site), or may be inferred from people’s behaviour (e.g., email 
> conversations or co-publication).
>  - Edges may have weights, to represent the interactions frequency or 
> strength.
>  - A semi-cluster in a social graph is a group of people who interact 
> frequently with each other and less frequently with others.
> - What distinguishes it from ordinary clustering is that a vertex may 
> belong to more than one semi-cluster.


