[GitHub] spark pull request: [SPARK-10299][ML] word2vec should allow users ...

2015-10-01 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/8513#issuecomment-144873979 @holdenk LGTM. The reason to make the window size constant is that the window size does not affect the result too much given a large corpus. --- If your project

[GitHub] spark pull request: [SPARK-2213][SQL] Sort Merge Join

2014-11-08 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/3173 [SPARK-2213][SQL] Sort Merge Join This PR adds a MergeJoin operator to Spark SQL. The semantics of the MergeJoin operator are similar to Hive's sort merge bucket join. MergeJoin operator

[GitHub] spark pull request: [SPARK-2213][SQL] Sort Merge Join

2014-11-08 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/3173#discussion_r20056877 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -84,6 +84,10 @@ private[sql] abstract class SparkStrategies

[GitHub] spark pull request: [WIP][SQL][SPARK-3839] Reimplement Left/Right ...

2014-10-29 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2723#issuecomment-60881119 @marmbrus All test failures have the same pattern: select * from a right outer join b on condition1 join c on condition2. With the extra join

[GitHub] spark pull request: [WIP][SQL][SPARK-3839] Reimplement Left/Right ...

2014-10-28 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2723#issuecomment-60861600 test this please

[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-21 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2866#issuecomment-59883834 @JoshRosen I have been looking into the compressed bitmap and already have a good idea of how to use roaring bitmap to perform the task. If this work is not urgent, can

[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-21 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2866#issuecomment-59979144 @JoshRosen Thank you.

[GitHub] spark pull request: Remove Bytecode Inspection for Join Eliminatio...

2014-10-15 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2815#discussion_r18917918 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Graph.scala --- @@ -195,6 +195,12 @@ abstract class Graph[VD: ClassTag, ED: ClassTag] protected

[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2819#discussion_r18932327 --- Diff: python/pyspark/mllib/feature.py --- @@ -95,90 +360,46 @@ class Word2Vec(object): sentence = a b * 100 + a c * 10 localDoc

[GitHub] spark pull request: [WIP][SQL][SPARK-3839] Reimplement Left/Right ...

2014-10-14 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2723#issuecomment-59001540 test this please

[GitHub] spark pull request: [SQL]Small bug in unresolved.scala

2014-10-10 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/2758 [SQL]Small bug in unresolved.scala name should throw exception with name instead of exprId. You can merge this pull request into a Git repository by running: $ git pull https://github.com

[GitHub] spark pull request: [SQL]Small bug in unresolved.scala

2014-10-10 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2758#issuecomment-58731952 this is ok to test.

[GitHub] spark pull request: [SQL][Doc] Keep Spark SQL README.md up to date

2014-10-08 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/2706 [SQL][Doc] Keep Spark SQL README.md up to date @marmbrus Update README.md to be consistent with Spark 1.1 You can merge this pull request into a Git repository by running: $ git pull

[GitHub] spark pull request: [WIP][SQL][SPARK-3839] Reimplement Left/Right ...

2014-10-08 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/2723 [WIP][SQL][SPARK-3839] Reimplement Left/Right outer join This is a work-in-progress PR. This PR reimplements Left/Right outer join using only one hash table. You can merge this pull request

[GitHub] spark pull request: [WIP][SQL][SPARK-3839] Reimplement Left/Right ...

2014-10-08 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2723#issuecomment-58448551 This depends on https://github.com/apache/spark/pull/2719

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-10-07 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-58252347 test this please

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-10-07 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-58270419 test this please

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-10-07 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-58271152 test this please

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-10-07 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-58271779 retest this please

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-10-06 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-58119086 @mengxr will take care of that and other comments

[GitHub] spark pull request: [SPARK-3366][MLLIB]Compute best splits distrib...

2014-10-01 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2595#issuecomment-57426079 @chouqin You can run SPARK_TESTING=1 ./bin/pyspark python/pyspark/my_file.py to run unit tests for a certain file. In your case, use SPARK_TESTING=1 ./bin/pyspark

[GitHub] spark pull request: [SPARK-3613] Record only average block size in...

2014-09-30 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2470#issuecomment-57274044 @rxin I looked through Roaring bitmap, and it is a highly compressed bitmap compared with other bitmap implementations. I will start working on this and keep you

[GitHub] spark pull request: Add more debug message for ManagedBuffer

2014-09-29 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2580#discussion_r18173321 --- Diff: core/src/main/scala/org/apache/spark/network/ManagedBuffer.scala --- @@ -71,6 +73,14 @@ final class FileSegmentManagedBuffer(val file: File, val

[GitHub] spark pull request: Add more debug message for ManagedBuffer

2014-09-29 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2580#discussion_r18173898 --- Diff: core/src/main/scala/org/apache/spark/network/ManagedBuffer.scala --- @@ -71,6 +73,14 @@ final class FileSegmentManagedBuffer(val file: File, val

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-27 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18122598 --- Diff: python/pyspark/mllib/Word2Vec.py --- @@ -0,0 +1,124 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-27 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18122597 --- Diff: python/pyspark/mllib/Word2Vec.py --- @@ -0,0 +1,124 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-27 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-57046286 @mengxr Repartition is very slow when caching on the Python side. It takes 9 minutes to do the repartition, whereas caching in Java takes only 5s.

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18105434 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18105459 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18105497 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18106160 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18106192 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18106266 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18106416 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18106472 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18106585 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18106962 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18107064 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18107119 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/regression/StochasticGradientBoostingSuite.scala --- @@ -0,0 +1,44 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18107150 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/regression/StochasticGradientBoostingSuite.scala --- @@ -0,0 +1,44 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18107249 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/regression/StochasticGradientBoostingSuite.scala --- @@ -0,0 +1,44 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18107318 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2394#discussion_r18107306 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala --- @@ -0,0 +1,173 @@ +package

[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2394#issuecomment-57006487 @mengxr @epahomov Added some comments after quickly going through the code. Will take a deeper look at the algorithm later.

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18117490 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18117593 --- Diff: python/pyspark/mllib/Word2Vec.py --- @@ -0,0 +1,123 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18117584 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18117604 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18117608 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18117647 --- Diff: python/pyspark/mllib/Word2Vec.py --- @@ -0,0 +1,123 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18118109 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18120761 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: SPARK-CORE [SPARK-3651] Group common CoarseGra...

2014-09-25 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2533#discussion_r18026314 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -62,15 +62,9 @@ class

[GitHub] spark pull request: SPARK-CORE [SPARK-3651] Group common CoarseGra...

2014-09-25 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2533#discussion_r18026563 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -85,16 +79,18 @@ class

[GitHub] spark pull request: SPARK-CORE [SPARK-3651] Group common CoarseGra...

2014-09-25 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2533#discussion_r18026675 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -104,13 +100,15 @@ class

[GitHub] spark pull request: SPARK-CORE [SPARK-3651] Group common CoarseGra...

2014-09-25 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2533#discussion_r18026654 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -85,16 +79,18 @@ class

[GitHub] spark pull request: SPARK-CORE [SPARK-3651] Group common CoarseGra...

2014-09-25 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2533#discussion_r18026763 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -126,8 +124,8 @@ class

[GitHub] spark pull request: SPARK-CORE [SPARK-3651] Group common CoarseGra...

2014-09-25 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2533#discussion_r18026850 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -149,13 +147,14 @@ class

[GitHub] spark pull request: SPARK-CORE [SPARK-3651] Group common CoarseGra...

2014-09-25 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2533#discussion_r18026933 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -179,25 +178,22 @@ class

[GitHub] spark pull request: SPARK-CORE [SPARK-3651] Group common CoarseGra...

2014-09-25 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2533#discussion_r18027054 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -179,25 +178,22 @@ class

[GitHub] spark pull request: SPARK-CORE [SPARK-3651] Group common CoarseGra...

2014-09-25 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2533#discussion_r18027094 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -179,25 +178,22 @@ class

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-25 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-56869195 @mengxr PR updated to use new pickle SerDe.

[GitHub] spark pull request: SPARK-3642. Document the nuances of shared var...

2014-09-24 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2490#discussion_r17959346 --- Diff: docs/programming-guide.md --- @@ -1121,6 +1121,11 @@ than shipping a copy of it with tasks. They can be used, for example, to give ev large

[GitHub] spark pull request: SPARK-3642. Document the nuances of shared var...

2014-09-24 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2490#discussion_r17959656 --- Diff: docs/programming-guide.md --- @@ -1183,6 +1188,10 @@ running on the cluster can then add to it using the `add` method or the `+=` ope

[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...

2014-09-23 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2494#discussion_r17930496 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -60,13 +70,16 @@ class IDF { private object IDF

[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...

2014-09-23 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56571123 @rnowling LGTM in general. Some comments on style.

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-22 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-56420682 We need to modify the implementation to use the new SerDe mechanism. Working on that now.

[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...

2014-09-22 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2494#discussion_r17878647 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -123,7 +134,17 @@ private object IDF { val inv = new Array[Double

[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...

2014-09-22 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2494#discussion_r17879054 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala --- @@ -54,4 +54,38 @@ class IDFSuite extends FunSuite with LocalSparkContext

[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...

2014-09-22 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56445953 One question: with this parameter set, it also filters out words that are very important to some documents. Say, if some word occurs many times in 1 or 2 documents

[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...

2014-09-22 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2494#discussion_r17880822 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -123,7 +134,17 @@ private object IDF { val inv = new Array[Double

[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...

2014-09-22 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2494#issuecomment-56461303 @rnowling Please run sbt/sbt scalastyle on your local machine to clear out style issues.

[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...

2014-09-22 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/2494#discussion_r17886499 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -30,9 +30,20 @@ import org.apache.spark.rdd.RDD * Inverse document

[GitHub] spark pull request: [SPARK-3613] Record only average block size in...

2014-09-21 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2470#issuecomment-56292536 @rxin I am definitely interested in working on adding compressed bitmap. What is the first step? Thanks.

[GitHub] spark pull request: [SPARK-3613] Record only average block size in...

2014-09-21 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2470#issuecomment-56310242 @rxin @lemire Starting to look at Roaring.

[GitHub] spark pull request: [SPARK-3613] Record only average block size in...

2014-09-20 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2470#issuecomment-56277206 @rxin my understanding is that MapStatus is used to check whether a map output file contains data for a certain reducer. Why do we use actual size instead of a boolean

[GitHub] spark pull request: [SPARK-3613] Record only average block size in...

2014-09-20 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2470#issuecomment-56277955 Thanks for the reply. Another question: in hash shuffle write, the data may be skewed across different map output files. In some cases, the reducer may try to fetch

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-11 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/2356 [SPARK-3486][MLlib][PySpark] PySpark support for Word2Vec @mengxr Added PySpark support for Word2Vec Change list (1) PySpark support for Word2Vec (2) SerDe support of string

[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

2014-08-19 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/1871#issuecomment-52702675 @mateiz This is taken care of by https://github.com/apache/spark/pull/1932 and is already merged in master and 1.1. In that PR, the model output by each partition

[GitHub] spark pull request: [MLLIB] minor update to word2vec

2014-08-19 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2043#issuecomment-52704090 Looks good to me.

[GitHub] spark pull request: [SPARK-3142][MLLIB] output shuffle data direct...

2014-08-19 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2049#issuecomment-52724285 Good point. This reduces the need for a temp object to store the output model. Although None is output, it is a much smaller object compared with the vector

[GitHub] spark pull request: [MLlib] Remove transform(dataset: RDD[String])...

2014-08-18 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/2010 [MLlib] Remove transform(dataset: RDD[String]) from Word2Vec public API @mengxr Remove transform(dataset: RDD[String]) from public API. You can merge this pull request into a Git repository

[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

2014-08-18 Thread Ishiihara
Github user Ishiihara closed the pull request at: https://github.com/apache/spark/pull/1871

[GitHub] spark pull request: [SPARK-2842][MLlib]Word2Vec documentation

2014-08-17 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/2003 [SPARK-2842][MLlib]Word2Vec documentation Documentation for Word2Vec You can merge this pull request into a Git repository by running: $ git pull https://github.com/Ishiihara/spark Word2Vec

[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

2014-08-14 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/1932#discussion_r16222465 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -284,16 +284,15 @@ class Word2Vec extends Serializable with Logging

[GitHub] spark pull request: [SPARK-2907][MLlib] Word2Vec performance impro...

2014-08-13 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/1932 [SPARK-2907][MLlib] Word2Vec performance improve @mengxr Please review the code. Adding weights in reduceByKey soon. Only output model entries for words that appeared in the partition before

[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

2014-08-13 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/1932#discussion_r16222127 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -284,16 +284,15 @@ class Word2Vec extends Serializable with Logging

[GitHub] spark pull request: [MLlib] Correctly set vectorSize and alpha

2014-08-12 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/1900 [MLlib] Correctly set vectorSize and alpha You can merge this pull request into a Git repository by running: $ git pull https://github.com/Ishiihara/spark Word2Vec-bugfix Alternatively you

[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

2014-08-10 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/1871#issuecomment-51724432 @mengxr Some benchmark results. Environment: OS X 10.9, 8 GB memory, 2.5 GHz i5 CPU, 4 threads; startingAlpha = 0.0025; vectorSize = 100; driver memory 2g

[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

2014-08-09 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/1871 [SPARK-2907] [MLlib] Use mutable.HashMap to represent model in Word2Vec Change list: 1. Used mutable.HashMap to represent syn0Global and syn1Global to reduce shuffle size. 2. Introduced
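
A rough illustration of item 1 in the change list above, written in plain Python rather than Spark/Scala. It assumes each partition emits only the model rows it actually touched, keyed by word index, and that the rows are then combined per key (standing in for a reduceByKey step). The function name `merge_updates`, the sample data, and the choice to average rows are all hypothetical and are not taken from the pull request.

```python
from collections import defaultdict


def merge_updates(per_partition):
    """Combine sparse per-partition row updates by averaging each row.

    per_partition: list of dicts mapping word index -> vector (list of floats).
    Averaging is an assumption made for this sketch, not the PR's confirmed rule.
    """
    sums = {}
    counts = defaultdict(int)
    for updates in per_partition:
        for idx, vec in updates.items():
            if idx in sums:
                # Element-wise sum with the row already seen for this index.
                sums[idx] = [a + b for a, b in zip(sums[idx], vec)]
            else:
                sums[idx] = list(vec)
            counts[idx] += 1
    # Divide each summed row by the number of partitions that emitted it.
    return {idx: [x / counts[idx] for x in vec] for idx, vec in sums.items()}


# Hypothetical data: partition 1 touched rows 0 and 3, partition 2 only row 0.
p1 = {0: [1.0, 2.0], 3: [0.5, 0.5]}
p2 = {0: [3.0, 4.0]}
merged = merge_updates([p1, p2])
```

Because untouched rows never appear in the per-partition maps, the amount of data exchanged scales with the words seen in each partition rather than with the full vocabulary, which is the shuffle-size saving the change list describes.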

[GitHub] spark pull request: [SPARK-2864][MLLIB] fix random seed in word2ve...

2014-08-05 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/1790#issuecomment-51234797 @mengxr LGTM. We may need a better implementation of TopK. It is also worth trying to change the starting alpha in each iteration.

[GitHub] spark pull request: [MLlib] [SPARK-2510]Word2Vec: Distributed Repr...

2014-08-03 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/1719#discussion_r15741135 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala --- @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [MLlib] word2vec: Distributed Representation o...

2014-08-01 Thread Ishiihara
GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/1719 [MLlib] word2vec: Distributed Representation of Words Vector representation of words. This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510 You can

[GitHub] spark pull request: [MLlib] [SPARK-2510]word2vec: Distributed Repr...

2014-08-01 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/1719#issuecomment-50904281 @mengxr code format done. Working on test case of algorithm.

[GitHub] spark pull request: [MLlib] [SPARK-2510]word2vec: Distributed Repr...

2014-08-01 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request: https://github.com/apache/spark/pull/1719#discussion_r15723320 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -0,0 +1,375 @@ +/* +* Licensed to the Apache Software

[GitHub] spark pull request: [MLlib] [SPARK-2510]word2vec: Distributed Repr...

2014-08-01 Thread Ishiihara
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/1719#issuecomment-50949833 @mengxr The results for 4 and 10 partitions make sense, but the result for 100 partitions doesn't. Made changes according to review except the random seed