[GitHub] spark pull request: Test
GitHub user GenTang opened a pull request:

    https://github.com/apache/spark/pull/7726

Test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/GenTang/spark test

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7726.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7726

commit 3253b61e4a740233154c54458fa68dbec5a2b004
Author: GenTang gen.tan...@gmail.com
Date: 2015-01-07T02:04:13Z

    the modification accompanying the improvement of pythonconverter

commit ceb31c5e1619e56b91a8d03b4edd09c7c17813ce
Author: GenTang gen.tan...@gmail.com
Date: 2015-01-07T02:05:24Z

    the modification for adapting updation of hbase

commit 5cbbcfc3e4034f824ae94e84374b4f505ee15a63
Author: GenTang gen.tan...@gmail.com
Date: 2015-01-07T02:06:00Z

    the improvement of pythonconverter

commit 15b1fe33caf1c2524afed7be6e99a903dfebb356
Author: GenTang gen.tan...@gmail.com
Date: 2015-01-22T23:16:50Z

    the modification of comments

commit 21de6534899cf8d0379f03d46f128cde9600c087
Author: GenTang gen.tan...@gmail.com
Date: 2015-02-14T21:59:59Z

    return the string in json format

commit 62df7f08671c5317fb153e8edf77bf9d9919
Author: GenTang gen.tan...@gmail.com
Date: 2015-02-14T22:15:57Z

    remove the comment

commit 48024815917237ed446c0a778771be101f3e0fc0
Author: GenTang gen.tan...@gmail.com
Date: 2015-02-15T23:25:23Z

    dump the result into a singl String

commit d2153df5f5472f34b427a39e8fd450f7d37d2fc6
Author: GenTang gen.tan...@gmail.com
Date: 2015-02-23T13:39:41Z

    import JSONObject precisely

commit e5fe7d62fb78b94a3e704cd8cb0718ef23c68f06
Author: Gen TANG gen.tan...@gmail.com
Date: 2015-07-28T14:12:57Z

    update

commit 931d3d6029ee0d5515356233ba9808cfd9f5c7f1
Author: Gen TANG gen.tan...@gmail.com
Date: 2015-07-28T14:30:06Z

    test

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Test
Github user GenTang closed the pull request at: https://github.com/apache/spark/pull/7726
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3920#issuecomment-104841817

@davies I just tested it with the assembly built from the master branch; it works. Thanks
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3920#issuecomment-77252529

Hi, I am sorry to bother you all, but is this pull request ready to be merged?
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r25155799

--- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala ---
@@ -18,20 +18,34 @@
 package org.apache.spark.examples.pythonconverters

 import scala.collection.JavaConversions._
+import scala.util.parsing.json._

 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil

 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      Map(
+        "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)),
+        "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
+        "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
+        "timestamp" -> cell.getTimestamp.toString,
+        "type" -> Type.codeToType(cell.getTypeByte).toString,
+        "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell))
+      )
+    )
+    output.map(JSONObject(_).toString()).mkString("\n")
--- End diff --

`output` is a `Buffer[Map[String, String]]`, since there are several records in an HBase Result. However, `JSONObject` has only one constructor, `JSONObject(obj: Map[String, Any])`, so `JSONObject(output).toString()` would fail to compile. ^^
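The newline-joined JSON output discussed above is straightforward to consume on the Python side. A minimal sketch, assuming the converter emits one JSON object per cell joined with "\n" (the field values below are illustrative, not taken from a real HBase table):

```python
import json

# Hypothetical sample of what the converter emits: one JSON object per cell.
converter_output = (
    '{"row" : "r1", "columnFamily" : "f1", "qualifier" : "a", '
    '"timestamp" : "1", "type" : "Put", "value" : "v1"}\n'
    '{"row" : "r1", "columnFamily" : "f1", "qualifier" : "b", '
    '"timestamp" : "1", "type" : "Put", "value" : "v2"}'
)

# Split the joined string back into one dict per HBase cell.
cells = [json.loads(line) for line in converter_output.split("\n")]
```

Since JSON escapes any literal newline inside a value, splitting on "\n" is unambiguous here.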
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r25182187

--- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala ---
@@ -18,20 +18,34 @@
 package org.apache.spark.examples.pythonconverters

 import scala.collection.JavaConversions._
+import scala.util.parsing.json._

 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil

 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      Map(
+        "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)),
+        "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
+        "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
+        "timestamp" -> cell.getTimestamp.toString,
+        "type" -> Type.codeToType(cell.getTypeByte).toString,
+        "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell))
+      )
+    )
+    output.map(JSONObject(_).toString()).mkString("\n")
--- End diff --

Great! In fact, HBase itself escapes `\n` too; that's why I chose `\n` in the first place. Thanks!
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r24727321

--- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala ---
@@ -18,20 +18,34 @@
 package org.apache.spark.examples.pythonconverters

 import scala.collection.JavaConversions._
+import scala.util.parsing.json._

 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil

 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to an Array[String]
  */
-class HBaseResultToStringConverter extends Converter[Any, String] {
-  override def convert(obj: Any): String = {
+class HBaseResultToStringConverter extends Converter[Any, Array[String]] {
--- End diff --

You are right. I had only tested the code in local mode, where it passed; in cluster mode it does not.
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3920#issuecomment-74393987

@davies Now we return the string in JSON format, so special characters such as `'` or `"` no longer cause problems.
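As a hedged illustration of why JSON sidesteps the quoting problem (shown here with Python's json module rather than the Scala JSONObject the converter actually uses):

```python
import json

# A value containing characters that broke the earlier ad-hoc string format.
record = {"qualifier": "a", "value": "it's a \"quoted\" value"}

# json.dumps escapes the special characters, so the string round-trips
# back to the original dict without ambiguity.
serialized = json.dumps(record)
restored = json.loads(serialized)
```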
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3920#issuecomment-73889175

@davies @MLnick Perhaps this is not the right place to discuss it, but I tried the script hbase_outputformat.py in Spark 1.2.0 and it raised `java.lang.IncompatibleClassChangeError: Implementing class`. I guess it is caused by a conflict between the dependencies and the Java version. Do you have any idea about this error? Or should we create a JIRA for it?
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r24495658

--- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala ---
@@ -23,15 +23,27 @@
 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil

 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      "{'columnFamily':'%s','qualifier':'%s','timestamp':'%s','type':'%s','value':'%s'}".format(
--- End diff --

@davies Hi, sorry for the delay. Returning a string in JSON format is now done. However, I am thinking of letting the converter return an `Array[String]`, since there are several records for one row. What we do now is convert the `Buffer[String]` to a `String` in the converter and then reconvert the string to a list (or tuple) in Python. If the converter returned an `Array[String]` directly, there would be two advantages:
1. We would not need the string-to-list conversion in Python.
2. Right now we have to pick a specific character to join the `Buffer[String]` into a `String` (I am thinking of `\n`), so returning an `Array[String]` is safer.

What do you think? Thanks
[GitHub] spark pull request: [SPARK-4983]insert waiting time before tagging...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3986#issuecomment-73311374

@nchammas You are welcome. I am more than happy that this PR could finally be done. ^^
[GitHub] spark pull request: [SPARK-4983]insert waiting time before tagging...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3986#discussion_r24263282

--- Diff: ec2/spark_ec2.py ---
@@ -569,6 +569,9 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)

+    # The insert of waiting time corresponds to issue [SPARK-4983]
--- End diff --

OK
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r24184482

--- Diff: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala ---
@@ -36,7 +36,7 @@ object HBaseTest {
     // Initialize hBase table if necessary
     val admin = new HBaseAdmin(conf)
     if (!admin.isTableAvailable(args(0))) {
-      val tableDesc = new HTableDescriptor(args(0))
+      val tableDesc = new HTableDescriptor(TableName.valueOf(args(0)))
--- End diff --

Yes. The old one is deprecated.
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r24185498

--- Diff: examples/src/main/python/hbase_inputformat.py ---
@@ -72,6 +75,9 @@
         keyConverter=keyConv,
         valueConverter=valueConv,
         conf=conf)
+    # hbase_rdd is a RDD[dict]
--- End diff --

In fact, every dict is one record of HBase. Therefore, with RDD[dict], each element of the RDD corresponds to one record in HBase. If we instead merged all the records of a row into a single dict, it would be impossible to tell which row, which column, etc. a given value belongs to.
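A small hypothetical sketch of the difference being argued here: one dict per cell keeps the row/column context that a single collapsed dict would lose (all names and values below are made up):

```python
# Two cells from the same HBase row, as one dict per cell (illustrative data).
cells = [
    {"row": "r1", "columnFamily": "f1", "qualifier": "a", "value": "v1"},
    {"row": "r1", "columnFamily": "f1", "qualifier": "b", "value": "v2"},
]

# Each value still knows its row, family, and qualifier:
contexts = [(c["row"], c["columnFamily"], c["qualifier"]) for c in cells]

# Collapsing the row into a single {qualifier: value} dict drops that context:
flat = {c["qualifier"]: c["value"] for c in cells}
```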
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r24187737

--- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala ---
@@ -23,15 +23,27 @@
 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil

 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      "{'columnFamily':'%s','qualifier':'%s','timestamp':'%s','type':'%s','value':'%s'}".format(
--- End diff --

Aha... I didn't consider that. You are right, it is not safe to use `ast.literal_eval`. As for your suggestion, do you mean that we transform the string into JSON in Python, or that the converter returns JSON?
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r24188212

--- Diff: examples/src/main/python/hbase_inputformat.py ---
@@ -72,6 +75,9 @@
         keyConverter=keyConv,
         valueConverter=valueConv,
         conf=conf)
+    # hbase_rdd is a RDD[dict]
--- End diff --

I am sorry, I misunderstood you; I thought you meant we should change the RDD.
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r24188381

--- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala ---
@@ -23,15 +23,27 @@
 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil

 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      "{'columnFamily':'%s','qualifier':'%s','timestamp':'%s','type':'%s','value':'%s'}".format(
--- End diff --

Great, I will do that. Thanks
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3986#discussion_r23890734

--- Diff: ec2/spark_ec2.py ---
@@ -569,6 +569,8 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)

+    # Wait for the information of the just-launched instances to be propagated within AWS
+    time.sleep(5)
--- End diff --

@nchammas Maybe we should insert a print here to tell the user that we are waiting for the information to propagate, since there will be five idle seconds between `Launched master in` and `Waiting for cluster to enter`. What do you think?
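The suggestion above could look something like the following sketch; `wait_for_aws_propagation` is a hypothetical helper, not a function that exists in spark_ec2.py:

```python
import time


def wait_for_aws_propagation(seconds=5):
    # Tell the user why the script is idle before tagging the
    # just-launched instances, then sleep for the given time.
    print("Waiting %d seconds for instance information to propagate within AWS..."
          % seconds)
    time.sleep(seconds)
```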
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3986#discussion_r23812644

--- Diff: ec2/spark_ec2.py ---
@@ -569,15 +569,34 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)
--- End diff --

Yeah, it is true. I will make a new commit as soon as possible. I think I will add a 5-second sleep before tagging the master and make some modifications to the comment.
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3986#discussion_r23811498

--- Diff: ec2/spark_ec2.py ---
@@ -569,15 +569,34 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)
--- End diff --

I agree with you. Since we can hit the same AWS information-propagation problem in the `wait_for_cluster_state` function, I think adding a waiting time is the better workaround. The only disadvantage is that launching a cluster takes about 10 seconds longer.
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r23414030

--- Diff: examples/src/main/python/hbase_inputformat.py ---
@@ -16,6 +16,7 @@
 #
 import sys
+import ast

 from pyspark import SparkContext
--- End diff --

Great idea. I will modify the sample data to show multiple records in a columnFamily and columnFamily:qualifier.
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3920#discussion_r23413908

--- Diff: examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala ---
@@ -23,15 +23,27 @@
 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil

 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      "{'columnFamliy':'%s','qualifier':'%s','timestamp':'%s','type':'%s','value':'%s'}".format(
+        Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
+        Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
+        cell.getTimestamp.toString,
+        Type.codeToType(cell.getTypeByte),
+        Bytes.toStringBinary(CellUtil.cloneValue(cell))
+      )
+    )
+    output.mkString(" ")
--- End diff --

In fact, in HBase a *{row, columnFamily:qualifier, version}* tuple exactly specifies a cell (a record), so there are usually several records per column family in a row. An HBase `Result` contains all the information for a row, and `Result.listCells` returns all the cells as a Java list of `hbase.Cell`, each of which carries all the information about one cell.
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3920#issuecomment-71121501

@MLnick I changed the sample data to show that we can have several records in one columnFamily. In fact, in HBase 0.96 and newer, the default maximum number of versions (timestamps) is 1, and in order to show all the versions in an HBase table we need to add *{hbase.mapreduce.scan.maxversions: num}* to the conf. To me this is a topic for HBase rather than Spark, which is why I don't show several versions of a record in `Result`.
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3986#issuecomment-69490428

I understand why the exception error information is still printed even though we catch the exception. It is because we call
```
logging.basicConfig()
```
in the script, so the exception information is added to the log even when it is caught. Maybe it is a stupid question, but why do we use logging in this script at all, since we don't log anything ourselves?
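The mechanism can be demonstrated in isolation: a library that logs an error before raising still produces the log line once a root handler is installed, even when the caller catches the exception. A self-contained sketch (the `botolike` logger is made up; the handler writes to a string buffer instead of stderr so the effect is easy to inspect):

```python
import io
import logging

# A stand-in for a library (like boto) that logs an error before raising.
lib_logger = logging.getLogger("botolike")


def flaky_call():
    lib_logger.error("400 Bad Request")  # emitted via logging, not the exception
    raise ValueError("InvalidInstanceID.NotFound")


# Install a root handler the way logging.basicConfig() does.
buf = io.StringIO()
logging.getLogger().addHandler(logging.StreamHandler(buf))

try:
    flaky_call()
except ValueError:
    pass  # the exception is handled, but the ERROR line was already logged
```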
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3986#issuecomment-69506584

Yeah, thanks for your solution; it works. However, I would prefer to remove the logging package entirely if we don't really use it.
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3986#issuecomment-69456767 As all EC2 errors are thrown as an EC2ResponseError exception, we use error_code to identify the specific "instance not found" error. If EC2 returns an instance-not-found exception, we wait a short time and relaunch the request; otherwise, we re-raise the same error.
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3986#issuecomment-69471651 Yes, I reproduced the InvalidInstanceID.NotFound error by changing the instance ID before the add_tag action and then changing it back to the correct ID. However, it prints the following error information on the screen:
```
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5aa99fb1' does not exist</Message></Error></Errors><RequestID>7bfc5342-1fc9-4431-b569-190b2f9c9e8c</RequestID></Response>
```
It is printed by the boto package.
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3986#discussion_r22761550
```
--- Diff: ec2/spark_ec2.py ---
@@ -569,15 +569,34 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)

-    # Give the instances descriptive names
+    # Give the instances descriptive names.
+    # The code of handling exceptions corresponds to issue [SPARK-4983]
     for master in master_nodes:
-        master.add_tag(
-            key='Name',
-            value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
+        while True:
+            try:
+                master.add_tag(
+                    key='Name',
+                    value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
+            except boto.exception.EC2ResponseError as e:
+                if e.error_code == 'InvalidInstanceID.NotFound':
+                    time.sleep(0.1)
+                else:
+                    raise e
+            else:
+                break
     for slave in slave_nodes:
-        slave.add_tag(
-            key='Name',
-            value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
+        while True:
+            try:
+                slave.add_tag(
+                    key='Name',
+                    value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
+            except boto.exception.EC2ResponseError as e:
+                if e.error_code == 'InvalidInstanceID.NotFound':
+                    time.Sleep(0.1)
```
--- End diff -- Sorry for the typo.
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
GitHub user GenTang reopened a pull request: https://github.com/apache/spark/pull/3986 [SPARK-4983]exception handling about adding tags to EC2 instance As the boto API doesn't support tagging EC2 instances in the same call that launches them, we use exception-handling code to wait until EC2 has had enough time to propagate the information. You can merge this pull request into a Git repository by running: $ git pull https://github.com/GenTang/spark spark-4983 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3986.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3986 commit 6adcf6d28d21c14662d6f92d550614bc4548eb5d Author: GenTang gen.tan...@gmail.com Date: 2015-01-09T23:01:09Z handling exceptions about adding tags to ec2 commit 692fc2b72f19f1b02e710433f3d38388c5cc3a55 Author: GenTang gen.tan...@gmail.com Date: 2015-01-10T13:54:05Z the improvement of exception handling commit 63fd360cc1284189892d5aca8f42a49db9cbf0f0 Author: GenTang gen.tan...@gmail.com Date: 2015-01-10T20:00:26Z typo
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3986#issuecomment-69473316 However, I met a really strange error a moment ago. I launched a cluster containing 1 master and 1 slave with the script. Adding the tag to the master succeeded after two tries, and adding the tag to the slave succeeded without throwing an error. However, EC2 threw an `InvalidInstanceID.NotFound` error for the slave node at:
```
for i in cluster_instances:
    i.update()
```
in the wait_for_cluster_state function. It seems that the instance information had not yet propagated for the update action, while it had already reached the point where the add_tag action could succeed. I tried several times; it happened only once, and I am not quite sure why. As wait_for_cluster_state is used for the `launch`, `start` (these two need more than 1 minute to reach the `ssh-ready` state) and `destroy` (it needs about 1 second to reach the `terminated` state) actions, maybe the workaround is to add some more waiting time before the update action by making the following change at line 724:
```
while True:
    time.sleep(5 * num_attempts + 1)
```
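The suggested workaround can be sketched as a retry loop with a linearly growing sleep of `5 * num_attempts + 1` seconds. The `update_with_backoff` helper and `fake_update` stub below are hypothetical stand-ins (boto's `Instance.update()` is not available here); the sleep function is injected so the waits can be inspected:

```python
import time

# Hedged sketch: retry an update call, sleeping 5 * num_attempts + 1
# seconds before each attempt, so EC2 has time to propagate metadata.
def update_with_backoff(update, max_attempts=5, sleep=time.sleep):
    num_attempts = 0
    while True:
        sleep(5 * num_attempts + 1)   # 1s, 6s, 11s, ...
        try:
            return update()
        except Exception:
            num_attempts += 1
            if num_attempts >= max_attempts:
                raise

# Stub standing in for boto's Instance.update(): fails twice (simulating
# the propagation delay), then succeeds.
calls = {"n": 0}
def fake_update():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("InvalidInstanceID.NotFound")
    return "running"

waits = []
state = update_with_backoff(fake_update, sleep=waits.append)
print(state, waits)  # -> running [1, 6, 11]
```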
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang closed the pull request at: https://github.com/apache/spark/pull/3986
[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3986#issuecomment-69473464 Yes, boto prints the error even when we catch the exception, but the script continues and the cluster is successfully launched. It is just ugly to have such error information on the screen. FYI, this used boto 2.34.0.
[GitHub] spark pull request: [SPARK-4983]Tag EC2 instances in the same call...
Github user GenTang commented on a diff in the pull request: https://github.com/apache/spark/pull/3986#discussion_r22751730
```
--- Diff: ec2/spark_ec2.py ---
@@ -569,15 +569,28 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)

-    # Give the instances descriptive names
+    # Give the instances descriptive names.
+    # The code of handling exceptions corresponds to issue [SPARK-4983]
     for master in master_nodes:
-        master.add_tag(
-            key='Name',
-            value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
+        while True:
+            try:
+                master.add_tag(
+                    key='Name',
+                    value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
+            except:
+                pass
```
--- End diff -- I think that it takes some time for EC2 to return an instance-not-found exception; that's why I left `pass` in the exception handler. However, maybe we should add a small wait time to ensure that we don't submit too many requests to EC2. Yes, here we just want to catch the instance-not-found exception. You are right, it is better to use a specific exception. I will work on this.
[GitHub] spark pull request: [SPARK-4983]Tag EC2 instances in the same call...
GitHub user GenTang opened a pull request: https://github.com/apache/spark/pull/3986 [SPARK-4983]Tag EC2 instances in the same call that launches them As the boto API doesn't support tagging EC2 instances in the same call that launches them, we use exception-handling code to wait until EC2 has had enough time to propagate the information. You can merge this pull request into a Git repository by running: $ git pull https://github.com/GenTang/spark spark-4983 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3986.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3986 commit 6adcf6d28d21c14662d6f92d550614bc4548eb5d Author: GenTang gen.tan...@gmail.com Date: 2015-01-09T23:01:09Z handling exceptions about adding tags to ec2
[GitHub] spark pull request: The improvement of python converter for hbase(...
GitHub user GenTang opened a pull request: https://github.com/apache/spark/pull/3920 The improvement of python converter for hbase(examples) Hi, Following the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-td10001.html, I made some modifications to three files in the examples package: 1. HBaseConverters.scala: the new converter converts all the records in an HBase Result into a single string 2. hbase_input.py: as the value string may contain several records, we can use the ast package to convert the string into a dict 3. HBaseTest.scala: as the examples package uses HBase 0.98.7, the original HTableDescriptor constructor is deprecated; it is updated to the new constructor. You can merge this pull request into a Git repository by running: $ git pull https://github.com/GenTang/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3920.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3920 commit 3253b61e4a740233154c54458fa68dbec5a2b004 Author: GenTang gen.tan...@gmail.com Date: 2015-01-07T02:04:13Z the modification accompanying the improvement of pythonconverter commit ceb31c5e1619e56b91a8d03b4edd09c7c17813ce Author: GenTang gen.tan...@gmail.com Date: 2015-01-07T02:05:24Z the modification for adapting updation of hbase commit 5cbbcfc3e4034f824ae94e84374b4f505ee15a63 Author: GenTang gen.tan...@gmail.com Date: 2015-01-07T02:06:00Z the improvement of pythonconverter
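Step 2 above can be sketched with the standard `ast` module: `ast.literal_eval` safely turns the converter's single string back into a Python dict. The sample value string below is illustrative, not the converter's exact output:

```python
import ast

# Hedged sketch: the converter emits one string per HBase Result that
# holds several records; ast.literal_eval parses it back into a dict.
# The "columnFamily:qualifier" keys below are illustrative assumptions.
value = "{'cf1:col1': 'a', 'cf1:col2': 'b'}"
record = ast.literal_eval(value)
print(record["cf1:col1"])  # -> a
```

Unlike `eval`, `literal_eval` only accepts Python literals, so a malformed or malicious value string raises an error instead of executing code.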
[GitHub] spark pull request: [SPARK-5090] The improvement of python convert...
Github user GenTang commented on the pull request: https://github.com/apache/spark/pull/3920#issuecomment-68969946 @MLnick Hi Nick Pentreath, please review. Thanks ^^