[GitHub] spark pull request: Test

2015-07-28 Thread GenTang
GitHub user GenTang opened a pull request:

https://github.com/apache/spark/pull/7726

Test



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GenTang/spark test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7726.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7726


commit 3253b61e4a740233154c54458fa68dbec5a2b004
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-07T02:04:13Z

the modification accompanying the improvement of pythonconverter

commit ceb31c5e1619e56b91a8d03b4edd09c7c17813ce
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-07T02:05:24Z

the modification for adapting updation of hbase

commit 5cbbcfc3e4034f824ae94e84374b4f505ee15a63
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-07T02:06:00Z

the improvement of pythonconverter

commit 15b1fe33caf1c2524afed7be6e99a903dfebb356
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-22T23:16:50Z

the modification of comments

commit 21de6534899cf8d0379f03d46f128cde9600c087
Author: GenTang gen.tan...@gmail.com
Date:   2015-02-14T21:59:59Z

return the string in json format

commit 62df7f08671c5317fb153e8edf77bf9d9919
Author: GenTang gen.tan...@gmail.com
Date:   2015-02-14T22:15:57Z

remove the comment

commit 48024815917237ed446c0a778771be101f3e0fc0
Author: GenTang gen.tan...@gmail.com
Date:   2015-02-15T23:25:23Z

dump the result into a singl String

commit d2153df5f5472f34b427a39e8fd450f7d37d2fc6
Author: GenTang gen.tan...@gmail.com
Date:   2015-02-23T13:39:41Z

import JSONObject precisely

commit e5fe7d62fb78b94a3e704cd8cb0718ef23c68f06
Author: Gen TANG gen.tan...@gmail.com
Date:   2015-07-28T14:12:57Z

update

commit 931d3d6029ee0d5515356233ba9808cfd9f5c7f1
Author: Gen TANG gen.tan...@gmail.com
Date:   2015-07-28T14:30:06Z

test







[GitHub] spark pull request: Test

2015-07-28 Thread GenTang
Github user GenTang closed the pull request at:

https://github.com/apache/spark/pull/7726





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-05-22 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3920#issuecomment-104841817
  
@davies I just tested it with the assembly built from the master branch; it works.
Thanks!





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-03-04 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3920#issuecomment-77252529
  
Hi, I am sorry to bother you all,
but is this pull request OK to merge, please?





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-23 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r25155799
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 ---
@@ -18,20 +18,34 @@
 package org.apache.spark.examples.pythonconverters
 
 import scala.collection.JavaConversions._
+import scala.util.parsing.json._
 
 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil
 
 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      Map(
+        "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)),
+        "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
+        "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
+        "timestamp" -> cell.getTimestamp.toString,
+        "type" -> Type.codeToType(cell.getTypeByte).toString,
+        "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell))
+      )
+    )
+    output.map(JSONObject(_).toString()).mkString("\n")
--- End diff --

`output` is a `Buffer[Map[String, String]]`, since there are several records
in an HBase `Result`. However, `JSONObject` has only one constructor,
`JSONObject(obj: Map[String, Any])`, so `JSONObject(output).toString()` would cause a compilation failure. ^^





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-23 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r25182187
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 ---
@@ -18,20 +18,34 @@
 package org.apache.spark.examples.pythonconverters
 
 import scala.collection.JavaConversions._
+import scala.util.parsing.json._
 
 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil
 
 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      Map(
+        "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)),
+        "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
+        "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
+        "timestamp" -> cell.getTimestamp.toString,
+        "type" -> Type.codeToType(cell.getTypeByte).toString,
+        "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell))
+      )
+    )
+    output.map(JSONObject(_).toString()).mkString("\n")
--- End diff --

Great! In fact, HBase itself escapes `\n` too. That's why I chose `\n` 
in the first place.
Thanks!
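
For illustration, a minimal sketch (mine, not the PR code, with made-up cell values) of why a newline-joined JSON string is convenient on the Python side: each line parses independently, and any `\n` inside a value is already escaped.

```
import json

# Two cells of one HBase row, newline-joined as the converter would emit them.
value = ('{"row": "row1", "columnFamily": "f1", "qualifier": "a", "value": "v1"}\n'
         '{"row": "row1", "columnFamily": "f1", "qualifier": "b", "value": "v2"}')
cells = [json.loads(line) for line in value.split("\n")]
assert cells[1]["qualifier"] == "b"
```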






[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-15 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r24727321
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 ---
@@ -18,20 +18,34 @@
 package org.apache.spark.examples.pythonconverters
 
 import scala.collection.JavaConversions._
+import scala.util.parsing.json._
 
 import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil
 
 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to an Array[String]
  */
-class HBaseResultToStringConverter extends Converter[Any, String] {
-  override def convert(obj: Any): String = {
+class HBaseResultToStringConverter extends Converter[Any, Array[String]] {
--- End diff --

You are right.
Earlier I tested the code in local mode, where it passed. In cluster mode, 
however, it does not.





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-14 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3920#issuecomment-74393987
  
@davies 
Now we return the string in JSON format. 
Therefore special characters such as `'` or `"` don't cause problems 
any more.





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-11 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3920#issuecomment-73889175
  
@davies @MLnick 
Perhaps this is not the right place to discuss it, but I tried the script 
hbase_outputformat.py with Spark 1.2.0 and it raised 
`java.lang.IncompatibleClassChangeError: Implementing class`. I guess it 
is caused by a conflict between the dependencies and the Java version.
Do you have any idea about this error? Or do we need to create a JIRA for it?





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-11 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r24495658
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 ---
@@ -23,15 +23,27 @@ import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil
 
 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      "{'columnFamily':'%s','qualifier':'%s','timestamp':'%s','type':'%s','value':'%s'}".format(
--- End diff --

@davies 
Hi, sorry for the delay.
Returning a string in JSON format is now done.
However, I am considering having the converter return an `Array[String]`, since 
there are several records for one row. What we do now is convert the 
`Buffer[String]` to a `String` in the converter, and then in Python we convert 
that string back to a `list` (or `tuple`). 
If the converter returned an `Array[String]` directly, there would be two advantages:
1. We would not need the string-to-list conversion in Python.
2. Right now we must add a separator character to join the 
`Buffer[String]` into a `String` (I am thinking of `\n`), so returning an 
`Array[String]` would be safer.

What do you think? Thanks 






[GitHub] spark pull request: [SPARK-4983]insert waiting time before tagging...

2015-02-08 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3986#issuecomment-73311374
  
@nchammas You are welcome.
I am more than happy that this PR could finally be done. ^^





[GitHub] spark pull request: [SPARK-4983]insert waiting time before tagging...

2015-02-06 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3986#discussion_r24263282
  
--- Diff: ec2/spark_ec2.py ---
@@ -569,6 +569,9 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)
 
+    # The insert of waiting time corresponds to issue [SPARK-4983]
--- End diff --

OK





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-05 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r24184482
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala 
---
@@ -36,7 +36,7 @@ object HBaseTest {
 // Initialize hBase table if necessary
 val admin = new HBaseAdmin(conf)
 if (!admin.isTableAvailable(args(0))) {
-  val tableDesc = new HTableDescriptor(args(0))
+  val tableDesc = new HTableDescriptor(TableName.valueOf(args(0)))
--- End diff --

Yes. The old constructor is deprecated.





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-05 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r24185498
  
--- Diff: examples/src/main/python/hbase_inputformat.py ---
@@ -72,6 +75,9 @@
 keyConverter=keyConv,
 valueConverter=valueConv,
 conf=conf)
+# hbase_rdd is a RDD[dict]
--- End diff --

In fact, every dict is one record of HBase. 
Therefore, by using RDD[dict], each element of the RDD corresponds to a record 
in HBase. Otherwise it would be impossible to tell which row, which column, 
etc. a given value belongs to.
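
To make the point concrete, here is a hypothetical record as the converter would hand it to Python (field values invented for the example): with one dict per HBase cell, the row and column of every value stay attached to it.

```
# Hypothetical cell, one dict per HBase record (values are made up).
record = {
    "row": "row1",
    "columnFamily": "f1",
    "qualifier": "a",
    "timestamp": "1401883411986",
    "type": "Put",
    "value": "value1",
}
# The row/column context travels with the value.
print(record["row"], record["columnFamily"], record["qualifier"], record["value"])
```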





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-05 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r24187737
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 ---
@@ -23,15 +23,27 @@ import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil
 
 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      "{'columnFamily':'%s','qualifier':'%s','timestamp':'%s','type':'%s','value':'%s'}".format(
--- End diff --

Aha... I didn't consider that. 
You are right: it is not safe to use `ast.literal_eval`.
As for your suggestion, do you mean that we transform the string into JSON in 
Python, or that the converter itself returns JSON? 
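
A minimal sketch of the fragility under discussion (my example, not the PR code): once an HBase value contains a quote character, the hand-formatted record no longer parses.

```
import ast

# A hand-built record whose value contains an apostrophe.
record = "{'columnFamily':'f1','value':'it's broken'}"
try:
    ast.literal_eval(record)
except SyntaxError:
    print("the embedded quote breaks parsing")
```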





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-05 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r24188212
  
--- Diff: examples/src/main/python/hbase_inputformat.py ---
@@ -72,6 +75,9 @@
 keyConverter=keyConv,
 valueConverter=valueConv,
 conf=conf)
+# hbase_rdd is a RDD[dict]
--- End diff --

I am sorry, I misunderstood your meaning. 
I thought you meant we should change the RDD.





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-02-05 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r24188381
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 ---
@@ -23,15 +23,27 @@ import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil
 
 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      "{'columnFamily':'%s','qualifier':'%s','timestamp':'%s','type':'%s','value':'%s'}".format(
--- End diff --

Great, I will do that. Thanks!





[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-31 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3986#discussion_r23890734
  
--- Diff: ec2/spark_ec2.py ---
@@ -569,6 +569,8 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)
 
+    # Wait for the information of the just-launched instances to be propagated within AWS
+    time.sleep(5)
--- End diff --

@nchammas 
Maybe we should insert a print here to tell the user that we are waiting 
for the information to be propagated, since there will be a 5-second pause 
between `launch master in` and `wait for the cluster to enter`. 
What do you think?
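
A sketch of that suggestion (assuming the 5-second sleep from the diff above; the message text is illustrative):

```
import time

# Tell the user why the script pauses between the two status messages.
print("Waiting for AWS to propagate instance metadata...")
time.sleep(5)
```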





[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-29 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3986#discussion_r23812644
  
--- Diff: ec2/spark_ec2.py ---
@@ -569,15 +569,34 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)
 
--- End diff --

Yeah, it is true. I will make a new commit as soon as possible. 
I think I will add a 5-second sleep before tagging the master and make some 
modifications to the comment.






[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-29 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3986#discussion_r23811498
  
--- Diff: ec2/spark_ec2.py ---
@@ -569,15 +569,34 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)
 
--- End diff --

I agree with you. Since we can run into the same AWS information-propagation 
problem in the `wait_for_cluster_state` function, I think that adding a 
waiting time is the better workaround. 
The only disadvantage is that we need about 10 more seconds to launch a cluster.





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-01-22 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r23414030
  
--- Diff: examples/src/main/python/hbase_inputformat.py ---
@@ -16,6 +16,7 @@
 #
 
 import sys
+import ast
 
 from pyspark import SparkContext
 
--- End diff --

Great idea. I will modify the sample data to show multiple records in a 
columnFamily and in a columnFamily:qualifier. 





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-01-22 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3920#discussion_r23413908
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 ---
@@ -23,15 +23,27 @@ import org.apache.spark.api.python.Converter
 import org.apache.hadoop.hbase.client.{Put, Result}
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
 import org.apache.hadoop.hbase.util.Bytes
+import org.apache.hadoop.hbase.KeyValue.Type
+import org.apache.hadoop.hbase.CellUtil
 
 /**
- * Implementation of [[org.apache.spark.api.python.Converter]] that converts an
- * HBase Result to a String
+ * Implementation of [[org.apache.spark.api.python.Converter]] that converts all
+ * the records in an HBase Result to a String
  */
 class HBaseResultToStringConverter extends Converter[Any, String] {
   override def convert(obj: Any): String = {
+    import collection.JavaConverters._
     val result = obj.asInstanceOf[Result]
-    Bytes.toStringBinary(result.value())
+    val output = result.listCells.asScala.map(cell =>
+      "{'columnFamliy':'%s','qualifier':'%s','timestamp':'%s','type':'%s','value':'%s'}".format(
+        Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
+        Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
+        cell.getTimestamp.toString,
+        Type.codeToType(cell.getTypeByte),
+        Bytes.toStringBinary(CellUtil.cloneValue(cell))
+      )
+    )
+    output.mkString(" ")
--- End diff --

In fact, in HBase a *{row, columnFamily:qualifier, version}* tuple 
exactly specifies a cell (a record), so there are usually several 
records per column family in a row. 
An HBase `Result` contains all the information about a row, and 
`Result.listCells` returns all the cells as a Java list of `hbase.Cell`, 
each of which carries all the information about one cell.





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-01-22 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3920#issuecomment-71121501
  
@MLnick I changed the sample data to show that we can have several records 
in one columnFamily. 

In fact, in HBase 0.96 and newer, the default maximum number of 
versions (timestamps) has been changed to 1, and in order to show all the versions 
in an HBase table we need to add *{"hbase.mapreduce.scan.maxversions": num}* 
to the conf. To me this is an HBase topic rather than a Spark one. That's why I 
don't show that we can have several versions of records in a `Result`.
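
For reference, a hedged sketch of the conf that hbase_inputformat.py would pass to request several versions; "hbase.mapreduce.scan.maxversions" is the real HBase key, while the host and table entries are placeholders for the demo.

```
# Scan configuration for newAPIHadoopRDD; quorum and table are placeholders.
conf = {
    "hbase.zookeeper.quorum": "localhost",
    "hbase.mapreduce.inputtable": "test",
    "hbase.mapreduce.scan.maxversions": "3",  # return up to 3 versions per cell
}
```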





[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-11 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3986#issuecomment-69490428
  
I understand why it still prints the exception information even though we catch 
the exception. It is because we call
```
logging.basicConfig()
```
in the script, so the exception information is added to the log even 
when it is caught.
Maybe it is a stupid question, but why do we use logging in this script at all, 
since we never add any logging output in the script?
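
A minimal sketch of the behaviour described above (my illustration, not spark_ec2.py itself): once `basicConfig()` installs a handler, whatever a library logs before raising still reaches the screen, even if the caller catches the exception.

```
import logging

logging.basicConfig()

def boto_like_call():
    # boto logs the HTTP error itself before raising the exception.
    logging.getLogger("boto").error("400 Bad Request")
    raise RuntimeError("InvalidInstanceID.NotFound")

try:
    boto_like_call()
except RuntimeError:
    pass  # exception swallowed, but the ERROR line was already printed
```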






[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-11 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3986#issuecomment-69506584
  
Yeah, thanks for your solution; it works.
However, I would prefer to remove the logging package entirely if we don't 
really use it.
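
The thread does not spell the suggested fix out; one standard way to keep boto's ERROR lines off the screen without removing logging entirely would be to raise that logger's level (an assumption on my part, not necessarily the solution referred to).

```
import logging

# Silence boto's ERROR output while leaving other logging intact.
logging.getLogger("boto").setLevel(logging.CRITICAL)
```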






[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-10 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3986#issuecomment-69456767
  
Since every EC2 failure raises an `EC2ResponseError`, we use `error_code` to 
identify the specific error of the instance not existing. 
If EC2 returns the instance-not-found error, we wait a short time and 
relaunch the request; otherwise, we re-raise the same error.






[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-10 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3986#issuecomment-69471651
  
Yes, I reproduced `InvalidInstanceID.NotFound` by changing the instance ID 
before the add_tag action and then changing it back to the correct ID. However, it 
prints the following error information on the screen:
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The 
instance ID 'i-5aa99fb1' does not 
exist</Message></Error></Errors><RequestID>7bfc5342-1fc9-4431-b569-190b2f9c9e8c</RequestID></Response>
It is printed by the boto package.






[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-10 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3986#discussion_r22761550
  
--- Diff: ec2/spark_ec2.py ---
@@ -569,15 +569,34 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)
 
-    # Give the instances descriptive names
+    # Give the instances descriptive names.
+    # The code of handling exceptions corresponds to issue [SPARK-4983]
     for master in master_nodes:
-        master.add_tag(
-            key='Name',
-            value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
+        while True:
+            try:
+                master.add_tag(
+                    key='Name',
+                    value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
+            except boto.exception.EC2ResponseError as e:
+                if e.error_code == "InvalidInstanceID.NotFound":
+                    time.sleep(0.1)
+                else:
+                    raise e
+            else:
+                break
     for slave in slave_nodes:
-        slave.add_tag(
-            key='Name',
-            value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
+        while True:
+            try:
+                slave.add_tag(
+                    key='Name',
+                    value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
+            except boto.exception.EC2ResponseError as e:
+                if e.error_code == "InvalidInstanceID.NotFound":
+                    time.Sleep(0.1)
+                else:
+                    raise e
+            else:
+                break
--- End diff --

Sorry for the typo. 





[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-10 Thread GenTang
GitHub user GenTang reopened a pull request:

https://github.com/apache/spark/pull/3986

[SPARK-4983]exception handling about adding tags to EC2 instance

Since the boto API does not support tagging EC2 instances in the same call that 
launches them, we use exception-handling code to wait until EC2 has had enough time 
to propagate the instance information. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GenTang/spark spark-4983

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3986.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3986


commit 6adcf6d28d21c14662d6f92d550614bc4548eb5d
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-09T23:01:09Z

handling exceptions about adding tags to ec2

commit 692fc2b72f19f1b02e710433f3d38388c5cc3a55
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-10T13:54:05Z

the improvement of exception handling

commit 63fd360cc1284189892d5aca8f42a49db9cbf0f0
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-10T20:00:26Z

typo







[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-10 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3986#issuecomment-69473316
  
However, I ran into a really strange error a moment ago.
I launched a cluster containing 1 master and 1 slave with the script. 
add_tag on the master succeeded after two tries, and add_tag on the slave succeeded 
without throwing the error. However, EC2 threw an 
`InvalidInstanceID.NotFound` error for the slave node at:
```
for i in cluster_instances:
    i.update()
```
in the wait_for_cluster_state function. It seems the instance information had 
not yet been propagated for the update action, while it had reached the 
point where the add_tag action could succeed.  
I tried several times and it happened only once; I am not entirely sure why. 
Since wait_for_cluster_state is used for the `launch` and `start` (these two 
need more than 1 minute to reach the `ssh-ready` state) and `destroy` (it needs about 1 
second to reach the `terminated` state) actions, maybe the workaround is to add 
some more waiting time before the update action with the following change:
```
while True:
    time.sleep(5 * num_attempts + 1) 
```
at line 724
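
One way to realize that proposal, as a sketch only (the function name and retry bound are illustrative; spark_ec2.py would catch boto's EC2ResponseError rather than a bare Exception):

```
import time

def wait_for_update(cluster_instances, max_attempts=9):
    # Linear back-off: give AWS progressively more time to propagate
    # instance metadata before each round of update() calls.
    for num_attempts in range(max_attempts):
        time.sleep(5 * num_attempts + 1)
        try:
            for i in cluster_instances:
                i.update()
            return
        except Exception:  # placeholder for boto.exception.EC2ResponseError
            continue
    raise RuntimeError("instance information never propagated")
```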






[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-10 Thread GenTang
Github user GenTang closed the pull request at:

https://github.com/apache/spark/pull/3986





[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-10 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3986#issuecomment-69473464
  
Yes, boto prints the error even though we catch the exception, but the script 
continues and the cluster is successfully launched. 
It is just ugly to have such error output on the screen.
FYI, this used boto 2.34.0.





[GitHub] spark pull request: [SPARK-4983]Tag EC2 instances in the same call...

2015-01-09 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3986#discussion_r22751730
  
--- Diff: ec2/spark_ec2.py ---
@@ -569,15 +569,28 @@ def launch_cluster(conn, opts, cluster_name):
     master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)
 
-    # Give the instances descriptive names
+    # Give the instances descriptive names.
+    # The code of handling exceptions corresponds to issue [SPARK-4983]
     for master in master_nodes:
-        master.add_tag(
-            key='Name',
-            value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
+        while True:
+            try:
+                master.add_tag(
+                    key='Name',
+                    value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
+            except:
+                pass
--- End diff --

I think that for some time after launch, EC2 keeps returning the 
instance-not-existing exception; that is why I left `pass` in the handler. However, 
maybe we should add a small wait time to ensure that we don't submit too many 
requests to EC2.

Yes, here we just want to catch the exception for the instance not existing. You 
are right that it is better to use a specific exception; I will work on this.







[GitHub] spark pull request: [SPARK-4983]Tag EC2 instances in the same call...

2015-01-09 Thread GenTang
GitHub user GenTang opened a pull request:

https://github.com/apache/spark/pull/3986

[SPARK-4983]Tag EC2 instances in the same call that launches them

Since the boto API does not support tagging EC2 instances in the same call that 
launches them, we use exception-handling code to wait until EC2 has had enough time 
to propagate the instance information. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GenTang/spark spark-4983

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3986.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3986


commit 6adcf6d28d21c14662d6f92d550614bc4548eb5d
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-09T23:01:09Z

handling exceptions about adding tags to ec2







[GitHub] spark pull request: The improvement of python converter for hbase(...

2015-01-06 Thread GenTang
GitHub user GenTang opened a pull request:

https://github.com/apache/spark/pull/3920

The improvement of python converter for hbase(examples)

Hi,

Following the discussion in 
http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-td10001.html,
 I made some modifications to three files in the examples package:
1. HBaseConverters.scala: the new converter converts all the records 
in an HBase Result into a single string
2. hbase_inputformat.py: as the value string may contain several records, we can 
use the ast package to convert the string into a dict
3. HBaseTest.scala: as the examples package uses HBase 0.98.7, the original 
HTableDescriptor constructor is deprecated; it has been updated to the new 
constructor


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GenTang/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3920.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3920


commit 3253b61e4a740233154c54458fa68dbec5a2b004
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-07T02:04:13Z

the modification accompanying the improvement of pythonconverter

commit ceb31c5e1619e56b91a8d03b4edd09c7c17813ce
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-07T02:05:24Z

the modification for adapting updation of hbase

commit 5cbbcfc3e4034f824ae94e84374b4f505ee15a63
Author: GenTang gen.tan...@gmail.com
Date:   2015-01-07T02:06:00Z

the improvement of pythonconverter







[GitHub] spark pull request: [SPARK-5090] The improvement of python convert...

2015-01-06 Thread GenTang
Github user GenTang commented on the pull request:

https://github.com/apache/spark/pull/3920#issuecomment-68969946
  
@MLnick 
Hi Nick Pentreath, please review. Thanks ^^

