[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968596#comment-14968596
 ] 

Bilind Hajer commented on SPARK-11190:
--

> df.recipes <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "datahub", table = "person_recipes")
> someRecords <- head(df.recipes)
> mapField <- someRecords$recipes[[1]]

> ls( mapField ) 
 [1] "1000"   "1100"   "12000"  "18000"  "2000"   "22000"  "22074"  "24000" 
 [9] "28000"  "3000"   "33000"  "44000"  "45000"  "47000"  "48000"  "49000" 
[17] "5000"   "51000"  "53000"  "55000"  "56000"  "57000"  "57076"  "6" 
[25] "63000"  "64000"  "65000"  "66000"  "67000"  "73000"  "75000"  "79000" 
[33] "8"  "82000"  "83000"  "84000"  "87000"  "89000"  "9"  "999000"

> mapField[["1000"]]
[1] 0

Success! Thanks, guys. R environment data types are pretty cool; I didn't know 
they existed.
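
For anyone else landing here, a minimal standalone sketch of the same lookup 
pattern (the environment and its keys below are invented stand-ins for what 
SparkR returns for a map column):

# Stand-in for someRecords$recipes[[1]]: the map arrives as an R
# environment keyed by strings (keys and values here are made up).
mapField <- new.env()
mapField[["1000"]] <- 0
mapField[["1100"]] <- 2

ls(mapField)                          # all keys, sorted
mapField[["1000"]]                    # single lookup
mget(ls(mapField), envir = mapField)  # all key/value pairs as a named list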

> SparkR support for cassandra collection types. 
> ---
>
> Key: SPARK-11190
> URL: https://issues.apache.org/jira/browse/SPARK-11190
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
>Reporter: Bilind Hajer
>  Labels: cassandra, dataframe, sparkR
>
> I want to create a data frame from a Cassandra keyspace and column family in 
> sparkR. 
> I am able to create data frames from tables which do not include any Cassandra 
> collection data types, such as Map, Set, and List. But many of the schemas I 
> need data from do include these collection data types. 
> Here is my local environment. 
> SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
> To test this issue, I did the following iterative process. 
> sudo ./sparkR --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf spark.cassandra.connection.host=127.0.0.1
> Running sparkR with this command gives me access to the Spark Cassandra 
> Connector package I need and connects me to my local Cassandra server (which 
> is up and running while I run this code in the sparkR shell). 
> CREATE TABLE test_table (
>   column_1 int,
>   column_2 text,
>   column_3 float,
>   column_4 uuid,
>   column_5 timestamp,
>   column_6 boolean,
>   column_7 timeuuid,
>   column_8 bigint,
>   column_9 blob,
>   column_10   ascii,
>   column_11   decimal,
>   column_12   double,
>   column_13   inet,
>   column_14   varchar,
>   column_15   varint,
>   PRIMARY KEY( ( column_1, column_2 ) )
> ); 
> All of the above data types are supported. I insert dummy data after creating 
> this test schema. 
> Now, in my sparkR shell, I run the following code: 
> df.test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "datahub", table = "test_table")
> The assignment completes with no errors. Then: 
> > schema(df.test)
> StructType
> |-name = "column_1", type = "IntegerType", nullable = TRUE
> |-name = "column_2", type = "StringType", nullable = TRUE
> |-name = "column_10", type = "StringType", nullable = TRUE
> |-name = "column_11", type = "DecimalType(38,18)", nullable = TRUE
> |-name = "column_12", type = "DoubleType", nullable = TRUE
> |-name = "column_13", type = "InetAddressType", nullable = TRUE
> |-name = "column_14", type = "StringType", nullable = TRUE
> |-name = "column_15", type = "DecimalType(38,0)", nullable = TRUE
> |-name = "column_3", type = "FloatType", nullable = TRUE
> |-name = "column_4", type = "UUIDType", nullable = TRUE
> |-name = "column_5", type = "TimestampType", nullable = TRUE
> |-name = "column_6", type = "BooleanType", nullable = TRUE
> |-name = "column_7", type = "UUIDType", nullable = TRUE
> |-name = "column_8", type = "LongType", nullable = TRUE
> |-name = "column_9", type = "BinaryType", nullable = TRUE
> The schema is correct. 
> > class(df.test)
> [1] "DataFrame"
> attr(,"package")
> [1] "SparkR"
> df.test is clearly a SparkR DataFrame object. 
> > head(df.test)
>   column_1 column_2 column_10 column_11 column_12 column_13 column_14 column_15
> 1        1    hello        NA        NA        NA        NA        NA        NA
>   column_3 column_4 column_5 column_6 column_7 column_8 column_9
> 1      3.4       NA       NA       NA       NA       NA       NA
> sparkR is reading from the column_family correctly, but now let's add a 
> collection data type to the schema. 
> Now I will drop that test_table and recreate the table with an extra column 
> of data type  map
> CREATE TABLE test_table (
>   column_1 int,
>   column_2 text,
>   column_3 float,
>   column_4 uuid,
>   column_5 timestamp,
>   column_6 boolean,
>   column_7 timeuuid,
>   column_8 bigint,
>   column_9 blob,
>   column_10

[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967979#comment-14967979
 ] 

Bilind Hajer commented on SPARK-11190:
--

OK, I got the master version of Spark built, and I am no longer getting the 
Error in as.data.frame.default(x[[i]], optional = TRUE) when reading a data 
frame from a Cassandra column family that contains collection data types. 

But, for example, for a map it is reading the field as something like 



[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967903#comment-14967903
 ] 

Bilind Hajer commented on SPARK-11190:
--

I built Spark using mvn from the current master on GitHub. The Scala build 
succeeds, but when I open a sparkR shell, it seems I only have access to plain 
R, not SparkR. Any ideas? 


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967907#comment-14967907
 ] 

Shivaram Venkataraman commented on SPARK-11190:
---

You'll need to add the -Psparkr flag as described in 
https://github.com/apache/spark/tree/master/R#sparkr-development


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968085#comment-14968085
 ] 

Shivaram Venkataraman commented on SPARK-11190:
---

Environments can be used like a hash map with string keys. See 
http://stackoverflow.com/a/8299417/4577954 for an example.
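
To make that concrete, a small plain-R sketch (nothing SparkR-specific; the 
keys and values are invented):

h <- new.env(hash = TRUE)       # an environment acts as a mutable hash map

h[["apple"]]  <- 1              # insert
h[["banana"]] <- 2
h[["apple"]]                    # lookup     -> 1
exists("banana", envir = h)     # membership -> TRUE
rm("apple", envir = h)          # delete
ls(h)                           # remaining keys -> "banana"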


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968079#comment-14968079
 ] 

Bilind Hajer commented on SPARK-11190:
--

Well, this would be a map data type coming from Cassandra, and it should 
convert to the corresponding R data type when read in sparkR. I don't think R 
has a native map data type? 


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968031#comment-14968031
 ] 

Shivaram Venkataraman commented on SPARK-11190:
---

We do convert maps in Scala to environments in R when we convert the data. Is 
there a problem with the conversion? 


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968543#comment-14968543
 ] 

Sun Rui commented on SPARK-11190:
-

For a better user experience, I am wondering whether we can decorate the R 
environment converted from a Scala map with an overridden show() that prints 
its contents like "Map(key1 -> value1, ...)", as Scala does.


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968594#comment-14968594
 ] 

Sun Rui commented on SPARK-11190:
-

ls.str() is not automatic for print. But overriding print for a map can show 
its content automatically. For example, if we collect a DataFrame with a column 
of MapType: 
> ldf <- collect(df)
> ldf
   col

But if we assign a class "map" to the column after collect and implement a 
print.map(), then we get a content dump similar to a Scala DataFrame's:
> ldf
   col
  Map

This will help reduce users' confusion.
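
A rough sketch of that idea (the class name "map" and the sample data are 
illustrative assumptions, not an existing SparkR implementation):

# Tag a collected map cell (an environment) with an S3 class and give that
# class a print method, so printing shows the contents instead of a pointer.
print.map <- function(x, ...) {
  keys  <- ls(x)
  pairs <- vapply(keys, function(k) paste(k, "->", x[[k]]), character(1))
  cat("Map(", paste(pairs, collapse = ", "), ")\n", sep = "")
  invisible(x)
}

m <- new.env()                  # stand-in for one cell of ldf$col
m[["1000"]] <- 0
m[["1100"]] <- 2
class(m) <- c("map", class(m))
print(m)                        # Map(1000 -> 0, 1100 -> 2)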


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-20 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965391#comment-14965391
 ] 

Bilind Hajer commented on SPARK-11190:
--

Never mind, I answered my own question. I will test on master and let you guys 
know. Thanks. 


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-20 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965320#comment-14965320
 ] 

Bilind Hajer commented on SPARK-11190:
--

Sorry about that, Sean Owen. So I'm assuming this fix is in the master branch, 
i.e. Spark 1.5.2? 


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-20 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965327#comment-14965327
 ] 

Shivaram Venkataraman commented on SPARK-11190:
---

The changes I mentioned are not in 1.5.2. They are in the master branch, which 
will become 1.6.0 when the next release happens. We don't know whether the 
changes fix your problem, though, which is why it'd be great if you could test 
against the master branch.


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-20 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965363#comment-14965363
 ] 

Bilind Hajer commented on SPARK-11190:
--

How can I get access to the master branch? Would I be able to clone the current 
repo and test on my local machine?


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964032#comment-14964032
 ] 

Shivaram Venkataraman commented on SPARK-11190:
---

cc [~sunrui] Could you try this on the master branch? We recently added 
support for Lists, Maps, etc. in the master branch.


[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-19 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964397#comment-14964397
 ] 

Sun Rui commented on SPARK-11190:
-

Yes, please try it on the latest master branch.
