[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968596#comment-14968596 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

> df.recipes <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "datahub", table = "person_recipes")
> someRecords <- head(df.recipes)
> mapField <- someRecords$recipes[[1]]
> ls(mapField)
 [1] "1000"  "1100"  "12000" "18000" "2000"  "22000" "22074" "24000"
 [9] "28000" "3000"  "33000" "44000" "45000" "47000" "48000" "49000"
[17] "5000"  "51000" "53000" "55000" "56000" "57000" "57076" "6"
[25] "63000" "64000" "65000" "66000" "67000" "73000" "75000" "79000"
[33] "8"     "82000" "83000" "84000" "87000" "89000" "9"     "999000"
> mapField[["1000"]]
[1] 0

Success! Thanks, guys. R environment data types are pretty cool; I didn't know they existed.

> SparkR support for cassandra collection types.
> ----------------------------------------------
>
>                 Key: SPARK-11190
>                 URL: https://issues.apache.org/jira/browse/SPARK-11190
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.5.1
>         Environment: SparkR Version: 1.5.1
>                      Cassandra Version: 2.1.6
>                      R Version: 3.2.2
>                      Cassandra Connector version: 1.5.0-M2
>            Reporter: Bilind Hajer
>              Labels: cassandra, dataframe, sparkR
>
> I want to create a data frame from a Cassandra keyspace and column family in sparkR.
> I am able to create data frames from tables which do not include any Cassandra collection data types, such as Map, Set, and List. But many of the schemas that I need data from do include these collection data types.
> Here is my local environment:
> SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2
> Cassandra Connector version: 1.5.0-M2
> To test this issue, I did the following iterative process.
> sudo ./sparkR --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf spark.cassandra.connection.host=127.0.0.1
> Running sparkR with this command gives me access to the spark-cassandra-connector package I need, and connects me to my local cqlsh server (which is up and running while running this code in the sparkR shell).
> CREATE TABLE test_table (
>   column_1  int,
>   column_2  text,
>   column_3  float,
>   column_4  uuid,
>   column_5  timestamp,
>   column_6  boolean,
>   column_7  timeuuid,
>   column_8  bigint,
>   column_9  blob,
>   column_10 ascii,
>   column_11 decimal,
>   column_12 double,
>   column_13 inet,
>   column_14 varchar,
>   column_15 varint,
>   PRIMARY KEY( ( column_1, column_2 ) )
> );
> All of the above data types are supported. I insert dummy data after creating this test schema.
> For example, now in my sparkR shell, I run the following code.
> df.test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "datahub", table = "test_table")
> This assigns with no errors. Then:
> > schema(df.test)
> StructType
> |-name = "column_1", type = "IntegerType", nullable = TRUE
> |-name = "column_2", type = "StringType", nullable = TRUE
> |-name = "column_10", type = "StringType", nullable = TRUE
> |-name = "column_11", type = "DecimalType(38,18)", nullable = TRUE
> |-name = "column_12", type = "DoubleType", nullable = TRUE
> |-name = "column_13", type = "InetAddressType", nullable = TRUE
> |-name = "column_14", type = "StringType", nullable = TRUE
> |-name = "column_15", type = "DecimalType(38,0)", nullable = TRUE
> |-name = "column_3", type = "FloatType", nullable = TRUE
> |-name = "column_4", type = "UUIDType", nullable = TRUE
> |-name = "column_5", type = "TimestampType", nullable = TRUE
> |-name = "column_6", type = "BooleanType", nullable = TRUE
> |-name = "column_7", type = "UUIDType", nullable = TRUE
> |-name = "column_8", type = "LongType", nullable = TRUE
> |-name = "column_9", type = "BinaryType", nullable = TRUE
> The schema is correct.
> > class(df.test)
> [1] "DataFrame"
> attr(,"package")
> [1] "SparkR"
> df.test is clearly defined to be a DataFrame object.
> > head(df.test)
>   column_1 column_2 column_10 column_11 column_12 column_13 column_14 column_15
> 1        1    hello        NA        NA        NA        NA        NA        NA
>   column_3 column_4 column_5 column_6 column_7 column_8 column_9
> 1      3.4       NA       NA       NA       NA       NA       NA
> sparkR is reading from the column_family correctly, but now let's add a collection data type to the schema.
> Now I will drop that test_table and recreate the table with an extra column of data type map.
> CREATE TABLE test_table (
>   column_1 int,
>   column_2 text,
>   column_3 float,
>   column_4 uuid,
>   column_5 timestamp,
>   column_6 boolean,
>   column_7 timeuuid,
>   column_8 bigint,
>   column_9 blob,
>   column_10 ...
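As a usage footnote to the session above: the environment can be turned into an ordinary named list for easier downstream handling (a minimal sketch; df.recipes and the keys come from the transcript, and as.list() on an environment is base R):

  mapField <- someRecords$recipes[[1]]   # an R environment, per the transcript
  recipes  <- as.list(mapField)          # environment -> named list
  recipes[["1000"]]                      # same lookup with list semantics -> 0
  names(recipes)                         # the map keys, as ls(mapField) showed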
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967979#comment-14967979 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

OK, got the master Spark version built. I am no longer getting the "Error in as.data.frame.default(x[[i]], optional = TRUE)" when reading a data frame from a Cassandra column family that contains collection data types. But, for example, a map field is now read back as something like an R environment.
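A quick way to confirm what such a field actually is, poking at a collected cell directly (a sketch; mapField stands for any collected map cell, as in the transcript earlier in the thread):

  class(mapField)    # "environment"
  ls.str(mapField)   # dump each key together with its stored value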
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967903#comment-14967903 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

I built Spark using mvn from the current master on GitHub. The Scala side builds fine, but when I open a sparkR shell, it seems I just have access to plain R, and not SparkR. Any ideas?
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967907#comment-14967907 ]

Shivaram Venkataraman commented on SPARK-11190:
-----------------------------------------------

You'll need to add the -Psparkr flag as described in https://github.com/apache/spark/tree/master/R#sparkr-development
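For reference, a minimal build sketch along the lines of the linked README (the exact command is an assumption drawn from that README; add Hadoop/YARN profiles as your environment requires):

  # Build Spark with the SparkR profile enabled; -Psparkr is the key flag.
  ./build/mvn -DskipTests -Psparkr clean package

  # Then relaunch the shell from the freshly built tree, as earlier in the thread:
  ./bin/sparkR --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf spark.cassandra.connection.host=127.0.0.1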
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968085#comment-14968085 ]

Shivaram Venkataraman commented on SPARK-11190:
-----------------------------------------------

Environments can be used like a hash map with string keys. See http://stackoverflow.com/a/8299417/4577954 for an example.
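A small base-R sketch of that environment-as-hash-map pattern (the keys and values here are made up for illustration):

  m <- new.env()                               # environments hash string keys
  m[["1000"]] <- 0                             # put
  assign("2000", 42, envir = m)                # equivalent put via assign()
  m[["1000"]]                                  # get -> 0
  ls(m)                                        # enumerate keys -> "1000" "2000"
  exists("3000", envir = m, inherits = FALSE)  # membership test -> FALSE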
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968079#comment-14968079 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

Well, this would be a map data type from Cassandra, and it should convert to the respective R data type when read in sparkR. I do not think R has a map data type?
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968031#comment-14968031 ]

Shivaram Venkataraman commented on SPARK-11190:
-----------------------------------------------

We do convert maps in Scala to environments in R when we convert the data. Is there a problem with the conversion?
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968543#comment-14968543 ]

Sun Rui commented on SPARK-11190:
---------------------------------

For a better user experience, I am wondering whether we can decorate the R environment converted from a Scala map with an overridden show() that prints its content like "Map(key1 -> value1, ...)", as in Scala?
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968594#comment-14968594 ]

Sun Rui commented on SPARK-11190:
---------------------------------

ls.str() is not automatic for print. But overriding print for a map can show its content automatically. For example, if we collect a DataFrame with a column of MapType:

> ldf <- collect(df)
> ldf
  col
1 <environment>

But if we assign a class "map" to the column after collect, and implement a print.map(), then we can get a similar content dump as a Scala DataFrame:

> ldf
  col
1 Map(...)

This will help reduce users' confusion.
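A minimal sketch of that print.map() idea in plain R (the class name "map" and the toy environment are assumptions for illustration, not SparkR's actual implementation):

  # A toy environment standing in for a collected MapType cell.
  m <- new.env()
  m[["1000"]] <- 0
  m[["2000"]] <- 42

  # Tag it with an S3 class and give that class a print method.
  class(m) <- c("map", class(m))
  print.map <- function(x, ...) {
    keys  <- ls(x)
    pairs <- vapply(keys, function(k) paste(k, "->", x[[k]]), character(1))
    cat("Map(", paste(pairs, collapse = ", "), ")\n", sep = "")
    invisible(x)
  }

  m   # auto-printing now shows: Map(1000 -> 0, 2000 -> 42)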
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965391#comment-14965391 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

Never mind, answered my own question. Will test on master and let you guys know. Thanks.
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965320#comment-14965320 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

Sorry about that, Sean Owen. So I'm assuming this fix is in the master branch for Spark 1.5.2?
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965327#comment-14965327 ]

Shivaram Venkataraman commented on SPARK-11190:
-----------------------------------------------

The changes I mentioned are not in 1.5.2. They are in the master branch, which will become 1.6.0 when the next release happens. We don't know if the changes fix your problem, though, which is why it'd be great if you could test against the master branch.
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965363#comment-14965363 ]

Bilind Hajer commented on SPARK-11190:
--------------------------------------

How can I get access to the master branch? Would I be able to clone the current repo and test locally?
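Yes: master is the default branch of the public Apache mirror, so a plain clone is enough (a sketch; build with -Psparkr, as noted earlier in the thread, before launching sparkR):

  git clone https://github.com/apache/spark.git
  cd spark
  # master is checked out by default; no extra checkout needed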
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964032#comment-14964032 ]

Shivaram Venkataraman commented on SPARK-11190:
-----------------------------------------------

cc [~sunrui] Could you try this on the master branch? We recently added support for Lists, Maps, etc. in the master branch.
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964397#comment-14964397 ]

Sun Rui commented on SPARK-11190:
---------------------------------

Yes, please try it on the latest master branch.