Re: Creating Spark DataFrame from large pandas DataFrame
The easiest option I found was to put the jars in SPARK_CLASSPATH.

On 21 Aug 2015 06:20, Burak Yavuz brk...@gmail.com wrote: (quoted reply below)
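For concreteness, a minimal sketch of that classpath approach (the jar paths and the commons-csv version here are assumptions, not from the thread — substitute your local copies):

```shell
# Sketch only: paths below are placeholders for wherever your jars live.
# The NoClassDefFoundError further down suggests spark-csv's commons-csv
# dependency must be on the classpath too, not just spark-csv itself.
export SPARK_CLASSPATH="/opt/jars/spark-csv_2.11-1.2.0.jar:/opt/jars/commons-csv-1.1.jar"
# Then launch as usual; Spark 1.x reads SPARK_CLASSPATH at startup:
#   pyspark
```

Note that `--packages`, as Burak suggests below, resolves the transitive commons-csv dependency automatically, so this is mainly useful when the machine can't reach a Maven repository.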
Re: Creating Spark DataFrame from large pandas DataFrame
If you would like to try spark-csv, please use `pyspark --packages com.databricks:spark-csv_2.11:1.2.0`. You're missing a dependency.

Best,
Burak

On Thu, Aug 20, 2015 at 1:08 PM, Charlie Hack charles.t.h...@gmail.com wrote: (original message below)
Creating Spark DataFrame from large pandas DataFrame
Hi,

I'm new to Spark and am trying to create a Spark DataFrame from a pandas DataFrame with ~5 million rows, on Spark 1.4.1. When I type:

    df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(pandas_df), None))

(the `.where` is a hack I found on the Spark JIRA to avoid a problem with NaN values producing mixed column types) I get:

    TypeError: cannot create an RDD from type: <type 'list'>

Converting a smaller pandas DataFrame (~2000 rows) works fine. Has anyone else hit this?

This is already a workaround -- ideally I'd read the Spark DataFrame from a Hive table, but that's currently not an option for my setup.

I also tried reading the data into Spark from a CSV using spark-csv, but haven't been able to make that work yet. I launch

    $ pyspark --jars path/to/spark-csv_2.11-1.2.0.jar

and when I attempt to read the CSV I get:

    Py4JJavaError: An error occurred while calling o22.load.
    : java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
    ...

Other options I can think of:

- Convert my CSV to JSON (using Pig?) and read that into Spark
- Read it in over a JDBC connection from Postgres

But I want to make sure I'm not misusing Spark or missing something obvious.

Thanks!
Charlie
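As an aside for anyone landing here: the `.where(pd.notnull(...), None)` hack above is easy to try on a toy frame first, independent of Spark. The column names and values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for pandas_df; names and values are illustrative only.
pdf = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", None, "c"]})

# The JIRA workaround quoted above: everywhere pd.notnull(pdf) is False
# (i.e. at each NaN), substitute None, so Spark's schema inference doesn't
# have to deal with float NaN markers mixed into otherwise non-float columns.
clean = pdf.where(pd.notnull(pdf), None)
```

Non-null cells pass through untouched; only the missing entries are rewritten.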