Re: Creating Spark DataFrame from large pandas DataFrame

2015-08-21 Thread ayan guha
The easiest option I found is to put the jars in SPARK_CLASSPATH.
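For example (a sketch, assuming a bash shell; the jar paths below are placeholders for wherever the jars actually live):

```shell
# Put spark-csv and its commons-csv dependency on Spark's classpath before
# launching; both jars are needed, which is why --jars alone fails below.
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/opt/jars/spark-csv_2.11-1.2.0.jar:/opt/jars/commons-csv-1.1.jar"
# then launch the shell as usual:
# pyspark
```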
On 21 Aug 2015 06:20, Burak Yavuz brk...@gmail.com wrote:

 If you would like to try using spark-csv, please use
 `pyspark --packages com.databricks:spark-csv_2.11:1.2.0`

 You're missing a dependency.

 Best,
 Burak






Re: Creating Spark DataFrame from large pandas DataFrame

2015-08-20 Thread Burak Yavuz
If you would like to try using spark-csv, please use
`pyspark --packages com.databricks:spark-csv_2.11:1.2.0`

You're missing a dependency.

Best,
Burak




Creating Spark DataFrame from large pandas DataFrame

2015-08-20 Thread Charlie Hack
Hi,

I'm new to Spark and am trying to create a Spark DataFrame from a pandas
DataFrame with ~5 million rows, using Spark 1.4.1.

When I type:

df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(pandas_df), None))

(the .where call is a workaround I found on the Spark JIRA to avoid NaN
values producing mixed column types)

I get:

TypeError: cannot create an RDD from type: type 'list'

Converting a smaller pandas DataFrame (~2,000 rows) works fine. Has anyone
run into this?
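For reference, the NaN-to-None cleanup above can be sketched on a tiny stand-in frame (the column names are made up, and the createDataFrame call is left commented since it needs a running SQLContext):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real ~5M-row DataFrame.
pandas_df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})

# Keep values where they are non-null, replace NaN with None so Spark's
# row-by-row schema inference sees a single type per column.
clean_df = pandas_df.where(pd.notnull(pandas_df), None)

# With a live SQLContext this would then be:
# spark_df = sqlContext.createDataFrame(clean_df)
```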


This is already a workaround: ideally I'd read the Spark DataFrame from a
Hive table, but that's currently not an option for my setup.

I also tried reading the data into Spark from a CSV using spark-csv, but I
haven't been able to make that work yet either. I launch:

$ pyspark --jars path/to/spark-csv_2.11-1.2.0.jar

and when I attempt to read the CSV I get:

Py4JJavaError: An error occurred while calling o22.load. :
java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat ...

Other options I can think of:

- Convert my CSV to JSON (using Pig?) and read that into Spark
- Read it in over a JDBC connection from Postgres
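For the JDBC route, a minimal sketch might look like the following (the connection details and table name are made up, and the actual read call is commented out since it needs a running SQLContext plus the Postgres JDBC driver on the classpath):

```python
# Hypothetical Postgres connection details; adjust to the real instance.
jdbc_url = "jdbc:postgresql://localhost:5432/mydb"
props = {
    "user": "spark",
    "password": "secret",
    "driver": "org.postgresql.Driver",
}

# With Spark 1.4's DataFrameReader this would be roughly:
# df = sqlContext.read.jdbc(url=jdbc_url, table="my_table", properties=props)
```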

But I want to make sure I'm not misusing Spark or missing something obvious.

Thanks!

Charlie