Re: How to read LZO file in Spark?

2017-09-28 Thread Vida Ha
https://docs.databricks.com/spark/latest/data-sources/read-lzo.html On Wed, Sep 27, 2017 at 6:36 AM 孫澤恩 wrote: > Hi All, > Currently, I follow this blog http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/ so that I could use hdfs
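
For context, the usual approach is to point Spark at the LzoTextInputFormat from the hadoop-lzo library. A minimal sketch, assuming hadoop-lzo is installed on every node and `sc` is an existing SparkContext; the path is a placeholder, not from the thread:

    import org.apache.hadoop.io.{LongWritable, Text}
    import com.hadoop.mapreduce.LzoTextInputFormat

    // Read an LZO-compressed text file as an RDD of lines.
    // Requires the hadoop-lzo jar and native libraries on the cluster.
    val lines = sc.newAPIHadoopFile(
        "hdfs:///path/to/data.lzo",            // placeholder path
        classOf[LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map { case (_, text) => text.toString }

    lines.take(5).foreach(println)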

Re: How to join two PairRDD together?

2014-08-25 Thread Vida Ha
Can you paste the code? It's unclear to me how/when the out of memory is occurring without seeing the code. On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li gefeili.2...@gmail.com wrote: Hello everyone, I am porting a clustering algorithm to the Spark platform, and I have run into a problem
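
Since the original code never made it into this snippet, here is only a generic sketch of joining two pair RDDs by key (all names and data are made up). A heavily skewed key is a common cause of the out-of-memory the poster describes, because every value for one key lands in a single partition:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair RDD functions on older Spark (pre-1.3)

    object PairJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pair-join").setMaster("local[*]"))

        val labels   = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))   // (id, label)
        val features = sc.parallelize(Seq((1L, 0.5), (2L, 1.5), (4L, 2.5)))   // (id, value)

        // join keeps only keys present in both RDDs; a very hot key can blow
        // up a single executor's memory.
        val joined = labels.join(features)                                    // (id, (label, value))
        joined.collect().foreach(println)

        sc.stop()
      }
    }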

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Vida Ha
Hi, I doubt that the broadcast variable is your problem, since you are seeing: org.apache.spark.SparkException: Task not serializable Caused by: java.io.NotSerializableException: org.apache.spark.sql.hive.HiveContext$$anon$3 We have a knowledge base article that explains why this happens - it's
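
The gist is that anything referenced from inside a closure gets serialized with the task, so a HiveContext (or the object holding it) must stay on the driver. A purely illustrative, Spark 1.x-era sketch of the usual fix, not the poster's code; the table name and values are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    class Job(sc: SparkContext) {
      // @transient keeps the non-serializable HiveContext out of any task
      // that happens to serialize this Job instance.
      @transient lazy val hiveContext = new HiveContext(sc)

      def run(): Unit = {
        // Run the Hive query on the driver and keep only a small, serializable result.
        val validIds = hiveContext.sql("SELECT id FROM some_table")   // placeholder table
          .collect()
          .map(_.getLong(0))
          .toSet

        // Ship the plain Set via a broadcast, never the HiveContext itself.
        val validIdsBc = sc.broadcast(validIds)

        val kept = sc.parallelize(1L to 100L)
          .filter(id => validIdsBc.value.contains(id))

        println(kept.count())
      }
    }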

Re: java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper

2014-08-20 Thread Vida Ha
Hi Chris, We have a knowledge base article to explain what's happening here: https://github.com/databricks/spark-knowledgebase/blob/master/troubleshooting/javaionotserializableexception.md Let me know if the article is not clear enough - I would be happy to edit and improve it. -Vida On Wed,

Re: Writing to RabbitMQ

2014-08-18 Thread Vida Ha
Hi John, It seems like the original problem you had was that you were initializing the RabbitMQ connection on the driver, but then calling the code to write to RabbitMQ on the workers (I'm guessing, but I don't know since I didn't see your code). That's definitely a problem because the connection
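
The usual pattern is to create the connection on the workers themselves, once per partition, so nothing non-serializable ever leaves the driver. A sketch with the standard RabbitMQ Java client; the host, queue name, and `records` RDD are placeholders, since the original code wasn't included:

    import com.rabbitmq.client.ConnectionFactory

    records.foreachPartition { partition =>
      // One connection and channel per partition, created on the worker.
      val factory = new ConnectionFactory()
      factory.setHost("rabbitmq.example.com")                        // placeholder host
      val connection = factory.newConnection()
      val channel = connection.createChannel()
      channel.queueDeclare("results", true, false, false, null)      // durable queue

      try {
        partition.foreach { record =>
          channel.basicPublish("", "results", null, record.toString.getBytes("UTF-8"))
        }
      } finally {
        channel.close()
        connection.close()
      }
    }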

Re: Writing to RabbitMQ

2014-08-18 Thread Vida Ha
On Mon, Aug 18, 2014 at 4:25 PM, Vida Ha v...@databricks.com wrote: Hi John, It seems like the original problem you had was that you were initializing the RabbitMQ connection on the driver, but then calling the code to write to RabbitMQ on the workers (I'm guessing, but I don't know since I didn't

Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
The use case I was thinking of was outputting calculations made in Spark into a SQL database for the presentation layer to access. So in other words, having a Spark backend in Java that writes to a SQL database and then having a Rails front-end that can display the data nicely. On Thu, Aug 7,

Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
This is not as slow as you think, because Spark can write the output in parallel to S3, and Redshift, too, can load data from multiple files in parallel http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-single-copy-command.html . Nick On Thu, Aug 7, 2014 at 1:52 PM, Vida Ha v
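
A sketch of that flow, with the bucket, table, and credentials as placeholders: save the RDD so each partition becomes its own S3 object, then issue one COPY that reads the whole prefix.

    // Write one pipe-delimited part file per partition to S3.
    val rows = resultRdd.map { case (id, value) => s"$id|$value" }   // assumed (id, value) RDD
    rows.saveAsTextFile("s3n://my-bucket/spark-output/")

    // Then, from a SQL client connected to Redshift (not from Spark):
    //   COPY my_table
    //   FROM 's3://my-bucket/spark-output/part-'
    //   CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    //   DELIMITER '|';
    // Redshift loads the part files in parallel, per the best-practices link above.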

Save an RDD to a SQL Database

2014-08-05 Thread Vida Ha
Hi, I would like to save an RDD to a SQL database. It seems like this would be a common enough use case. Are there any built-in libraries to do it? Otherwise, I'm just planning on mapping my RDD and having that call a method to write to the database. Given that a lot of records are going to
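
The common pattern for this is exactly what the poster describes: write from each partition with plain JDBC, batching the inserts. A sketch with placeholder connection details and table, assuming an RDD of (Long, Double) pairs and that the JDBC driver jar is on the executors' classpath:

    import java.sql.DriverManager

    resultRdd.foreachPartition { rows =>
      // One connection per partition, opened on the worker.
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://db.example.com/mydb", "user", "password")  // placeholders
      conn.setAutoCommit(false)
      val stmt = conn.prepareStatement("INSERT INTO results (id, value) VALUES (?, ?)")
      try {
        rows.foreach { case (id, value) =>
          stmt.setLong(1, id)
          stmt.setDouble(2, value)
          stmt.addBatch()
        }
        stmt.executeBatch()
        conn.commit()
      } finally {
        stmt.close()
        conn.close()
      }
    }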