Re: Example of Geoprocessing with Spark

2014-09-20 Thread Abel Coronado Iruegas
Thanks, Evan and Andy. Here is a much more functional version; I still need to improve the syntax, but it works very well. The initial version took around 36 hours on 9 machines with 8 cores each, and this version takes 36 minutes on a cluster of 7 machines with 8 cores each: object SimpleApp { def …
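
The code is truncated in this preview. A minimal sketch of the kind of change that typically produces a speedup like this — broadcasting the municipality polygons to each executor once instead of serializing them into every task closure — might look like the following. This is an assumption, not the author's actual code: the Muni type, the bounding-box containment test, and all paths are illustrative placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleApp {
      // Placeholder polygon: a name plus a bounding box (minLon, minLat, maxLon, maxLat).
      // A real job would hold full polygon geometries; a box keeps the sketch self-contained.
      case class Muni(name: String, minLon: Double, minLat: Double,
                      maxLon: Double, maxLat: Double) {
        def covers(lon: Double, lat: Double): Boolean =
          lon >= minLon && lon <= maxLon && lat >= minLat && lat <= maxLat
      }

      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("SimpleApp"))

        // Loaded once on the driver (hypothetical sample data).
        val munis = Array(Muni("DemoMuni", -99.3, 19.1, -99.1, 19.3))
        val bcMunis = sc.broadcast(munis)  // shipped to each executor once

        sc.textFile("points.csv").mapPartitions { lines =>
          val ms = bcMunis.value           // local lookup, no per-task serialization
          lines.map { line =>
            val c = line.split(",")
            val (lat, lon) = (c(0).toDouble, c(1).toDouble)
            val tag = ms.find(_.covers(lon, lat)).map(_.name).getOrElse("NONE")
            line + "," + tag
          }
        }.saveAsTextFile("points_tagged")
      }
    }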

Re: Example of Geoprocessing with Spark

2014-09-19 Thread Abel Coronado Iruegas
Hi Evan, here is an improved version; thanks for your advice. But you know, the last step, the saveAsTextFile, is very slow. :( import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import java.net.URL import java.text.SimpleDateFormat import …
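
The code is cut off in this preview, but a slow saveAsTextFile at the end of a job like this is often dominated by the number of output partitions, since Spark writes one file per partition. A hedged sketch of the usual mitigation — coalescing before the write — follows; the partition count, paths, and the stand-in enrichment step are illustrative, not taken from the thread.

    import org.apache.spark.{SparkConf, SparkContext}

    object SaveCoalesced {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("SaveCoalesced"))
        // Stand-in for the real point-in-polygon enrichment step.
        val results = sc.textFile("points.csv").map(_ + ",TAGGED")
        // saveAsTextFile emits one file per partition; with many small
        // partitions the write itself becomes the bottleneck. Coalescing
        // first reduces the file count (56 = 7 machines x 8 cores here).
        results.coalesce(56).saveAsTextFile("points_tagged")
      }
    }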

Re: Example of Geoprocessing with Spark

2014-09-18 Thread Abel Coronado Iruegas
… Connection refused: axaxaxa-cloudera-s05.xxxnetworks.com/10.5.96.42:43942 On Mon, Sep 15, 2014 at 1:30 PM, Abel Coronado Iruegas acoronadoirue...@gmail.com wrote: Here is an example of working code that takes a CSV with lat/lon points and intersects it with the polygons of the municipalities of Mexico …

Example of Geoprocessing with Spark

2014-09-15 Thread Abel Coronado Iruegas
Here is an example of working code that takes a CSV with lat/lon points and intersects it with the polygons of the municipalities of Mexico, generating a new version of the file with new attributes. Do you think it could be improved? Thanks. The code: import org.apache.spark.SparkContext import …
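
Only the first imports survive in this preview. As a rough reconstruction of the overall shape of such a job — not the author's code; the ray-casting test and the hard-coded sample polygon are simplified stand-ins for real municipality geometries — consider:

    import org.apache.spark.{SparkConf, SparkContext}

    object GeoTagPoints {
      // A municipality as a name plus a closed ring of (lon, lat) vertices.
      case class Muni(name: String, ring: Array[(Double, Double)])

      // Standard ray-casting point-in-polygon test.
      def contains(ring: Array[(Double, Double)], x: Double, y: Double): Boolean = {
        var inside = false
        var j = ring.length - 1
        for (i <- ring.indices) {
          val (xi, yi) = ring(i)
          val (xj, yj) = ring(j)
          if ((yi > y) != (yj > y) &&
              x < (xj - xi) * (y - yi) / (yj - yi) + xi) inside = !inside
          j = i
        }
        inside
      }

      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("GeoTagPoints"))
        // Real polygons would come from a shapefile of Mexican municipalities;
        // one hard-coded triangle keeps the sketch runnable.
        val munis = Array(Muni("DemoMuni",
          Array((-99.3, 19.1), (-99.1, 19.1), (-99.2, 19.3))))

        sc.textFile("points.csv").map { line =>
          val c = line.split(",")
          val (lat, lon) = (c(0).toDouble, c(1).toDouble)
          val name = munis.find(m => contains(m.ring, lon, lat))
                          .map(_.name).getOrElse("NONE")
          line + "," + name  // append the new attribute to the original record
        }.saveAsTextFile("points_tagged")
      }
    }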

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-07-21 Thread Abel Coronado Iruegas
Hi Yifan, this works for me: export SPARK_JAVA_OPTS="-Xms10g -Xmx40g -XX:MaxPermSize=10g" export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar export SPARK_MEM=40g ./spark-shell Regards. On Mon, Jul 21, 2014 at 7:48 AM, Yifan LI iamyifa...@gmail.com wrote: Hi, I am trying to load …
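
For completeness, in Spark 1.x the same heap setting can be expressed in code rather than through the since-deprecated SPARK_MEM variable. A sketch, assuming a standalone application; the app name and the 40g value are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object MemoryConfExample {
      def main(args: Array[String]) {
        // spark.executor.memory replaces SPARK_MEM, which is deprecated in Spark 1.x.
        val conf = new SparkConf()
          .setAppName("MemoryConfExample")
          .set("spark.executor.memory", "40g")
        val sc = new SparkContext(conf)
        // ... load and process data here ...
        sc.stop()
      }
    }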

Understanding how to install in HDP

2014-07-09 Thread Abel Coronado Iruegas
Hi everybody, we have a Hortonworks cluster with many nodes and we want to test a deployment of Spark. What's the recommended path to follow? I mean, we can compile the sources on the NameNode, but I don't really understand how to distribute the executable JAR and configuration to the rest of the nodes.

SQL Filter of tweets (JSON) running on disk

2014-07-04 Thread Abel Coronado Iruegas
Hi everybody, can someone tell me whether it is possible to read and filter a 60 GB file of tweets (JSON docs) in a standalone Spark deployment that runs on a single machine with 40 GB of RAM and 8 cores? I mean, is it possible to configure Spark to work with some amount of memory (20 GB) and the rest …
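
Two points are worth noting, though neither comes from the thread itself: a plain textFile-plus-filter pass streams through the data and never needs the whole file in memory, and for reuse across queries Spark's MEMORY_AND_DISK storage level keeps what fits in the heap and spills the remainder to local disk. A minimal sketch, assuming line-delimited JSON and an illustrative string match:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object FilterTweets {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("FilterTweets"))
        val tweets = sc.textFile("tweets.json")  // 60 GB, one JSON doc per line
        // MEMORY_AND_DISK caches what fits in RAM and spills the remaining
        // partitions to local disk, so the cached set may exceed the heap.
        tweets.persist(StorageLevel.MEMORY_AND_DISK)
        val spanish = tweets.filter(_.contains("\"lang\":\"es\""))  // crude string filter
        println("matches: " + spanish.count())
        sc.stop()
      }
    }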

Re: SQL Filter of tweets (JSON) running on disk

2014-07-04 Thread Abel Coronado Iruegas
… the use of a combination of resources (memory processing, disk processing) still remains. Thanks!! On Fri, Jul 4, 2014 at 9:49 AM, Abel Coronado Iruegas acoronadoirue...@gmail.com wrote: Hi everybody, can someone tell me whether it is possible to read and filter a 60 GB file of tweets (JSON docs …

Re: SQL Filter of tweets (JSON) running on disk

2014-07-04 Thread Abel Coronado Iruegas
Thank you, Databricks rules! On Fri, Jul 4, 2014 at 1:58 PM, Michael Armbrust mich...@databricks.com wrote: sqlContext.jsonFile("data.json") Is this already available in the master branch? Yes, and it will be available in the soon-to-come 1.0.1 release. But the question about …
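
A sketch of how that API is typically used, with assumptions flagged: the table name, field names, and paths are illustrative, and registerAsTable is the 1.0.x method name (renamed registerTempTable in 1.1):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object JsonSqlFilter {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("JsonSqlFilter"))
        val sqlContext = new SQLContext(sc)
        // jsonFile infers a schema from the documents (one JSON object per line).
        val tweets = sqlContext.jsonFile("tweets.json")
        tweets.registerAsTable("tweets")  // registerTempTable from Spark 1.1 on
        val es = sqlContext.sql("SELECT text FROM tweets WHERE lang = 'es'")
        es.map(row => row(0)).saveAsTextFile("tweets_es")
      }
    }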