Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread 郭鹏飞
> On Oct 4, 2017, at 2:08 AM, Nicolas Paris wrote: > > Hi > > I wonder about the differences between accessing HIVE tables in two different ways: > - with jdbc access > - with sparkContext > > I would say that jdbc is better since it uses HIVE, which is based on > map-reduce / TEZ, and then works on
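A minimal sketch of the two access paths being compared, in Java (database, table and host names are placeholders; the JDBC route assumes a HiveServer2 endpoint):

    import java.util.Properties;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("hive-access")
        .enableHiveSupport()            // needs hive-site.xml on the classpath
        .getOrCreate();

    // Path 1: via the Spark session and the metastore -- Spark itself reads
    // the table files and executes the query.
    Dataset<Row> viaSpark = spark.sql("SELECT * FROM mydb.mytable");

    // Path 2: via JDBC to HiveServer2 -- Hive (MR/TEZ) executes the query and
    // Spark only receives the result set. Later messages in this digest
    // report trouble with this route (SPARK-21063).
    Dataset<Row> viaJdbc = spark.read().jdbc(
        "jdbc:hive2://hiveserver-host:10000/mydb", "mytable", new Properties());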

Unable to run Spark Jobs in yarn cluster mode

2017-10-10 Thread Debabrata Ghosh
Hi All, I am constantly hitting an error: "ApplicationMaster: SparkContext did not initialize after waiting for 100 ms" while running my Spark code in yarn cluster mode. Here is the command I am using: spark-submit --master yarn --deploy-mode cluster spark_code.py

Re: Need help

2017-10-10 Thread Ilya Karpov
I suggest reading «Hadoop Application Architectures» (O'Reilly) by Mark Grover, Ted Malaska and others. There you can find some answers to your questions. > On Oct 10, 2017, at 9:00 AM, Mahender Sarangam > wrote: > > Hi, > > I'm new to Spark and big data; we

best spark spatial lib?

2017-10-10 Thread Imran Rajjad
I need to have a location column inside my Dataframe so that I can do spatial queries and geometry operations. Are there any third-party packages that perform these kinds of operations? I have seen a few like GeoSpark and Magellan, but they don't support operations where spatial and logical operators

Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Hi, I'm trying to read a 60GB HDFS file using spark textFile("hdfs_file_path", minPartitions). How can I control the no. of tasks by increasing the split size? With the default split size of 250 MB, several tasks are created. But I would like to have a specific no. of tasks created while reading from
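A sketch of one way to get bigger splits from code, assuming textFile() goes through the standard Hadoop text input format (the 1 GB value and the app name are only illustrations):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("split-size"));
    // Raise the minimum split size to 1 GB (in bytes) so fewer, larger splits,
    // and therefore fewer tasks, are produced for the 60 GB file. textFile()
    // consults the SparkContext's Hadoop configuration when computing splits.
    sc.hadoopConfiguration().set(
        "mapreduce.input.fileinputformat.split.minsize",
        String.valueOf(1024L * 1024 * 1024));
    JavaRDD<String> lines = sc.textFile("hdfs_file_path");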

Re: Does Spark 2.2.0 support Dataset<List<Map<String,Object>>> ?

2017-10-10 Thread kant kodali
I have also tried these, and none of them actually compile. dataset.map(new MapFunction<String, Seq<Map<String, Object>>>() { @Override public Seq<Map<String, Object>> call(String input) throws Exception { List<Map<String, Object>> temp = new ArrayList<>(); temp.add(new
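For reference, one variant that does compile is to fall back to a Kryo encoder and return a plain java.util.List instead of a Scala Seq, since Spark ships no built-in encoder for List<Map<String, Object>> (a sketch only; the function body is a stub, and dataset is assumed to be the Dataset<String> from the message above):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;

    // Unchecked cast: Kryo needs a concrete class token for the element type.
    @SuppressWarnings("unchecked")
    Encoder<List<Map<String, Object>>> enc =
        Encoders.kryo((Class<List<Map<String, Object>>>) (Class<?>) List.class);

    Dataset<List<Map<String, Object>>> result = dataset.map(
        (MapFunction<String, List<Map<String, Object>>>) input -> {
            List<Map<String, Object>> temp = new ArrayList<>();
            temp.add(new HashMap<>());  // fill from the parsed input here
            return temp;
        }, enc);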

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
Write your own input format/datasource or split the file yourself beforehand (not recommended). > On 10. Oct 2017, at 09:14, Kanagha Kumar wrote: > > Hi, > > I'm trying to read a 60GB HDFS file using spark textFile("hdfs_file_path", > minPartitions). > > How can I

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread ayan guha
That is not correct, IMHO. If I am not wrong, Spark will still load the data in the executors, running some stats on the data itself to identify partitions. On Tue, Oct 10, 2017 at 9:23 PM, 郭鹏飞 wrote: > > > On Oct 4, 2017, at 2:08 AM, Nicolas Paris wrote: > > >

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
I have not tested this, but you should be able to pass any map-reduce-like conf to the underlying hadoop config. Essentially you should be able to control split behaviour as you can do in a map-reduce program (as Spark uses the same input formats). On Tue, Oct 10, 2017 at 10:21 PM, Jörn Franke

Spark-submit on a sample program gives Syntax Error

2017-10-10 Thread shekar
Hi My environment: Windows 10, Spark 1.6.1 built for Hadoop 2.6.0, Python 2.7, Java 1.8. Issue: Go to C:\Spark The command: bin\spark-submit --master local C:\Spark\examples\src\main\python\pi.py 10 gives: File "<stdin>", line 1 bin\spark-submit --master local
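A SyntaxError at File "<stdin>", line 1 usually means the command was typed at the Python >>> prompt rather than at a Windows command prompt: spark-submit is a shell command, not Python. From cmd.exe, something like this should work:

    cd C:\Spark
    bin\spark-submit --master local C:\Spark\examples\src\main\python\pi.py 10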

Re: best spark spatial lib?

2017-10-10 Thread Anastasios Zouzias
Hi, Which spatial operations do you require exactly? Also, I don't follow what you mean by combining logical operators. I have created a library that wraps Lucene's spatial functionality here: https://github.com/zouzias/spark-lucenerdd/wiki/Spatial-search You could give the library a try; it

Re: best spark spatial lib?

2017-10-10 Thread Jim Hughes
Hi all, GeoMesa integrates with Spark SQL and allows for queries like: select * from chicago where case_number = 1 and st_intersects(geom, st_makeBox2d(st_point(-77, 38), st_point(-76, 39))) GeoMesa does this by calling package-protected Spark methods to implement geospatial user defined

Re: best spark spatial lib?

2017-10-10 Thread Silvio Fiorito
There are a number of packages for geospatial analysis, depending on the features you need. Here are a few I know of and/or have used: Magellan: https://github.com/harsha2010/magellan MrGeo: https://github.com/ngageoint/mrgeo GeoMesa: http://www.geomesa.org/documentation/tutorials/spark.html

RE: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-10 Thread JG Perrin
Something along the lines of: Dataset<Row> df = spark.read().json(jsonDf); ? From: kant kodali [mailto:kanth...@gmail.com] Sent: Saturday, October 07, 2017 2:31 AM To: user @spark Subject: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0? I
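Filled in, a minimal sketch of that suggestion (Spark 2.2 can infer a schema straight from a Dataset of JSON strings; the sample record and column names are placeholders):

    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("json-cols").getOrCreate();
    Dataset<String> jsonDs = spark.createDataset(
        Arrays.asList("{\"name\":\"a\",\"age\":1}"), Encoders.STRING());
    Dataset<Row> df = spark.read().json(jsonDs);        // infer schema from the JSON
    Dataset<Row> selected = df.select("name", "age");   // keep only specific columns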

Re: Unable to run Spark Jobs in yarn cluster mode

2017-10-10 Thread Vadim Semenov
Try increasing the `spark.yarn.am.waitTime` parameter; it's set to 100ms by default, which might not be enough in certain cases. On Tue, Oct 10, 2017 at 7:02 AM, Debabrata Ghosh wrote: > Hi All, > I am constantly hitting an error: "ApplicationMaster: >
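Concretely, that would look something like the line below; the 300s value is only an illustration, and the property has to be passed at submit time rather than set inside the job, since the application master reads it before user code runs:

    spark-submit --master yarn --deploy-mode cluster \
      --conf spark.yarn.am.waitTime=300s \
      spark_code.py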

Re: best spark spatial lib?

2017-10-10 Thread Georg Heiler
What about something like GeoMesa? Anastasios Zouzias wrote on Tue., Oct 10, 2017 at 15:29: > Hi, > > Which spatial operations do you require exactly? Also, I don't follow what > you mean by combining logical operators? > > I have created a library that wraps Lucene's spatial

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Thanks for the inputs!! I passed in spark.mapred.max.split.size and spark.mapred.min.split.size set to the size I wanted to read. It didn't have any effect. I also tried passing in spark.dfs.block.size, with all the params set to the same value.

Re: EMR: Use extra mounted EBS volumes for spark.local.dir

2017-10-10 Thread Vadim Semenov
That's probably better directed to AWS support. On Sun, Oct 8, 2017 at 9:54 PM, Tushar Sudake wrote: > Hello everyone, > > I'm using 'r4.8xlarge' instances on EMR for my Spark Application. > To each node, I'm attaching one 512 GB EBS volume. > > By logging in into

Re: best spark spatial lib?

2017-10-10 Thread Ram Sriharsha
Why can't you do this in Magellan? Can you post a sample query that you are trying to run that combines spatial and logical operators? Maybe I am not understanding the issue properly. Ram On Tue, Oct 10, 2017 at 2:21 AM, Imran Rajjad wrote: > I need to have a location

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
Maybe you need to set the parameters for the mapreduce API and not the mapred API. I do not recall offhand how they differ, but the Hadoop web page should tell you ;-) > On 10. Oct 2017, at 17:53, Kanagha Kumar wrote: > > Thanks for the inputs!! > > I passed in

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
Have you seen this: https://stackoverflow.com/questions/42796561/set-hadoop-configuration-values-on-spark-submit-command-line ? Please try and let us know. On Wed, Oct 11, 2017 at 2:53 AM, Kanagha Kumar wrote: > Thanks for the inputs!! > > I passed in

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread weand
Is Hive from Spark via JDBC working for you? If it is, I would be interested in your setup :-) We can't get this working. See the bug here, especially my last comment: https://issues.apache.org/jira/browse/SPARK-21063 Regards Andreas -- Sent from:

RE: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread Walia, Reema
I am able to connect to Spark via JDBC - tested with Squirrel. I am referencing all the jars of the current Spark distribution under /usr/hdp/current/spark2-client/jars/* Thanks, Reema -Original Message- From: weand [mailto:andreas.we...@gmail.com] Sent: Tuesday, October 10, 2017 5:14 PM
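For comparison, a plain-JDBC smoke test in Java against a HiveServer2-compatible endpoint such as the Spark Thrift Server (host, port and credentials are placeholders; assumes the Hive JDBC driver and the jars above are on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class JdbcSmokeTest {
        public static void main(String[] args) throws SQLException {
            // The Spark Thrift Server speaks the HiveServer2 wire protocol,
            // so the hive2 JDBC URL scheme applies.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://thriftserver-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT 1")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1));
                }
            }
        }
    }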

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Thanks Ayan! Finally it worked!! Thanks a lot everyone for the inputs! Once I prefixed the params with "spark.hadoop", I can see the no. of tasks getting reduced. I'm setting the following params: --conf spark.hadoop.dfs.block.size --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize
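Spelled out with example values (1 GB in bytes here; adjust to the split size you actually want), the working form is along these lines:

    spark-submit \
      --conf spark.hadoop.dfs.block.size=1073741824 \
      --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=1073741824 \
      ...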

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread Gourav Sengupta
Hi, I do not think that SPARK will automatically determine the partitions; actually, it does not. In case a table has a few million records, it all goes through the driver. Of course, I have only tried JDBC connections with AURORA, Oracle and Postgres.
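For the record, the JDBC source reads in parallel only when the partitioning is spelled out explicitly; a sketch in Java (URL, table, column and bounds are placeholders):

    import java.util.Properties;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("jdbc-partitioned").getOrCreate();
    // Without partitionColumn/lowerBound/upperBound/numPartitions the whole
    // table is fetched through a single connection in a single task.
    Dataset<Row> df = spark.read().jdbc(
        "jdbc:postgresql://dbhost:5432/mydb", "mytable",
        "id",       // partitionColumn: a numeric column to split on
        1L,         // lowerBound
        1000000L,   // upperBound
        8,          // numPartitions
        new Properties());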

Need help

2017-10-10 Thread Mahender Sarangam
Hi, I'm new to Spark and big data; we are doing a POC and building our warehouse application using Spark. Can anyone share guidance on naming conventions for HDFS names, table names, UDFs and DB names? Any sample architecture diagram? -Mahens

Re: Unable to run Spark Jobs in yarn cluster mode

2017-10-10 Thread mailfordebu
Thanks Vadim! Sent from my iPhone > On 10-Oct-2017, at 11:09 PM, Vadim Semenov > wrote: > > Try increasing the `spark.yarn.am.waitTime` parameter; it's set to 100ms by default, > which might not be enough in certain cases. > >> On Tue, Oct 10, 2017 at 7:02 AM,