Re: pyspark split pair rdd to multiple

2016-04-20 Thread Gourav Sengupta
Is there any reason why you are not using data frames? Regards, Gourav On Tue, Apr 19, 2016 at 8:51 PM, pth001 wrote: > Hi, > > How can I split pair rdd [K, V] to map [K, Array(V)] efficiently in > Pyspark? > > Best, > Patcharee > > -

Re: pyspark split pair rdd to multiple

2016-04-20 Thread Gourav Sengupta
Hi, you do not need to do anything with the RDD at all. Just follow the instructions on this site https://github.com/databricks/spark-csv and everything will be super fast and smooth. Remember that in case the data is large, converting an RDD to dataframes takes a very, very long time.
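For reference, a minimal PySpark sketch of both routes discussed in this thread (values and column names are illustrative, not from the original post):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(appName="pair-rdd-group")
sqlContext = SQLContext(sc)

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])

# RDD route: groupByKey gives (key, iterable); materialise the values as a list
grouped = pairs.groupByKey().mapValues(list)
print(grouped.collect())                      # [('a', [1, 2]), ('b', [3])]

# DataFrame route: collect_list does the same aggregation
# (on Spark 1.x this aggregate may need a HiveContext rather than a SQLContext)
df = pairs.toDF(["k", "v"])
df.groupBy("k").agg(F.collect_list("v").alias("vs")).show()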

Re: Reading from Amazon S3

2016-04-28 Thread Gourav Sengupta
Why would you use JAVA (create a problem and then try to solve it)? Have you tried using Scala or Python or even R? Regards, Gourav On Thu, Apr 28, 2016 at 10:07 AM, Steve Loughran wrote: > > On 26 Apr 2016, at 18:49, Ted Yu wrote: > > Looking at the cause of the error, it seems hadoop-aws-xx.

Re: Spark on AWS

2016-05-02 Thread Gourav Sengupta
://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html In case you are trying to load enough data in the spark Master node for graphing or exploratory analysis using Matlab, seaborn or bokeh, it's better to increase the driver memory by recreating the spark context. Regards Gourav Sengupta
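A hedged sketch of bumping the driver memory by recreating the context (the size is illustrative; spark.driver.memory only takes effect if the driver JVM has not already been launched, otherwise it has to be passed on the launch command instead):

from pyspark import SparkConf, SparkContext

sc.stop()                                    # stop the running context first
conf = (SparkConf()
        .setAppName("exploration")
        .set("spark.driver.memory", "8g"))   # illustrative size
sc = SparkContext(conf=conf)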

Re: Reading from Amazon S3

2016-05-02 Thread Gourav Sengupta
JAVA does not easily parallelize, JAVA is verbose, uses different classes for serializing, and on top of that you are using RDD's instead of dataframes. Should a senior project not have an implied understanding that it should be technically superior? Why not use SCALA? Regards, Gourav On Mon, M

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
Hi, As always, can you please write down details regarding your SPARK cluster - the version, OS, IDE used, etc? Regards, Gourav Sengupta On Mon, May 2, 2016 at 5:58 PM, kpeng1 wrote: > Hi All, > > I am running into a weird result with Spark SQL Outer joins. The results > for all

Re: SparkSQL with large result size

2016-05-02 Thread Gourav Sengupta
Hi, I have worked on 300GB data by querying it from CSV (using SPARK CSV) and writing it to Parquet format and then querying parquet format to query it and partition the data and write out individual csv files without any issues on a single node SPARK cluster installation. Are you trying to cac

Re: Reading from Amazon S3

2016-05-02 Thread Gourav Sengupta
with the problem, because Spark supports Java. Java and Scala > run on the same underlying JVM. > > On 02 May 2016, at 17:42, Gourav Sengupta > wrote: > > JAVA does not easily parallelize, JAVA is verbose, uses different classes > for serializing, and on top of that

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
sult from spark shell > OS: Linux version 2.6.32-431.20.3.el6.x86_64 ( > mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat > 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014 > > Thanks, > > KP > > On Mon, May 2, 2016 at 11:05 AM, Gourav Sen

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
This shows that both the tables have matching records and no mismatches. Therefore obviously you have the same results irrespective of whether you use right or left join. I think that there is no problem here, unless I am missing something. Regards, Gourav On Mon, May 2, 2016 at 7:48 PM, kpeng1

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
; Subject: Re: Weird results with Spark SQL Outer joins >> To: gourav.sengu...@gmail.com >> CC: user@spark.apache.org >> >> >> Gourav, >> >> I wish that was case, but I have done a select count on each of the two >> tables individually and they return back

Re: Error from reading S3 in Scala

2016-05-03 Thread Gourav Sengupta
Hi, The best thing to do is start the EMR clusters with proper permissions in the roles; that way you do not need to worry about the keys at all. Another thing, why are we using s3a:// instead of s3:// ? Besides that you can increase s3 speeds using the instructions mentioned here: https://aws.ama

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Gourav Sengupta
Hi Kevin, Having given it a first look I do think that you have hit something here, and this does not look quite right. I have to work on the multiple AND conditions in ON and see whether that is causing any issues. Regards, Gourav Sengupta On Tue, May 3, 2016 at 8:28 AM, Kevin Peng wrote

Re: Spark-csv- partitionBy

2016-05-09 Thread Gourav Sengupta
Hi, it's supported; try coalesce(1) and after that do the partitions. Regards, Gourav On Mon, May 9, 2016 at 7:12 PM, Mail.com wrote: > Hi, > > I have to write a tab delimited file and need to have one directory for each > unique value of a column. > > I tried using
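A sketch of the suggestion, assuming the spark-csv package is on the classpath and df is an existing DataFrame; the column and path are hypothetical, and whether partitionBy is honoured depends on the Spark and spark-csv versions in use:

(df.coalesce(1)                              # one file per partition directory
   .write
   .format("com.databricks.spark.csv")
   .option("delimiter", "\t")
   .partitionBy("my_column")                 # hypothetical column
   .save("/output/tab_delimited"))           # illustrative path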

Re: Secondary Indexing?

2016-05-30 Thread Gourav Sengupta
Hi, have you tried using partitioning and the Parquet format? It works super fast in SPARK. Regards, Gourav On Mon, May 30, 2016 at 5:08 PM, Michael Segel wrote: > I’m not sure where to post this since it's a bit of a philosophical > question in terms of design and vision for spark. > > If we look
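A sketch of the partition-pruning pattern being suggested here (the paths and partition column are illustrative; sqlContext is assumed to exist as in the shell):

# Write once, partitioned by the column you would otherwise index on.
df.write.partitionBy("event_date").parquet("s3://bucket/events")

# Reads that filter on the partition column only touch the matching directories.
events = sqlContext.read.parquet("s3://bucket/events")
events.filter(events.event_date == "2016-05-30").count()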

Re: Accessing s3a files from Spark

2016-05-31 Thread Gourav Sengupta
Hi, Is your spark cluster running in EMR or via self created SPARK cluster using EC2 or from a local cluster behind firewall? What is the SPARK version you are using? Regards, Gourav Sengupta On Sun, May 29, 2016 at 10:55 PM, Mayuresh Kunjir wrote: > I'm running into permission issu

Re: Accessing s3a files from Spark

2016-05-31 Thread Gourav Sengupta
Hi, And on another note, is it required to use s3a? Why not use s3:// only? I prefer to use s3a:// only while writing files to S3 from EMR. Regards, Gourav Sengupta On Tue, May 31, 2016 at 12:04 PM, Gourav Sengupta wrote: > Hi, > > Is your spark cluster running in EMR or via sel

Re: Accessing s3a files from Spark

2016-06-01 Thread Gourav Sengupta
, Gourav Sengupta On Tue, May 31, 2016 at 12:22 PM, Mayuresh Kunjir wrote: > How do I use it? I'm accessing s3a from Spark's textFile API. > > On Tue, May 31, 2016 at 7:16 AM, Deepak Sharma > wrote: > >> Hi Mayuresh >> Instead of s3a , have you tried th

Re: ImportError: No module named numpy

2016-06-04 Thread Gourav Sengupta
including the following: PYSPARK_PYTHON=<>/anaconda2/bin/python2.7 PATH=$PATH:<>/anaconda/bin <>/pyspark :) In case you are using it in EMR the solution is a bit tricky. Just let me know in case you want any further help. Regards, Gourav Sengupta On Thu, Jun 2, 2016 at 7:59 PM

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-09 Thread Gourav Sengupta
Hi, are you using EC2 instances or local cluster behind firewall. Regards, Gourav Sengupta On Wed, Jun 8, 2016 at 4:34 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > > I'm trying to create a table on s3a but I keep hitting the following error: > >

HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
on (A.PK = B.FK) where B.FK is not null; This query takes 4.5 mins in SPARK Regards, Gourav Sengupta

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
t; > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 9 June 2016 at

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
: > ooc are the tables partitioned on a.pk and b.fk? Hive might be using > copartitioning in that case: it is one of hive's strengths. > > 2016-06-09 7:28 GMT-07:00 Gourav Sengupta : > >> Hi Mich, >> >> does not Hive use map-reduce? I thought it to be so. An

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gourav Sengupta
sql execution plan? My guess is about broadcast > join. > > > > On Jun 9, 2016, at 07:14, Gourav Sengupta > wrote: > > Hi, > > Query1 is almost 25x faster in HIVE than in SPARK. What is happening here > and is there a way we can optimize the queries in SPARK
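For what it's worth, a PySpark sketch of printing the plan and forcing the broadcast join being guessed at here (A and B stand in for the two tables joined on A.PK = B.FK and are assumed to already be DataFrames):

from pyspark.sql.functions import broadcast

joined = A.join(broadcast(B), A["PK"] == B["FK"]).where(B["FK"].isNotNull())
joined.explain(True)                         # prints the logical and physical plans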

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gourav Sengupta
will surely be excited to see if I am going wrong here and post the results of sql.describe(). Thanks a ton once again. Hi Ted, Is there anyway you can throw some light on this before I post this in a blog? Regards, Gourav Sengupta On Fri, Jun 10, 2016 at 7:22 PM, Gavin Yue wrote: >

Re: HIVE Query 25x faster than SPARK Query

2016-06-15 Thread Gourav Sengupta
sec for 1 gb of data whereas in Spark, it is taking 4 mins > of time. > On 6/9/2016 3:19 PM, Gavin Yue wrote: > > Could you print out the sql execution plan? My guess is about broadcast > join. > > > > On Jun 9, 2016, at 07:14, Gourav Sengupta < > gourav.sengu...@gmail.

Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Gourav Sengupta
use case. Spark in local mode will be way faster compared to SPARK running on HADOOP. I have a system with 64 GB RAM and an SSD, and its performance on a local SPARK cluster is way better. Did your join include the same number of columns and rows for the dimension table? Regards, Gourav Sengupta On

Re: Spark UI shows finished when job had an error

2016-06-17 Thread Gourav Sengupta
Hi, Can you please see the query plan (in case you are using a query)? There is a very high chance that the query was broken into multiple steps and only a subsequent step failed. Regards, Gourav Sengupta On Fri, Jun 17, 2016 at 2:49 PM, Sumona Routh wrote: > Hi there, > Our Spark j

Re: FullOuterJoin on Spark

2016-06-22 Thread Gourav Sengupta
+1 for the guidance from Nirvan. Also it would be better to repartition and store the data in parquet format in case you are planning to do the joins more than once or with other data sources. Parquet with SPARK works like a charm. Over S3 I have seen its performance being quite close to cached da

Re: Tools for Balancing Partitions by Size

2016-07-13 Thread Gourav Sengupta
thinking about data in logical partitions helps overcome most of the design problems mentioned above. You can either use repartition with shuffling or coalesce with shuffle turned off to manage loads. If you are using HIVE just let me know. Regards, Gourav Sengupta On Wed, Jul 13, 2016 at 5
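A small sketch of the two rebalancing options mentioned (partition counts are illustrative; df is an existing DataFrame):

balanced = df.repartition(200)               # shuffle into evenly sized partitions

# Reduce the partition count without a shuffle; the RDD API exposes the flag explicitly.
merged = df.rdd.coalesce(50, shuffle=False)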

Re: Role-based S3 access outside of EMR

2016-07-20 Thread Gourav Sengupta
But that would mean you would be accessing data over the internet, increasing data read latency and data transmission failures. Why are you not using EMR? Regards, Gourav On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson wrote: > Thanks, Andy. > > I am indeed often doing something similar, now -- copyi

Re: Role-based S3 access outside of EMR

2016-07-21 Thread Gourav Sengupta
bove, using EMRFS libs solved this problem: > > http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-fs.html > > > 2016-07-21 8:37 GMT+02:00 Gourav Sengupta : > > But that would mean you would be accessing data over internet increasing > > data read latency

Re: Programmatic use of UDFs from Java

2016-07-21 Thread Gourav Sengupta
JAVA seriously? On Thu, Jul 21, 2016 at 6:10 PM, Everett Anderson wrote: > Hi, > > In the Java Spark DataFrames API, you can create a UDF, register it, and > then access it by string name by using the convenience UDF classes in > org.apache.spark.sql.api.java >

Re: Reading multiple json files form nested folders for data frame

2016-07-21 Thread Gourav Sengupta
If you are using EMR, please try their latest release, there will be very few reasons left for using SPARK ever at all (particularly given that hiveContext rides a lot on HIVE) if you are using SQL. Just over regular csv data I have seen Hive on TEZ performance gains by 100x (query 64 million rows

Re: the spark job is so slow - almost frozen

2016-07-21 Thread Gourav Sengupta
Andrew, you have pretty much consolidated my entire experience, please give a presentation in a meetup on this, and send across the links :) Regards, Gourav On Wed, Jul 20, 2016 at 4:35 AM, Andrew Ehrlich wrote: > Try: > > - filtering down the data as soon as possible in the job, dropping col

Distributed Matrices - spark mllib

2016-07-22 Thread Gourav Sengupta
dd.map(lambda row: MatrixEntry(*row))) This gives me the number or rows and columns. But I am not able to extract the values and it always reports back the error: AttributeError: 'NoneType' object has no attribute 'setCallSite' Thanks and Regards, Gourav Sengupta
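A self-contained sketch of the distributed-matrix API in question (sample entries are made up; sc is a live SparkContext). mat.entries is itself an RDD of MatrixEntry, which is one way to get the values back out:

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

rows = sc.parallelize([(0, 0, 1.0), (1, 1, 2.0), (2, 1, 3.5)])
mat = CoordinateMatrix(rows.map(lambda row: MatrixEntry(*row)))
print(mat.numRows(), mat.numCols())

for e in mat.entries.collect():              # entries is an RDD of MatrixEntry
    print(e.i, e.j, e.value)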

Re: spark and plot data

2016-07-22 Thread Gourav Sengupta
The biggest stumbling block to using Zeppelin has been that we cannot download the notebooks, cannot export them and certainly cannot sync them back to Github, without mind numbing and sometimes irritating hacks. Have those issues been resolved? Regards, Gourav On Fri, Jul 22, 2016 at 2:22 PM,

Re: spark and plot data

2016-07-23 Thread Gourav Sengupta
Hi Taotao, that is the way it's usually used to visualize data from SPARK. But I do see that people transfer the data to a list to feed to matplotlib (as in the SPARK course currently running on edX). Please try using Blaze and Bokeh and you will be in a new world altogether. Regards, Gourav On Sat,
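A hedged sketch of the Bokeh route (df is a hypothetical Spark DataFrame with numeric columns x and y; the calls follow the classic bokeh.plotting API):

from bokeh.plotting import figure, output_file, show

pdf = df.limit(10000).toPandas()             # bring a manageable sample to the driver
p = figure(title="sample from Spark")
p.line(pdf["x"], pdf["y"])
output_file("plot.html")
show(p)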

Re: spark and plot data

2016-07-23 Thread Gourav Sengupta
ebooks floating around, Apache Toree seems the > most promising for portability since its based on jupyter > https://github.com/apache/incubator-toree > > On Fri, Jul 22, 2016 at 3:53 PM, Gourav Sengupta < > gourav.sengu...@gmail.com> wrote: > >> The biggest stumbling block to u

Re: spark and plot data

2016-07-23 Thread Gourav Sengupta
And we are all smiling: https://github.com/bokeh/bokeh-scala Something that helped me immensely, particularly the example. https://github.com/bokeh/bokeh-scala/issues/24 Please note that I use Toree as the Jupyter kernel. Regards, Gourav Sengupta On Sat, Jul 23, 2016 at 8:01 PM, Andrew

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Gourav Sengupta
K 2.0. Regards, Gourav Sengupta On Tue, Jul 26, 2016 at 11:50 AM, Ofir Manor wrote: > One additional point specific to Spark 2.0 - for the alpha Structured > Streaming API (only), the file sink only supports Parquet format (I'm sure > that limitation will be lifted in a future releas

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread Gourav Sengupta
And Pedro has made sense of a world running amok, scared, and in a drunken stupor. Regards, Gourav On Tue, Jul 26, 2016 at 2:01 PM, Pedro Rodriguez wrote: > I am not 100% sure as I haven't tried this out, but there is a huge difference > between the two. Both foreach and collect are actions regardless
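A short illustration of the difference being discussed (df is an existing DataFrame and do_something is a hypothetical per-row action):

def do_something(row):
    print(row)                               # hypothetical per-row action

# Runs on the executors; nothing is returned to the driver.
df.foreach(do_something)

# Pulls every row to the driver first, then iterates locally --
# fine for small results, a likely out-of-memory error for large ones.
for row in df.collect():
    do_something(row)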

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Gourav Sengupta
Gosh, whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed that is better than SPARK. Has anyone heard of KUDA? It's better than Parquet. But I think that someone might just start saying that KUDA has a difficult lineage as well. After all, dynastic rules dictate. Personal

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Gourav Sengupta
Sorry, in my email above I was referring to KUDU, and there it goes: how can KUDU be right if it is first mentioned in forums with the wrong spelling? It's got a difficult beginning where people were trying to figure out its name. Regards, Gourav Sengupta On Wed, Jul 27, 2016 at 8:15 AM, Gourav

Re: create external table from partitioned avro file

2016-07-28 Thread Gourav Sengupta
Why avro? Regards, Gourav Sengupta On Thu, Jul 28, 2016 at 8:15 AM, Yang Cao wrote: > Hi, > > I am using spark 1.6 and I hope to create a hive external table based on > one partitioned avro file. Currently, I don’t find any build-in api to do > this work. I tried the write.forma

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Gourav Sengupta
There is an option to join small files up. If you are unable to find it just let me know. Regards, Gourav On Thu, Jul 28, 2016 at 4:58 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > Hi Pedro > > Thanks for the explanation. I started watching your repo. In the short > term I think I

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Gourav Sengupta
Hi, The default write format in SPARK is parquet. And I have never faced any issues writing over a billion records in SPARK. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron? Regards, Gourav Sengupta On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna wrote

Re: Visualization of data analysed using spark

2016-07-30 Thread Gourav Sengupta
If you are using Python, please try using Bokeh and its related stack. Most of the people in this forum, including folks at Databricks, have not tried that stack from Anaconda; it's worth a try when you are visualizing data in a big data stack. Regards, Gourav On Sat, Jul 30, 2016 at 10:25 PM, Rerng

Re: Java Recipes for Spark

2016-08-01 Thread Gourav Sengupta
JAVA? AGAIN? I am getting into serious depression Regards, Gourav On Mon, Aug 1, 2016 at 9:03 PM, Marco Mistroni wrote: > Hi jg > +1 for link. I'd add ML and graph examples if u can > -1 for programmign language choice :)) > > > kr > > On 31 Jul 2016 9:13 pm, "Jean Georges Perrin" wro

Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Gourav Sengupta
other thing that is a bit confusing is that you have declared day as STRING but are treating it as a DATE in your select statement. Does that work? Regards, Gourav Sengupta On Wed, Aug 3, 2016 at 5:08 PM, Mehdi Meziane wrote: > Hi Mich, > > The data is stored as parquet. > The table defi

Re: hdfs persist rollbacks when spark job is killed

2016-08-07 Thread Gourav Sengupta
there is no move operation. I generally have a set of Data Quality checks after each job to ascertain whether everything went fine; the results are stored so that they can be published in a graph for monitoring, thus serving two purposes. Regards, Gourav Sengupta On Mon, Aug 8, 2016 at 7:41 AM

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Gourav Sengupta
job has completed successfully (without quitting). If the Data Quality checks fail within a certain threshold then the data is not deleted and only a warning is generated. If the failures exceed that threshold, then the data is deleted and a warning is raised. Regards, Gourav Sengupta On Mon, Aug

Re: Spark join and large temp files

2016-08-09 Thread Gourav Sengupta
In case of skewed data the joins will mess things up. Try to write a UDF with the lookup on a broadcast variable and then let me know the results. It should not take more than 40 mins on a 32 GB RAM system with 6-core processors. Gourav On Tue, Aug 9, 2016 at 6:02 PM, Ashic Mahtab wrote: > Hi Mi
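A sketch of the broadcast-lookup UDF being suggested for the skewed join (all table and column names are illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Broadcast the small side as a plain dict, then look values up inside a UDF,
# avoiding a shuffle on the skewed key.
lookup = sc.broadcast(dict(small_df.rdd.map(lambda r: (r["key"], r["value"])).collect()))

resolve = udf(lambda k: lookup.value.get(k), StringType())
result = big_df.withColumn("resolved", resolve(big_df["key"]))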

Re: Spark 2 cannot create ORC table when CLUSTERED. This worked in Spark 1.6.1

2016-08-11 Thread Gourav Sengupta
And SPARK even reads ORC data very slowly. And in case the HIVE table is partitioned, then it just hangs. Regards, Gourav On Thu, Aug 11, 2016 at 6:02 PM, Mich Talebzadeh wrote: > > > This does not work with CLUSTERED BY clause in Spark 2 now! > > CREATE TABLE test.dummy2 > ( > ID INT >

Re: Spark join and large temp files

2016-08-12 Thread Gourav Sengupta
. Did you try using UDF's on broadcast data? The solution is pretty much the same, except that instead of REDIS you use the broadcast variable and it scales wonderfully across several cluster of machines. Regards, Gourav Sengupta On Thu, Aug 11, 2016 at 11:23 PM, Ashic Mahtab wrote: >

Re: Spark join and large temp files

2016-08-12 Thread Gourav Sengupta
11, 2016, at 9:48 PM, Gourav Sengupta > wrote: > > Hi Ben, > > and that will take care of skewed data? > > Gourav > > On Thu, Aug 11, 2016 at 8:41 PM, Ben Teeuwen wrote: > >> When you read both ‘a’ and ‘b', can you try repartitioning both by column >

Re: Does Spark SQL support indexes?

2016-08-15 Thread Gourav Sengupta
The world has moved on from indexes, materialized views, and other single-processor, non-distributed-system algorithms. Nice that you are not asking questions regarding hierarchical file systems. Regards, Gourav On Sun, Aug 14, 2016 at 4:03 AM, Taotao.Li wrote: > > hi, guys, does Spark SQL supp

Re: Does Spark SQL support indexes?

2016-08-15 Thread Gourav Sengupta
On 15 August 2016 at 11:19, u...@moosheimer.com wrote: > >> So you mean HBase, Cassandra, Hana, Elasticsearch and so on do not use >> indexes? >> There might be some very interesting new concepts

hiveContext: storing lookup of partitions

2015-12-15 Thread Gourav Sengupta
Hi, I have a HIVE table with a few thousand partitions (based on date and time). It takes a long time to run the first time and is fast subsequently. Is there a way to store the cache of partition lookups so that every time I start a new SPARK instance (cannot keep my personal server

Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
of getting the split info. I suspect it might > be your cluster issue (or metadata store), unusually it won't take such > long time for splitting. > > On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> wrote: > >> Hi, >> >>

Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
Hi Jeff, sadly that does not resolve the issue. I am sure that the in-memory mapping to physical file locations can be saved and recovered in SPARK. Regards, Gourav Sengupta On Wed, Dec 16, 2015 at 12:13 PM, Jeff Zhang wrote: > oh, you are using S3. As I remember, S3 has performance is

HiveContext Self join not reading from cache

2015-12-16 Thread Gourav Sengupta
Hi, This is how the data can be created: 1. TableA : cached() 2. TableB : cached() 3. TableC: TableA inner join TableB cached() 4. TableC join TableC does not take the data from cache but starts reading the data for TableA and TableB from disk. Does this sound like a bug? The self join between

Re: HiveContext Self join not reading from cache

2015-12-17 Thread Gourav Sengupta
+- Sort [c#253 ASC], false, 0 > +- TungstenExchange hashpartitioning(c#253,200), None > +- InMemoryColumnarTableScan [c#253], InMemoryRelation > [b#246,c#253], true, 1, StorageLevel(true, true, false, true, 1), > Project [b#4,c#90], Some(d) > > Is the above what you

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
hi, I think that people have reported the same issue elsewhere, and this should be registered as a bug in SPARK https://forums.databricks.com/questions/2142/self-join-in-spark-sql.html Regards, Gourav On Thu, Dec 17, 2015 at 10:52 AM, Gourav Sengupta wrote: > Hi Ted, > > The self j

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
month#1289 IN (2015-11),hit_day#1290 IN (20)] Code Generation: true Regards, Gourav On Fri, Dec 18, 2015 at 8:55 AM, Gourav Sengupta wrote: > hi, > > I think that people have reported the same issue elsewhere, and this > should be registered as a bug in SPARK > > http

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
Hi, the attached DAG shows that for the same table (self join) SPARK is unnecessarily getting data from S3 for one side of the join where as its able to use cache for the other side. Regards, Gourav On Fri, Dec 18, 2015 at 10:29 AM, Gourav Sengupta wrote: > Hi, > > I have a table

Re: Stuck with DataFrame df.select("select * from table");

2015-12-27 Thread Gourav Sengupta
ing it as a table? I think we should be using hivecontext or sqlcontext to run queries on a registered table. Regards, Gourav Sengupta On Sat, Dec 26, 2015 at 6:27 PM, Eugene Morozov wrote: > Chris, thanks. That'd be great to try =) > > -- > Be well! > Jean Morozov > >

storing query object

2016-01-19 Thread Gourav Sengupta
, Gourav Sengupta

Fwd: storing query object

2016-01-22 Thread Gourav Sengupta
, Gourav Sengupta

Re: storing query object

2016-01-22 Thread Gourav Sengupta
browse/SPARK-8125 > > You can also look at parent issue. > > Which Spark release are you using ? > > > On Jan 22, 2016, at 1:08 AM, Gourav Sengupta > wrote: > > > > > > Hi, > > > > I have a SPARK table (created from hiveContext) with couple of hun

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Gourav Sengupta
Hi, are you creating RDD's out of the data? Regards, Gourav On Tue, Jan 26, 2016 at 12:45 PM, aecc wrote: > Sorry, I have not been able to solve the issue. I used speculation mode as > workaround to this. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.n

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Gourav Sengupta
Hi, Are you creating RDD's using textfile option? Can you please let me know the following: 1. Number of partitions 2. Number of files 3. Time taken to create the RDD's Regards, Gourav Sengupta On Tue, Jan 26, 2016 at 1:12 PM, Gourav Sengupta wrote: > Hi, > > are you cr
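A quick way to check the partition count and the read time (the path is illustrative; count() forces the read because textFile itself is lazy):

import time

rdd = sc.textFile("s3n://bucket/prefix/")    # illustrative path
print(rdd.getNumPartitions())                # number of partitions

t0 = time.time()
rdd.count()                                  # forces the data to actually be read
print(time.time() - t0)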

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-27 Thread Gourav Sengupta
able, >>> AvroKeyInputFormat[T]](s"s3n://path-to-avro-file") >>> >>> Because of dependency issues, I had to use an older version of Spark, >>> and the job was hanging while reading from S3, but right now I upgraded to >>> spark 1.5.2 and seems lik

Re: Is there a way to save csv file fast ?

2016-02-10 Thread Gourav Sengupta
Hi, The writes, in terms of number of records written simultaneously, can be increased if you increased the number of partitions. You can try to increase the number of partitions and check out how it works. There is though an upper cap (the one that I faced in Ubuntu) on the number of parallel wri
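A sketch of the suggestion, assuming the spark-csv package is available (the partition count and path are illustrative):

(df.repartition(64)                          # more partitions => more files written in parallel
   .write
   .format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/output/csv"))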

Using SPARK packages in Spark Cluster

2016-02-12 Thread Gourav Sengupta
fine. I will be grateful if someone could kindly let me know how to load packages when starting a cluster as mentioned above. Regards, Gourav Sengupta

Re: Using SPARK packages in Spark Cluster

2016-02-13 Thread Gourav Sengupta
oing to use them. > > Best, > Burak > > > > On Fri, Feb 12, 2016 at 4:22 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> wrote: > >> Hi, >> >> I am creating sparkcontext in a SPARK standalone cluster as mentioned >> here: ht

Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
Hi, So far no one has understood my question at all. I know what it takes to load packages via SPARK shell or SPARK submit. How do I load packages when starting a SPARK cluster, as mentioned here http://spark.apache.org/docs/latest/spark-standalone.html ? Regards, Gourav Sengupta On Mon

Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
Hi, How do we include the following package: https://github.com/databricks/spark-csv while starting a SPARK standalone cluster as mentioned here: http://spark.apache.org/docs/latest/spark-standalone.html Thanks and Regards, Gourav Sengupta On Mon, Feb 15, 2016 at 10:32 AM, Ramanathan R wrote
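For what it's worth, packages are resolved per application rather than at cluster start-up; a hedged sketch of pulling spark-csv into a programmatically created PySpark context (the master URL is illustrative, and the same coordinates can instead go into conf/spark-defaults.conf under spark.jars.packages):

import os
# Equivalent to `pyspark --packages ...`; must be set before the JVM gateway starts.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master spark://master-host:7077 "
    "--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell")

from pyspark import SparkContext
sc = SparkContext(appName="csv-on-standalone")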

Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0 > > > > It will download everything for you and register into your JVM. If you > want to use it in your Prod just package it with maven. > > On 15/02/2016, at 12:14, Gourav Sengupta > wrote: > > H

Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
ter in local mode kindly do not attempt to answer this question. My question is how to use packages like https://github.com/databricks/spark-csv when I am using a SPARK cluster in local mode. Regards, Gourav Sengupta <http://spark.apache.org/docs/latest/spark-standalone.html> On Mon, Feb 15, 201

Re: Stored proc with spark

2016-02-16 Thread Gourav Sengupta
Hi Gaurav, do you mean stored proc that returns a table? Regards, Gourav On Tue, Feb 16, 2016 at 9:04 AM, Gaurav Agarwal wrote: > Hi > Can I load the data into spark from oracle storedproc > > Thanks >

Re: Scala from Jupyter

2016-02-16 Thread Gourav Sengupta
Apache Zeppelin will be the right solution with in built plugins for python and visualizations as well. Are you planning to use this in EMR? Regards, Gourav On Tue, Feb 16, 2016 at 12:04 PM, Rajeev Reddy wrote: > Hello, > > Let me understand your query correctly. > > Case 1. You have a jupyte

Re: Scala from Jupyter

2016-02-16 Thread Gourav Sengupta
take a look here as well http://zeppelin-project.org/ it executes Scala and Python and Markup document in the same notebook and draws beautiful visualisations as well. It comes built in AWS EMR as well. Regards, Gourav On Tue, Feb 16, 2016 at 12:43 PM, Aleksandr Modestov < aleksandrmodes...@gmai

Re: Reading CSV file using pyspark

2016-02-18 Thread Gourav Sengupta
as there are some write issues which 2.11 resolves. Hopefully you are using the latest release of SPARK. $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0 Regards, Gourav Sengupta On Thu, Feb 18, 2016 at 11:05 AM, Teng Qiu wrote: > download a right version of this
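A minimal read sketch with that package (the path is illustrative; assumes the shell was started with the --packages flag shown above and that sqlContext exists):

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/path/to/file.csv"))
df.printSchema()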

Re: Is this likely to cause any problems?

2016-02-18 Thread Gourav Sengupta
Hi, Just out of sheer curiosity, why are you not using EMR to start your SPARK cluster? Regards, Gourav On Thu, Feb 18, 2016 at 12:23 PM, Ted Yu wrote: > Have you seen this ? > > HADOOP-10988 > > Cheers > > On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton wrote: >> HI, >> >> I am seeing war

Re: Is this likely to cause any problems?

2016-02-18 Thread Gourav Sengupta
interesting. And I am almost sure that none of the EMR-hosted services (HADOOP, SPARK, Zeppelin, etc.) are exposed to external IP addresses even if you are using the classical setting. Regards, Gourav Sengupta On Thu, Feb 18, 2016 at 2:25 PM, Teng Qiu wrote: > EMR is great, but I'm c

Re: Is this likely to cause any problems?

2016-02-18 Thread Gourav Sengupta
, Gourav Sengupta On Thu, Feb 18, 2016 at 2:30 PM, Ted Yu wrote: > Please see the last 3 posts on this thread: > > http://search-hadoop.com/m/q3RTtTorTf2o3UGK1&subj=Re+spark+ec2+vs+EMR > > FYI > > On Thu, Feb 18, 2016 at 6:25 AM, Teng Qiu wrote: > >> EMR is gr

Re: Why no computations run on workers/slaves in cluster mode?

2016-02-18 Thread Gourav Sengupta
em and not other then the workers will only run from that system. Regards, Gourav Sengupta On Wed, Feb 17, 2016 at 4:20 PM, Junjie Qian wrote: > Hi all, > > I am new to Spark, and have one problem that, no computations run on > workers/slave_servers in the standalone cluster mode. &g

Re: Accessing Web UI

2016-02-19 Thread Gourav Sengupta
can you please try localhost:8080? Regards, Gourav Sengupta On Fri, Feb 19, 2016 at 11:18 AM, vasbhat wrote: > Hi, > >I have installed the spark1.6 and trying to start the master > (start-master.sh) and access the webUI. > > I get the following logs on running t

Re: Spark Job Hanging on Join

2016-02-21 Thread Gourav Sengupta
know. From what I reckon joins like yours should not take more than a few milliseconds. Regards, Gourav Sengupta On Fri, Feb 19, 2016 at 5:31 PM, Tamara Mendt wrote: > Hi all, > > I am running a Spark job that gets stuck attempting to join two > dataframes. The dataframes are no

Re: Spark Job Hanging on Join

2016-02-21 Thread Gourav Sengupta
Sorry, please include the following questions to the list above: the SPARK version? whether you are using RDD or DataFrames? is the code run locally or in SPARK Cluster mode or in AWS EMR? Regards, Gourav Sengupta On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta wrote: > Hi Tamara, >

Re: Accessing Web UI

2016-02-23 Thread Gourav Sengupta
> >>>> On Mon, Feb 22, 2016 at 8:23 AM, Vasanth Bhat >>>> wrote: >>>> >>>>> Thanks Gourav, Eduardo >>>>> >>>>> I tried http://localhost:8080 and http://OAhtvJ5MCA:8080/ >>>>> <http://oah

Re: pandas dataframe to spark csv

2016-02-23 Thread Gourav Sengupta
Hi, The solution is here: https://github.com/databricks/spark-csv Using the above solution you can read CSV directly into a dataframe as well. Regards, Gourav On Tue, Feb 23, 2016 at 12:03 PM, Devesh Raj Singh wrote: > Hi, > > I have imported spark csv dataframe in python and read the spark
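A small sketch of going from pandas to a Spark DataFrame and out to CSV with that package (the data and output path are made up):

import pandas as pd

pdf = pd.DataFrame({"city": ["Munich", "Palo Alto"], "temp": [3.14, 22.33]})
sdf = sqlContext.createDataFrame(pdf)        # pandas -> Spark DataFrame

(sdf.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("/tmp/cities_csv"))                # illustrative output path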

Re: Spark standalone peer2peer network

2016-02-23 Thread Gourav Sengupta
se the file path that you mention exists or is available only in one system. Regards, Gourav Sengupta On Tue, Feb 23, 2016 at 8:39 PM, Robineast wrote: > Hi Thomas > > I can confirm that I have had this working in the past. I'm pretty sure you > don't need password-less SSH

Starting SPARK application in cluster mode from an IDE

2016-02-26 Thread Gourav Sengupta
-
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://systemhostname:7077")
        .setAppName("test")
        .set("spark.executor.memory", "1g")
        .set("spark.executor.cores", "2"))
conf.getAll()
sc = SparkContext(conf=conf)

Further description and links to this issue are mentioned here:
http://stackoverflow.com/questions/33222045/classnotfoundexception-anonfun-when-deploy-scala-code-to-spark
Thanks and Regards, Gourav Sengupta

Re: s3 access through proxy

2016-02-26 Thread Gourav Sengupta
the files in a s3://bucket/ or s3://bucket/key/ to your local system. And then you can point your spark cluster to the local data store and run the queries.Of course that depends on the data volume as well. Regards, Gourav Sengupta On Fri, Feb 26, 2016 at 7:29 PM, Joshua Buss wrote: > H

SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
Row("San Francisco", 12, 44.52, true),
Row("Palo Alto", 12, 22.33, false),
Row("Munich", 8, 3.14, true)))
val hiveContext = new HiveContext(sc)
//val sqlContext = new org.apache.spark.sql.SQLContext(sc)
} }
-
Regards, Gourav Sengupta

Fwd: Starting SPARK application in cluster mode from an IDE

2016-03-01 Thread Gourav Sengupta
Hi, I will be grateful if someone could kindly respond back to this query. Thanks and Regards, Gourav Sengupta -- Forwarded message -- From: Gourav Sengupta Date: Sat, Feb 27, 2016 at 12:39 AM Subject: Starting SPARK application in cluster mode from an IDE To: user Hi, The

Re: SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
library folder than the ones which are usually supplied with the SPARK distribution: 1. ojdbc7.jar 2. spark-csv***jar file Regards, Gourav Sengupta On Tue, Mar 1, 2016 at 5:19 PM, Gourav Sengupta wrote: > Hi, > > I am getting the error "*java.lang.SecurityException: sealing violation

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Gourav Sengupta
. Regards, Gourav Sengupta On Tue, Mar 1, 2016 at 9:15 AM, Oleg Ruchovets wrote: > Hi , I am installed EMR 4.3.0 with spark. I tries to enter spark shell but > it looks it does't work and throws exceptions. > Please advice: > > [hadoop@ip-172-31-39-37 conf]$ cd /usr/bin/ &g
