Re: Creating Dataframe by querying Impala

2017-06-01 Thread Anubhav Agarwal
The issue seems to be with the primordial class loader. I cannot load the drivers onto all the nodes at the same location, but I have loaded the jars to HDFS. I have tried SPARK_YARN_DIST_FILES as well as SPARK_CLASSPATH on the edge node with no luck. Is there another way to load these jars through the primord…
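One common way to make a JDBC driver visible to both the driver and executor JVMs on YARN, without installing it on every node, is to ship it with the job and put it on the extra classpath. A hedged sketch of the submission (jar names, HDFS path, and main class are hypothetical; this is a config fragment, not a verified fix for the primordial-classloader issue):

```shell
# Hypothetical paths and class names; adjust for your cluster.
spark-submit \
  --master yarn-cluster \
  --jars hdfs:///libs/ImpalaJDBC41.jar \
  --conf spark.driver.extraClassPath=ImpalaJDBC41.jar \
  --conf spark.executor.extraClassPath=ImpalaJDBC41.jar \
  --class com.example.MyApp myapp.jar
```

`--jars` distributes the jar into each container's working directory; the `extraClassPath` entries then name it so it is on the JVM classpath at startup, which matters for `java.sql.DriverManager`-loaded drivers.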

Re: removing columns from file

2017-04-28 Thread Anubhav Agarwal
Are you using Spark's textFile method? If so, go through this blog: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 Anubhav On Mon, Apr 24, 2017 at 12:48 PM, Afshin, Bardia <bardia.afs...@capitalone.com> wrote: > Hi there, I have a process that downloads t…
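Independent of how the lines are read, the per-line transformation that drops a column can be sketched in plain Java (class and method names are hypothetical); the same function could be passed to an RDD's map:

```java
import java.util.ArrayList;
import java.util.List;

public class DropColumn {
    // Remove the field at `index` from a comma-separated line.
    public static String drop(String line, int index) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (i != index) {
                kept.add(fields[i]);
            }
        }
        return String.join(",", kept);
    }

    public static void main(String[] args) {
        System.out.println(drop("id,name,ssn,email", 2)); // prints "id,name,email"
    }
}
```

This handles only naive CSV (no quoted fields containing commas); for real CSV a proper parser is safer.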

Re: Best Savemode option to write Parquet file

2016-10-06 Thread Anubhav Agarwal
…r.marksuccessfuljobs", "false") Regards, Chanh. On Oct 6, 2016, at 10:32 PM, Anubhav Agarwal wrote: > Hi all, I have searched a bit before posting this query. Using Spark 1.6.1: Dataframe.write().format("parquet…

Best Savemode option to write Parquet file

2016-10-06 Thread Anubhav Agarwal
Hi all, I have searched a bit before posting this query. Using Spark 1.6.1: Dataframe.write().format("parquet").mode(SaveMode.Append).save("location") Note: the data in that folder can be deleted, and most of the time that folder doesn't even exist. Which SaveMode is the best, if necessary at all…
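The four modes differ only in how an existing output path is treated, which is the crux of the question. A hedged Spark 1.6 Java sketch (not runnable standalone; `df` and `path` come from the surrounding job):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;

public class ParquetWriter {
    // Hypothetical helper; `df` and `path` are supplied by the caller.
    static void write(DataFrame df, String path) {
        // SaveMode.Append        - add new files next to any existing data
        // SaveMode.Overwrite     - delete existing data at `path`, then write
        // SaveMode.ErrorIfExists - fail if `path` already exists (the default)
        // SaveMode.Ignore        - silently skip the write if `path` exists
        // If the directory is routinely deleted and usually absent, every mode
        // behaves identically on the first write; Overwrite makes reruns
        // idempotent, while Append risks duplicates on a rerun.
        df.write().format("parquet").mode(SaveMode.Overwrite).save(path);
    }
}
```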

SLF4J binding error while running Spark using YARN as Cluster Manager

2016-05-18 Thread Anubhav Agarwal
Hi, I am having log4j trouble while running Spark using YARN as the cluster manager on CDH 5.3.3. I get the following error: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/data/12/yarn/nm/filecache/34/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticL…
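The warning means SLF4J found a `StaticLoggerBinder` in more than one jar on the classpath, typically one in the Spark assembly and a second one bundled into the application jar. One common remedy, assuming a Maven build, is to stop shading a binding into the application jar so only the assembly's copy remains (a hedged config fragment, not a verified fix for this cluster):

```xml
<!-- Hypothetical pom.xml fragment: mark the binding as provided so it is
     not packaged into the application jar; the Spark assembly supplies it
     at runtime. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <scope>provided</scope>
</dependency>
```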

Getting out of memory error during coalesce

2016-02-17 Thread Anubhav Agarwal
Hi, we have a very small 12 MB file that we join with other data. The job runs fine and saves the data as a parquet file. But if we use coalesce(1) we get the following error: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAnd…
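A likely mechanism: coalesce avoids a shuffle, so coalesce(1) can narrow the *upstream* stages to a single task as well, forcing the whole join to run and buffer in one executor. repartition(1) inserts a shuffle boundary, so only the final write is single-task. A hedged Java sketch (not runnable standalone; `df` stands for the joined DataFrame):

```java
import org.apache.spark.sql.DataFrame;

public class SingleFileWrite {
    // Hypothetical sketch; `df` and `path` come from the surrounding job.
    static void save(DataFrame df, String path) {
        // coalesce(1) narrows upstream stages to one task as well, which can
        // concentrate the whole join in a single executor and run it out of
        // memory:
        // df.coalesce(1).write().parquet(path);

        // repartition(1) adds a shuffle: the join keeps its parallelism and
        // only the final write uses one task.
        df.repartition(1).write().parquet(path);
    }
}
```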

Re: Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-11-03 Thread Anubhav Agarwal
…On Tue, Nov 3, 2015 at 7:48 AM, Ted Yu wrote: > I am a bit curious: why is the synchronization on finalLock needed? Thanks. On Oct 23, 2015, at 8:25 AM, Anubhav Agarwal wrote: > I have a Spark job that creates 6 million rows in RDDs. I convert the RDD into…

Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-10-23 Thread Anubhav Agarwal
I have a Spark job that creates 6 million rows in RDDs. I convert the RDD into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write it to HDFS. Here is the snippet: RDDList.parallelStream().forEach(mapJavaRDD -> { if (mapJavaRDD != null) {…

Re: [jira] Ankit shared "SPARK-11213: Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4" with you

2015-10-22 Thread Anubhav Agarwal
Hi Ankit, here is my solution for this: 1) Download the latest Spark 1.5.1 (I just copied the following link from spark.apache.org; if it doesn't work then grab a new one from the website): wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz 2) Unzip the folder and rename/move t…
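The first two steps of the reply can be sketched as shell commands (the URL is from the original mail; the extraction and destination paths are examples, and the preview is truncated before the remaining steps):

```shell
# 1) Download the Spark 1.5.1 prebuilt package (mirror URL from the mail).
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz

# 2) Unpack it and rename/move the folder (destination is an example).
tar -xzf spark-1.5.1-bin-hadoop2.6.tgz
mv spark-1.5.1-bin-hadoop2.6 /opt/spark-1.5.1
```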

Application failed error

2015-08-11 Thread Anubhav Agarwal
I am running Spark 1.3 on the CDH 5.4 stack. I am getting the following error when I spark-submit my application: 15/08/11 16:03:49 INFO Remoting: Starting remoting 15/08/11 16:03:49 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkdri...@cdh54-22a4101a-14d7-4f06-b3f8-079c6f…

Re: NullPointException Help while using accumulators

2015-08-03 Thread Anubhav Agarwal
…d was that there were two addAccumulator() calls at the top of the stack trace, while in your code I don't see addAccumulator() calling itself. FYI. On Mon, Aug 3, 2015 at 3:22 PM, Anubhav Agarwal wrote: > The code was written in 1.4…

Re: NullPointException Help while using accumulators

2015-08-03 Thread Anubhav Agarwal
…Cheers. On Mon, Aug 3, 2015 at 3:13 PM, Anubhav Agarwal wrote: > Hi, I am trying to modify my code to use HDFS and multiple nodes. The code works fine when I run it locally on a single machine with a single worker. I have been trying t…

NullPointException Help while using accumulators

2015-08-03 Thread Anubhav Agarwal
Hi, I am trying to modify my code to use HDFS and multiple nodes. The code works fine when I run it locally on a single machine with a single worker. I have been trying to modify it, and I get the following error. Any hint would be helpful. java.lang.NullPointerException at thomsonreuters.…
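One common cause of a NullPointerException with accumulators that appears only when moving from local mode to a cluster is referencing an accumulator through an object field that does not survive closure serialization (for example, a transient field of the enclosing class). A hedged Java sketch of the pattern and its fix (class and variable names are hypothetical; this requires a running Spark application):

```java
import org.apache.spark.Accumulator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class AccumulatorSketch {
    // A transient field is not serialized into the closure, so on remote
    // executors it is null even though local mode works fine.
    private transient Accumulator<Integer> badCounter;

    void count(JavaSparkContext sc, JavaRDD<String> lines) {
        // Capture the accumulator in a local variable instead: the local
        // reference is serialized with the closure and add() is safe on
        // executors.
        final Accumulator<Integer> counter = sc.accumulator(0);
        lines.foreach(line -> counter.add(1));
        // value() is only meaningful back on the driver.
        System.out.println("lines: " + counter.value());
    }
}
```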

Re: Spark-thriftserver Issue

2015-03-24 Thread Anubhav Agarwal
Zhan, specifying the port fixed the port issue. Is it possible to specify the log directory while starting the Spark Thrift server? I am still getting this error even though the folder exists and everyone has permission to use that directory. drwxr-xr-x 2 root root 4096 Mar 24 19:04 spark-…
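One way to steer the Thrift server's logs, assuming the stock launch scripts, is to set the log directory in the environment before starting it (a hedged config fragment; the paths and port are examples, not a verified fix for this error):

```shell
# Hypothetical: point Spark's daemon logs at a writable directory before
# launching the Thrift server.
export SPARK_LOG_DIR=/var/log/spark
./sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10001
```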