Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Alexander Pivovarov
Found 0 matching posts for *ORC v/s Parquet for Spark 2.0* in Apache Spark User List http://apache-spark-user-list.1001560.n3.nabble.com/ Anyone have a link to this discussion? Want to share it with my colleagues. On Thu, Jul 28, 2016 at

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Alexander Pivovarov
Try using the Hadoop setting mapreduce.input.fileinputformat.split.maxsize to control RDD partition size. I heard that a DataFrame can read several files in 1 task. On Thu, May 19, 2016 at 8:50 PM, 王晓龙/0515 wrote: > I’m using a spark streaming program to store log message into
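Mirroring the suggestion above, a minimal sketch of tuning the split size at read time — assuming an existing SparkContext `sc`; the paths and the 256 MB value are illustrative:

```scala
// Raise the max input split size so the input format packs more data per
// split, giving fewer, larger RDD partitions when reading many small files.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (256L * 1024 * 1024).toString)  // 256 MB, illustrative

val logs = sc.textFile("hdfs:///logs/2016-05/")
// Optionally shrink further before writing merged output
logs.coalesce(8).saveAsTextFile("hdfs:///logs/2016-05-merged/")
```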

Re: Spark on AWS

2016-04-28 Thread Alexander Pivovarov
Fatima, the easiest way to create a Spark cluster on AWS is to create an EMR cluster and select the Spark application (the latest EMR includes Spark 1.6.1). Spark works well with S3 (read and write). However, it's recommended to set spark.speculation to true (it's expected that some tasks fail if you read
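A minimal spark-defaults sketch of the speculation setting mentioned above; the tuning knobs below the first line are optional and the values are assumptions, not recommendations from the thread:

```
spark.speculation            true
# optional tuning, illustrative values:
spark.speculation.quantile   0.90
spark.speculation.multiplier 1.5
```

With speculation on, Spark re-launches straggling copies of slow tasks, which masks the occasional failed/slow S3 read.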

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Alexander Pivovarov
e. > > On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov > <apivova...@gmail.com> wrote: > > AWS EMR includes Spark on Yarn > > Hortonworks and Cloudera platforms include Spark on Yarn as well > > > > > > On Thu, Apr 14, 2016 at 7:29 AM,

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Alexander Pivovarov
AWS EMR includes Spark on Yarn Hortonworks and Cloudera platforms include Spark on Yarn as well On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz wrote: > Hello, > > Is there any statistics regarding YARN vs Standalone Spark Usage in > production ? > > I would like to

Re: Spark and N-tier architecture

2016-03-29 Thread Alexander Pivovarov
n official part of spark or something else? > > From what I can find via a quick Google … this isn’t part of the core > spark distribution. > > On Mar 29, 2016, at 3:50 PM, Alexander Pivovarov <apivova...@gmail.com> > wrote: > > https://github.com/spark-jobserver/spark-jobserver > > > >

Re: Spark and N-tier architecture

2016-03-29 Thread Alexander Pivovarov
Spark-jobserver was originally created by Ooyala. Now it's an open-source, Apache-licensed project.

Re: Spark and N-tier architecture

2016-03-29 Thread Alexander Pivovarov
Spark is a distributed data processing engine plus a distributed in-memory / disk data cache. spark-jobserver provides a REST API to your Spark applications. It allows you to submit jobs to Spark and get results in sync or async mode. It can also create a long-running Spark context to cache RDDs in
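A hedged sketch of the spark-jobserver REST calls described above, based on its README of that era — the app name `myapp` and jar path are illustrative, and the default port 8090 is an assumption about your deployment:

```
# Upload the application jar under a name
curl --data-binary @target/myapp.jar localhost:8090/jars/myapp

# Run a job synchronously: the result comes back in the HTTP response
curl -d "input.string = a b c a" \
  "localhost:8090/jobs?appName=myapp&classPath=spark.jobserver.WordCountExample&sync=true"

# Or run it async, then poll for the result by job id
curl -X POST "localhost:8090/jobs?appName=myapp&classPath=spark.jobserver.WordCountExample"
curl "localhost:8090/jobs/<job-id>"
```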

Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
OK, start an EMR-4.3.0 or 4.2.0 cluster and look at how Spark on YARN is configured properly

Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
t; *Progress* > *Tracking UI* > *application_1459287061048_0001 myhost word count MAPREDUCE root.myhost > Tue, 29 Mar 2016 21:31:39 GMT Tue, 29 Mar 2016 21:31:59 GMT FINISHED > SUCCEEDED * > *History* > > On Wed, Mar 30, 2016 at 2:52 AM, Alexander Pivovarov <apivova...@gmail.com

Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
VCores Used VCores Pending VCores Reserved* > *0 1 0 0 0 0 0 0 B 0 B 0 B 0 0 0* > > Any Other trace? > > On Wed, Mar 30, 2016 at 2:31 AM, Alexander Pivovarov <apivova...@gmail.com > > wrote: > >> check 8088 ui >> - how many cores and memory available &g

Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
check 8088 ui - how many cores and memory available - how many slaves are active run teragen or pi from hadoop examples to make sure that yarn works On Tue, Mar 29, 2016 at 1:25 PM, Surendra , Manchikanti < surendra.manchika...@gmail.com> wrote: > Hi Vineeth, > > Can you please check

Re: Testing spark with AWS spot instances

2016-03-27 Thread Alexander Pivovarov
I use spot instances for a 100-slave cluster (r3.2xlarge in us-west-1). Jobs I run usually take about 15 hours - the cluster is stable and fast. 1-2 computers might be terminated, but it's a very rare event and Spark can handle it. On Fri, Mar 25, 2016 at 6:28 PM, Sven Krasser wrote:

Re: YARN process with Spark

2016-03-14 Thread Alexander Pivovarov
-ratio to a relatively larger value. https://www.mapr.com/blog/best-practices-yarn-resource-management On Mon, Mar 14, 2016 at 3:36 AM, Steve Loughran <ste...@hortonworks.com> wrote: > > On 11 Mar 2016, at 23:01, Alexander Pivovarov <apivova...@gmail.com> > wrote: > >

Re: Spark with Yarn Client

2016-03-11 Thread Alexander Pivovarov
Check doc - http://spark.apache.org/docs/latest/running-on-yarn.html also you can start EMR-4.2.0 or 4.3.0 cluster with Spark app and see how it's configured On Fri, Mar 11, 2016 at 7:50 PM, Divya Gehlot wrote: > Hi, > I am trying to understand behaviour /configuration

Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
; Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On

Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
Forgot to mention: to avoid unnecessary container termination, add the following setting to YARN: yarn.nodemanager.vmem-check-enabled = false
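As a yarn-site.xml fragment, the setting above looks like this (a sketch; apply it on the NodeManagers and restart them):

```xml
<!-- Disable the virtual-memory check that otherwise kills containers
     whose vmem (not physical memory) exceeds the allowed ratio. -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```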

Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
YARN cores are virtual cores which are used just to calculate available resources. But usually memory (not cores) is used to manage YARN resources. Spark executor memory should be ~90% of yarn.scheduler.maximum-allocation-mb (which should be the same as yarn.nodemanager.resource.memory-mb); ~10%
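A worked sketch of the 90/10 split described above. The node size is an assumption for illustration; the executor/overhead figures deliberately echo the KryoSerializer post later in this digest:

```
# Illustrative numbers for a node giving 53248 MB to YARN:
yarn.nodemanager.resource.memory-mb   = 53248
yarn.scheduler.maximum-allocation-mb  = 53248   # same as resource.memory-mb
spark.executor.memory                 = 47924M  # ~90% of the max allocation
spark.yarn.executor.memoryOverhead    = 5324    # the remaining ~10%
```

The two Spark values must sum to no more than the YARN maximum allocation, or the container request is rejected.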

Re: Is there Graph Partitioning impl for Scala/Spark?

2016-03-11 Thread Alexander Pivovarov
, Alexander Pivovarov <apivova...@gmail.com> wrote: > Is there Graph Partitioning impl (e.g. Spectral ) which can be used in > Spark? > I guess it should be at least java/scala lib > Maybe even tuned to work with GraphX >

Re: Graphx

2016-03-11 Thread Alexander Pivovarov
We use it in prod: 70 boxes, 61 GB RAM each. GraphX Connected Components works fine on 250M vertices and 1B edges (takes about 5-10 min). Spark likes memory, so use r3.2xlarge boxes (61 GB). For example 10 x r3.2xlarge (61 GB) work much faster than 20 x r3.xlarge (30.5 GB) (especially if you have

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Alexander Pivovarov
EMR-4.3.0 and Spark-1.6.0 work fine for me. I use r3.2xlarge boxes (spot) (even 3 slave boxes work fine). I use the following settings (in json): [ { "Classification": "spark-defaults", "Properties": { "spark.driver.extraJavaOptions": "-Dfile.encoding=UTF-8",

Re: Spark Integration Patterns

2016-02-29 Thread Alexander Pivovarov
There is spark-jobserver (SJS), which is a REST interface for Spark and Spark SQL. You can deploy your jar file with job implementations to spark-jobserver and use the REST API to submit jobs in sync or async mode. In async mode you need to poll SJS to get the job result. The job result might be actual data in json or a path

Re: DirectFileOutputCommiter

2016-02-26 Thread Alexander Pivovarov
DirectOutputCommitter doc says: The FileOutputCommitter is required for HDFS + speculation, which allows only one writer at a time for a file (so two people racing to write the same file would not work). However, S3 supports multiple writers outputting to the same file, where visibility is

Re: DirectFileOutputCommiter

2016-02-26 Thread Alexander Pivovarov
Amazon uses the following impl https://gist.github.com/apivovarov/bb215f08318318570567 But for some reason Spark show error at the end of the job 16/02/26 08:16:54 INFO scheduler.DAGScheduler: ResultStage 0 (saveAsTextFile at :28) finished in 14.305 s 16/02/26 08:16:54 INFO cluster.YarnScheduler:

Re: Difference between spark-shell and spark-submit.Which one to use when ?

2016-02-14 Thread Alexander Pivovarov
Consider streaming for real time cases http://zdatainc.com/2014/08/real-time-streaming-apache-spark-streaming/ On Sun, Feb 14, 2016 at 7:28 PM, Divya Gehlot wrote: > Hi, > I would like to know difference between spark-shell and spark-submit in > terms of real time

Re: AM creation in yarn-client mode

2016-02-09 Thread Alexander Pivovarov
the pictures to illustrate it http://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_running_spark_on_yarn.html On Tue, Feb 9, 2016 at 10:18 PM, Jonathan Kelly wrote: > In yarn-client mode, the driver is separate from the AM. The AM is created > in YARN,

Re: Is spark-ec2 going away?

2016-01-27 Thread Alexander Pivovarov
You can use EMR-4.3.0 running on spot instances to control the price. Yes, you can add/remove instances to the cluster on the fly (CORE instances support add only, TASK instances - add and remove). On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwan Chung wrote: > I noticed that in

save rdd with gzip compresson but without .gz extension?

2016-01-26 Thread Alexander Pivovarov
Question #1: When Spark saves an RDD using the Gzip codec it generates files with a .gz extension. Is it possible to ask Spark not to add the .gz extension to file names and keep file names like part-x? I want to compress existing text files to gzip and want to keep the original file names (and content)
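One hedged answer to this question: the extension comes from the codec itself, so a small subclass that reports an empty default extension keeps the plain part-file names. The codec class name and output path are illustrative:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// Hypothetical codec: gzip compression, but an empty default extension,
// so output files stay named part-00000, part-00001, ... yet are gzipped.
class GzipNoExtensionCodec extends GzipCodec {
  override def getDefaultExtension: String = ""
}

rdd.saveAsTextFile("hdfs:///out/compressed", classOf[GzipNoExtensionCodec])
```

Note that downstream readers then cannot infer the compression from the file name, so they must be told the codec explicitly.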

Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Alexander Pivovarov
Look at to_utc_timestamp from_utc_timestamp On Jan 18, 2016 9:39 AM, "Jerry Lam" wrote: > Hi spark users and developers, > > what do you do if you want the from_unixtime function in spark sql to > return the timezone you want instead of the system timezone? > > Best
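A short sketch of the two functions suggested above: to_utc_timestamp interprets a timestamp as being in the given zone and converts it to UTC, and from_utc_timestamp does the reverse. The epoch value and zone names are illustrative, and the inner 'UTC' step assumes the JVM default timezone is UTC:

```scala
sqlContext.sql(
  "SELECT from_utc_timestamp(to_utc_timestamp(from_unixtime(1453142400), 'UTC'), 'America/New_York')"
).show()
```

If a single target zone is not enough, a small UDF wrapping java.util.TimeZone is the usual way out, as the thread concludes.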

Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Alexander Pivovarov
take timezone > information? > > I think I will make a UDF if this is the only way out of the box. > > Thanks! > > Jerry > > On Mon, Jan 18, 2016 at 2:32 PM, Alexander Pivovarov <apivova...@gmail.com > > wrote: > >> Look at >> to_utc_timestamp

automatically unpersist RDDs which are not used for 24 hours?

2016-01-13 Thread Alexander Pivovarov
Is it possible to automatically unpersist RDDs which are not used for 24 hours?
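One option in Spark 1.x is the time-based cleaner; a sketch, with the caveat that it forcibly drops old metadata and persisted blocks, so long-lived jobs that still need those RDDs can break:

```
spark.cleaner.ttl   86400   # seconds; forget blocks/metadata older than 24h
```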

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Alexander Pivovarov
try coalesce(1, true). On Tue, Jan 5, 2016 at 11:58 AM, unk1102 wrote: > hi I am trying to save many partitions of Dataframe into one CSV file and > it > take forever for large data sets of around 5-6 GB. > > >
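To spell out the suggestion: coalesce(1) without shuffle collapses the whole lineage into a single task, while shuffle = true keeps the upstream stages parallel and only the final write runs on one task. A sketch, assuming a DataFrame `df` and an illustrative output path:

```scala
df.rdd
  .map(_.mkString(","))            // naive CSV rendering, illustrative
  .coalesce(1, shuffle = true)     // same as repartition(1)
  .saveAsTextFile("hdfs:///out/one-csv")
```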

Re: combining multiple JSON files to one DataFrame

2015-12-20 Thread Alexander Pivovarov
Just point loader to the folder. You do not need * On Dec 19, 2015 11:21 PM, "Eran Witkon" wrote: > Hi, > Can I combine multiple JSON files to one DataFrame? > > I tried > val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/*") > but I get an empty DF > Eran >
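A sketch of the suggestion, reusing the path from the question. One likely cause of the empty DataFrame is worth noting: the 1.x JSON reader expects JSON Lines, i.e. one complete JSON object per line, so pretty-printed multi-line files come back empty or corrupt:

```scala
// Point the reader at the directory itself; no trailing /* needed.
val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample")
df.printSchema()
```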

Re: How to do map join in Spark SQL

2015-12-20 Thread Alexander Pivovarov
esource allocation, etc. > > the goal being to minimize manual configuration and enable many diff types > of workloads to run efficiently on the same Spark cluster. > > On Dec 19, 2015, at 12:10 PM, Alexander Pivovarov <apivova...@gmail.com> > wrote: > > I collected smal

Re: How to do map join in Spark SQL

2015-12-19 Thread Alexander Pivovarov
do a map side join. This article > is a good start http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/ > > Thanks > Best Regards > > On Wed, Dec 16, 2015 at 2:51 AM, Alexander Pivovarov <apivova...@gmail.com > > wrote: > >> I have big folder having ORC files. Fil

Re: which aws instance type for shuffle performance

2015-12-18 Thread Alexander Pivovarov
Andrew, it's going to be 4 executor JVMs on each r3.8xlarge. Rastan, you can run a quick test using an EMR Spark cluster on spot instances and see what configuration works better. Without the tests it is all speculation. On Dec 18, 2015 1:53 PM, "Andrew Or" wrote: > Hi Rastan,

Re: Can't run spark on yarn

2015-12-17 Thread Alexander Pivovarov
Try starting an AWS EMR 4.2.0 cluster with the Hadoop and Spark applications on spot instances. Then look at how Hadoop and Spark are configured, and try to configure your Hadoop and Spark in a similar way. On Dec 17, 2015 6:09 PM, "Saisai Shao" wrote: > Please check the Yarn AM log to see why AM is

RE: How to submit spark job to YARN from scala code

2015-12-17 Thread Alexander Pivovarov
spark-submit --master yarn-cluster Look at the docs for more details. On Dec 17, 2015 5:00 PM, "Forest Fang" wrote: > Maybe I'm not understanding your question correctly but would it be > possible for you to piece up your job submission information as if you are > operating
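A fuller sketch of the submission command; the class name, jar, and resource numbers are illustrative, not from the thread:

```
spark-submit \
  --master yarn-cluster \
  --class com.example.MyJob \
  --num-executors 10 \
  --executor-memory 4g \
  myjob.jar arg1 arg2
```

In yarn-cluster mode the driver runs inside the YARN ApplicationMaster, so the submitting process can exit once the job is accepted.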

How to do map join in Spark SQL

2015-12-15 Thread Alexander Pivovarov
I have a big folder of ORC files. Files have a duration field (e.g. 3, 12, 26, etc). Also I have a small json file (just 8 rows) with range definitions (min, max, name):
0, 10, A
10, 20, B
20, 30, C
etc. Because I cannot do an equi-join between duration and range min/max I need to do a cross join and apply
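A sketch of the map-side alternative to the cross join: broadcast the 8-row ranges table and resolve each duration against it locally in every task. The case class, variable names, and fallback label are illustrative, and `durations` is assumed to be an RDD[Int] read from the ORC files:

```scala
case class DurationRange(min: Int, max: Int, name: String)

val ranges = sc.broadcast(Seq(
  DurationRange(0, 10, "A"),
  DurationRange(10, 20, "B"),
  DurationRange(20, 30, "C")))

// Linear scan is fine for 8 rows; each task reads the broadcast copy.
val labeled = durations.map { d =>
  val hit = ranges.value.find(r => d >= r.min && d < r.max)
  (d, hit.map(_.name).getOrElse("out-of-range"))
}
```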

Spark does not clean garbage in blockmgr folders on slaves if long running spark-shell is used

2015-12-12 Thread Alexander Pivovarov
Recently I faced an issue with Spark 1.5.2 standalone. Spark does not clean garbage in blockmgr folders on slaves until I exit from spark-shell. I opened spark-shell and run my spark program for several input folders. Then I noticed that Spark uses several GBs of disk space on all slaves in

Workflow manager for Spark and Spark SQL

2015-12-10 Thread Alexander Pivovarov
Hi Everyone I'm curious what people usually use to build ETL workflows based on DataFrames and Spark API? In Hadoop/Hive world people usually use Oozie. Is it different in Spark world?

Re: spark-ec2 vs. EMR

2015-12-02 Thread Alexander Pivovarov
at 9:44 AM, Dana Powers <dana.pow...@gmail.com> wrote: > EMR was a pain to configure on a private VPC last I tried. Has anyone had > success with that? I found spark-ec2 easier to use w private networking, > but also agree that I would use for prod. > > -Dana > On Dec 1, 2

Re: spark-ec2 vs. EMR

2015-12-01 Thread Alexander Pivovarov
1. EMR 4.2.0 has Zeppelin as an alternative to Databricks Notebooks
2. EMR has Ganglia 3.6.0
3. EMR has hadoop fs settings to make S3 work fast (direct.EmrFileSystem)
4. EMR has S3 keys in hadoop configs
5. EMR allows resizing the cluster on the fly
6. EMR has the AWS SDK in the Spark classpath. Helps to

Re: Spark Expand Cluster

2015-12-01 Thread Alexander Pivovarov
Try running spark-shell with the correct number of executors, e.g. for a 10-box cluster running on r3.2xlarge (61 GB RAM, 8 cores) you can use the following:
spark-shell \
  --num-executors 20 \
  --driver-memory 2g \
  --executor-memory 24g \
  --executor-cores 4
you might also want to set

Join and HashPartitioner question

2015-11-13 Thread Alexander Pivovarov
Hi Everyone, is there any difference in performance between the following two joins? val r1: RDD[(String, String)] = ??? val r2: RDD[(String, String)] = ??? val partNum = 80 val partitioner = new HashPartitioner(partNum) // Join 1 val res1 =
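Completing the two variants the question compares, as a sketch built on the `r1`/`r2` declarations above. The usual answer: passing the partitioner to join shuffles both inputs once per join, while pre-partitioning (and caching) pays the shuffle up front so later joins are co-partitioned:

```scala
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(80)

// Join 1: join is given the partitioner and shuffles both inputs by it.
val res1 = r1.join(r2, partitioner)

// Join 2: partition both sides first; the join itself then needs no
// further shuffle, which pays off if the partitioned RDDs are reused.
val p1 = r1.partitionBy(partitioner).cache()
val p2 = r2.partitionBy(partitioner).cache()
val res2 = p1.join(p2)
```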

Re: spark ec2 script doest not install necessary files to launch spark

2015-11-06 Thread Alexander Pivovarov
Try EMR-4.1.0; it has spark-1.5.0 running on yarn. Replace subnet-xxx with the correct one. $ aws emr create-cluster --name emr41_3 --release-label emr-4.1.0 --instance-groups InstanceCount=1,Name=sparkMaster,InstanceGroupType=MASTER,InstanceType=r3.2xlarge

Re: Generated ORC files cause NPE in Hive

2015-10-13 Thread Alexander Pivovarov
Daniel, Looks like we already have Jira for that error https://issues.apache.org/jira/browse/HIVE-11431 Could you put details on how to reproduce the issue to the ticket? Thank you Alex On Tue, Oct 13, 2015 at 11:14 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > We are

OutOfMemoryError OOM ByteArrayOutputStream.hugeCapacity

2015-10-12 Thread Alexander Pivovarov
I have one job which fails if I enable KryoSerializer. I use Spark 1.5.0 on emr-4.1.0. Settings:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 1024m
spark.executor.memory 47924M
spark.yarn.executor.memoryOverhead 5324
The

Re: How can I disable logging when running local[*]?

2015-10-06 Thread Alexander Pivovarov
The easiest way to control logging in spark shell is to run Logger.setLevel commands at the beginning of your program e.g. org.apache.log4j.Logger.getLogger("com.amazon").setLevel(org.apache.log4j.Level.WARN) org.apache.log4j.Logger.getLogger("com.amazonaws").setLevel(org.apache.log4j.Level.WARN)

Does YARN start new executor in place of the failed one?

2015-09-28 Thread Alexander Pivovarov
Hello Everyone, I use Spark on YARN on EMR-4. The Spark program which I run has several jobs/stages and runs for about 10 hours. During the execution some executors might fail for some reason, BUT I do not see that new executors are started in place of the failed ones. So, what I see in the Spark UI is

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Alexander Pivovarov
d to increase the memory size of executor through command >> arguments "--executor-memory", or configuration "spark.executor.memory". >> >> Also yarn.scheduler.maximum-allocation-mb in Yarn side if necessary. >> >> Thanks >> Saisai >> >> &

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Alexander Pivovarov
bled. > > -Sandy > > On Tue, Sep 8, 2015 at 10:48 PM, Alexander Pivovarov <apivova...@gmail.com > > wrote: > >> The problem which we have now is skew data (2360 tasks done in 5 min, 3 >> tasks in 40 min and 1 task in 2 hours) >> >> Some people from th

Re: Spark on Yarn vs Standalone

2015-09-08 Thread Alexander Pivovarov
than Yarn allows) On Tue, Sep 8, 2015 at 3:02 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote: > Those settings seem reasonable to me. > > Are you observing performance that's worse than you would expect? > > -Sandy > > On Mon, Sep 7, 2015 at 11:22 AM, Alexander Pivo

Re: Spark on Yarn vs Standalone

2015-09-07 Thread Alexander Pivovarov
ason that Spark > Standalone should provide performance or memory improvement over Spark on > YARN. > > -Sandy > > On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov <apivova...@gmail.com> > wrote: > >> Hi Everyone >> >> We are trying the latest aws emr-

Spark on Yarn vs Standalone

2015-09-04 Thread Alexander Pivovarov
Hi Everyone We are trying the latest aws emr-4.0.0 and Spark and my question is about YARN vs Standalone mode. Our usecase is - start 100-150 nodes cluster every week, - run one heavy spark job (5-6 hours) - save data to s3 - stop cluster Officially aws emr-4.0.0 comes with Spark on Yarn It's

spark-shell does not see conf folder content on emr-4

2015-09-03 Thread Alexander Pivovarov
Hi Everyone, my question is specific to running spark-1.4.1 on emr-4.0.0:
- spark installed to /usr/lib/spark
- conf folder linked to /etc/spark/conf
- spark-shell location: /usr/bin/spark-shell
I noticed that if I run spark-shell it does not read /etc/spark/conf folder files (e.g. spark-env.sh and

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-02 Thread Alexander Pivovarov
uster launch time with the following > Classification via EMR console: > > > classification=mapred-site,properties=[mapred.output.direct.EmrFileSystem=true,mapred.output.direct.NativeS3FileSystem=true] > > > Thank you > > On Wed, Sep 2, 2015 at 6:02 AM, Alexander Pivovarov <apivova.

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
Should I use DirectOutputCommitter? spark.hadoop.mapred.output.committer.class com.appsflyer.spark.DirectOutputCommitter On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov <apivova...@gmail.com> wrote: > I run spark 1.4.1 in amazom aws emr 4.0.0 > > For some reason spark

spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
I run Spark 1.4.1 on Amazon AWS EMR 4.0.0. For some reason Spark saveAsTextFile is very slow on emr-4.0.0 in comparison to emr-3.8 (was 5 sec, now 95 sec). Actually saveAsTextFile says that it's done in 4.356 sec, but after that I see lots of INFO messages with 404 errors from com.amazonaws.latency

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
I checked the previous EMR config (emr-3.8); mapred-site.xml has the following setting:
mapred.output.committer.class org.apache.hadoop.mapred.DirectFileOutputCommitter
On Tue, Sep 1, 2015 at 7:33 PM, Alexander Pivovarov <apivova...@gmail.com> wrote: > Should I use DirectOutput
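Tying the thread together: the emr-3.8 committer above can be set from the Spark side via the spark.hadoop.* prefix, as the earlier message in this thread does for the appsflyer variant. A spark-defaults sketch:

```
spark.hadoop.mapred.output.committer.class  org.apache.hadoop.mapred.DirectFileOutputCommitter
```

A direct committer writes straight to the destination instead of rename-committing, which avoids the slow S3 "rename" but is unsafe with speculation or task retries.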

Re: TimeoutException on start-slave spark 1.4.0

2015-08-28 Thread Alexander Pivovarov
': break On Thu, Aug 27, 2015 at 3:07 PM, Alexander Pivovarov apivova...@gmail.com wrote: I see the following error time to time when try to start slaves on spark 1.4.0 [hadoop@ip-10-0-27-240 apps]$ pwd /mnt/var/log/apps [hadoop@ip-10-0-27-240 apps]$ cat spark-hadoop

TimeoutException on start-slave spark 1.4.0

2015-08-27 Thread Alexander Pivovarov
I see the following error time to time when try to start slaves on spark 1.4.0 [hadoop@ip-10-0-27-240 apps]$ pwd /mnt/var/log/apps [hadoop@ip-10-0-27-240 apps]$ cat spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ip-10-0-27-240.ec2.internal.out Spark Command: /usr/java/latest/bin/java -cp

Reduce number of partitions before saving to file. coalesce or repartition?

2015-08-13 Thread Alexander Pivovarov
Hi Everyone, which one should work faster (coalesce or repartition) if I need to reduce the number of partitions from 5000 to 3 before saving the RDD with saveAsTextFile? Total data size is about 400 MB on disk in text format. Thank you
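The usual answer, as a sketch with an illustrative output path: repartition(n) is just coalesce(n, shuffle = true), i.e. a full shuffle of all 400 MB, while plain coalesce(3) merges parent partitions locally and is typically faster when only shrinking the partition count:

```scala
// Shrink without a shuffle; note the 3 write tasks each read many parents.
rdd.coalesce(3).saveAsTextFile("hdfs:///out/small")

// Equivalent to coalesce(3, shuffle = true); shuffles everything first.
rdd.repartition(3).saveAsTextFile("hdfs:///out/small-shuffled")
```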