Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Hyukjin Kwon
Hi Kevin,

I have a few questions on this.

Does this fail only with write.json()? I just wonder whether write.text,
csv or another API fails as well, or whether it is a JSON-specific issue.

Also, does it work with small data? I want to check whether this happens
only on large data.
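
For example, something along these lines could help narrow it down (just a
sketch; the paths and myDF are placeholders for your own data):

// Write a small sample in a couple of formats and see whether only JSON
// produces an empty output directory; also confirm the sample is non-empty.
val sample = myDF.limit(1000)
println(sample.count())
sample.write.mode("overwrite").json("hdfs:///tmp/debug_out/json")
sample.write.mode("overwrite").csv("hdfs:///tmp/debug_out/csv")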

Thanks!



2016-09-18 6:42 GMT+09:00 Kevin Burton :

> I'm seeing some weird behavior and wanted some feedback.
>
> I have a fairly large, multi-hour job that operates over about 5TB of data.
>
> It builds it out into a ranked category index of about 25000 categories
> sorted by rank, descending.
>
> I want to write this to a file but it's not actually writing any data.
>
> If I run myrdd.take(100) ... that works fine and prints data to a file.
>
> If I run
>
> myrdd.write.json(), it takes the same amount of time, and then writes a
> local output directory with a SUCCESS file but no actual partition data inside it.
> There's only the one small SUCCESS file.
>
> Any advice on how to debug this?
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


Re: Recovered state for updateStateByKey and incremental streams processing

2016-09-17 Thread manasdebashiskar
If you are using Spark 1.6 onwards, there is a better solution for you.
It is called mapWithState.

mapWithState takes a state function and an initial RDD.

1) When you start your program for the first time, or when the version changes and the new
code can't use the checkpoint, the initial RDD comes in handy.
2) For the remaining occasions (i.e. program restart after failure, or a
regular stop/start of the same version) the checkpoint works for you.

Also, mapWithState is easier to reason about than updateStateByKey and is
optimized to handle a larger amount of data for the same amount of memory.

I use the same mechanism in production to great success.
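
A minimal sketch of the shape (the key/value/state types, ssc and keyedDStream
below are placeholders, not your actual job):

import org.apache.spark.streaming.{State, StateSpec}

// Example state function: keep a running count per key.
// Checkpointing must still be enabled for mapWithState to work.
def trackState(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(newCount)
  (key, newCount)
}

// Seed the state for the very first run, or after an incompatible code change.
val initialRDD = ssc.sparkContext.parallelize(Seq(("someKey", 0L)))

val stateStream = keyedDStream.mapWithState(
  StateSpec.function(trackState _).initialState(initialRDD))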



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Recovered-state-for-updateStateByKey-and-incremental-streams-processing-tp9p27747.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark metrics when running with YARN?

2016-09-17 Thread Saisai Shao
Hi Vladimir,

I think you are mixing up the cluster manager and the Spark applications running on it:
the master and workers are components of the Standalone cluster manager, and
the YARN counterparts are the RM and NM. The URLs you listed above only work
for the Standalone master and workers.

It would become clearer if you tried running simple Spark applications
on both Standalone and YARN.
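
For example (just a sketch; the ResourceManager host is a placeholder, and 8088 is
its usual webapp port), on YARN the rough counterpart of polling the Standalone
master is asking the RM REST API for the running Spark applications:

import scala.io.Source

val rmHost = "resourcemanager-host"   // placeholder for your RM address
val apps = Source.fromURL(
  s"http://$rmHost:8088/ws/v1/cluster/apps?applicationTypes=SPARK&states=RUNNING").mkString
println(apps)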

On Fri, Sep 16, 2016 at 10:32 PM, Vladimir Tretyakov <
vladimir.tretya...@sematext.com> wrote:

> Hello.
>
> Found that there is also Spark metric Sink like MetricsServlet.
> which is enabled by default:
>
> https://apache.googlesource.com/spark/+/refs/heads/master/
> core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#40
>
> Tried urls:
>
> On master:
> http://localhost:8080/metrics/master/json/
> http://localhost:8080/metrics/applications/json
>
> On slaves (with workers):
> http://localhost:4040/metrics/json/
>
> got information I need.
>
> Questions:
> 1. Will URLs for master work in YARN (client/server mode) and Mesos modes?
> Or this is only for Standalone mode?
> 2. Will URL for slave also work for modes other than Standalone?
>
> Why are there 2 ways to get information, REST API and this Sink?
>
>
> Best regards, Vladimir.
>
>
>
>
>
>
> On Mon, Sep 12, 2016 at 3:53 PM, Vladimir Tretyakov <
> vladimir.tretya...@sematext.com> wrote:
>
>> Hello Saisai Shao, Jacek Laskowski , thx for information.
>>
>> We are working on Spark monitoring tool and our users have different
>> setup modes (Standalone, Mesos, YARN).
>>
>> Looked at code, found:
>>
>> /**
>>  * Attempt to start a Jetty server bound to the supplied hostName:port using the given
>>  * context handlers.
>>  *
>>  * If the desired port number is contended, continues incrementing ports until a free
>>  * port is found. Return the jetty Server object, the chosen port, and a mutable
>>  * collection of handlers.
>>  */
>>
>> It seems the most generic way (which will work for most users) will be to start
>> looking at ports:
>>
>> spark.ui.port (4040 by default)
>> spark.ui.port + 1
>> spark.ui.port + 2
>> spark.ui.port + 3
>> ...
>>
>> until we get responses from Spark.
>>
>> PS: yes, there may be some clashes with other applications in some
>> setups; in that case we may ask users about these exceptions and work
>> around them.
>>
>> Best regards, Vladimir.
>>
>> On Mon, Sep 12, 2016 at 12:07 PM, Saisai Shao 
>> wrote:
>>
>>> Here is the YARN RM REST API for you to refer to
>>> (http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html).
>>> You can use these APIs to query applications running on YARN.
>>>
>>> On Sun, Sep 11, 2016 at 11:25 PM, Jacek Laskowski 
>>> wrote:
>>>
 Hi Vladimir,

 You'd have to talk to your cluster manager to query for all the
 running Spark applications. I'm pretty sure YARN and Mesos can do that
 but unsure about Spark Standalone. This is certainly not something a
 Spark application's web UI could do for you, since it is designed to
 handle a single Spark application.

 Pozdrawiam,
 Jacek Laskowski
 
 https://medium.com/@jaceklaskowski/
 Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
 Follow me at https://twitter.com/jaceklaskowski


 On Sun, Sep 11, 2016 at 11:18 AM, Vladimir Tretyakov
  wrote:
 > Hello Jacek, thx a lot, it works.
 >
 > Is there a way to get the list of running applications from the REST API? Or do I
 > have to try connecting to ports 4040, 4041, ... 40xx and check whether they answer
 > something?
 >
 > Best regards, Vladimir.
 >
 > On Sat, Sep 10, 2016 at 6:00 AM, Jacek Laskowski 
 wrote:
 >>
 >> Hi,
 >>
 >> That's correct. One app one web UI. Open 4041 and you'll see the
 other
 >> app.
 >>
 >> Jacek
 >>
 >>
 >> On 9 Sep 2016 11:53 a.m., "Vladimir Tretyakov"
 >>  wrote:
 >>>
 >>> Hello again.
 >>>
 >>> I am trying to play with Spark version "2.11-2.0.0".
 >>>
 >>> The problem is that the REST API and the UI show me different things.
 >>>
 >>> I've started 2 applications from the "examples" set: opened 2 consoles and ran
 >>> the following command in each:
 >>>
 >>> ./bin/spark-submit   --class org.apache.spark.examples.SparkPi
  --master
 >>> spark://wawanawna:7077  --executor-memory 2G
 --total-executor-cores 30
 >>> examples/jars/spark-examples_2.11-2.0.0.jar  1
 >>>
 >>> Request to API endpoint:
 >>>
 >>> http://localhost:4040/api/v1/applications
 >>>
 >>> returned me following JSON:
 >>>
 >>> [ {
 >>>   "id" : "app-20160909184529-0016",
 >>>   "name" : "Spark Pi",
 >>>   "attempts" : [ {
 >>> "startTime" : "2016-09-09T15:45:25.047GMT",
 >>> "endTime" : 

NoSuchField Error : INSTANCE specify user defined httpclient jar

2016-09-17 Thread sagarcasual .
Hello,
I am using the Spark 1.6.1 distribution on a Cloudera CDH 5.7.0 cluster.
When I run my fat jar (Spark application jar) and it makes a call to
HttpClient, it gets the classic NoSuchFieldError: INSTANCE, which usually
happens when the httpclient on the classpath is older than the anticipated
httpclient jar version (should be 4.5.2).
Can someone help me work out how to overcome this? I tried multiple options:
1. Specifying --conf spark.files={path-to-httpclient-4.5.2.jar} and then
using --conf spark.driver.extraClassPath=./httpclient-4.5.2.jar and
--conf spark.executor.extraClassPath=./httpclient-4.5.2.jar
2. Specifying the jar with the --jars option
3. Specifying the jar as hdfs://master:8020/path/to/jar/httpclient.4.5.2.jar
In each case I got the error.
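
One more option I have not tried yet (sketching from the Spark 1.6 docs, so
please treat the flags below as something to verify for CDH rather than a
known fix) is to ship the jar and ask Spark to prefer it over the cluster's
classpath:

spark-submit \
  --jars /local/path/httpclient-4.5.2.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  ... rest of the usual arguments ...
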
I also would like to know, for
--conf spark.files={path-to-httpclient-4.5.2.jar},
whether path-to-httpclient-4.5.2.jar can point to a local path on the machine
from which I am issuing spark-submit.
Are there any other suggestions for how I should resolve this conflict?

Any clues?
-Regards
Sagar


Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Thanks Todd

As I thought, Apache Ignite is a data fabric, much like the Oracle Coherence
cache or Hazelcast.

The use cases for an in-memory database (IMDB) and a data fabric are
different. The build that I am dealing with has a 'database centric' view of
its data (i.e. it accesses its data using Spark SQL and JDBC), so an
in-memory database will be a better fit. On the other hand, if the
application deals solely with Java objects, does not have any notion of
a 'database', does not need SQL-style queries and really just wants a
distributed, high-performance object storage grid, then I think Ignite would
likely be the preferred choice.

So, if needed, I will likely go for an in-memory database like Alluxio. I have
seen a rather debatable comparison between Spark and Ignite that
looks to be a one-sided rant.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 September 2016 at 20:53, Mich Talebzadeh 
wrote:

> Thanks Todd.
>
> I will have a look.
>
> Regards
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 17 September 2016 at 20:45, Todd Nist  wrote:
>
>> Hi Mich,
>>
>> Have you looked at Apache Ignite?  https://apacheignite-fs.readme.io/docs.
>>
>>
>> This looks like something that may be what your looking for:
>>
>> http://apacheignite.gridgain.org/docs/data-analysis-with-apache-zeppelin
>>
>> HTH.
>>
>> -Todd
>>
>>
>> On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am seeing similar issues when I was working on Oracle with Tableau as
>>> the dashboard.
>>>
>>> Currently I have a batch layer that gets streaming data from
>>>
>>> source -> Kafka -> Flume -> HDFS
>>>
>>> It stored on HDFS as text files and a cron process sinks Hive table with
>>> the the external table build on the directory. I tried both ORC and Parquet
>>> but I don't think the query itself is the issue.
>>>
>>> Meaning it does not matter how clever your execution engine is, the fact
>>> you still have to do  considerable amount of Physical IO (PIO) as opposed
>>> to Logical IO (LIO) to get the data to Zeppelin is on the critical path.
>>>
>>> One option is to limit the amount of data in Zeppelin to certain number
>>> of rows or something similar. However, you cannot tell a user he/she cannot
>>> see the full data.
>>>
>>> We resolved this with Oracle by using Oracle TimesTen
>>> IMDB
>>> to cache certain tables in memory and get them refreshed (depending on
>>> refresh frequency) from the underlying table in Oracle when data is
>>> updated). That is done through cache fusion.
>>>
>>> I was looking around and came across Alluxio .
>>> Ideally I like to utilise such concept like TimesTen. Can one distribute
>>> Hive table data (or any table data) across the nodes cached. In that case
>>> we will be doing Logical IO which is about 20 times or more lightweight
>>> compared to Physical IO.
>>>
>>> Anyway this is the concept.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>


take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Kevin Burton
I'm seeing some weird behavior and wanted some feedback.

I have a fairly large, multi-hour job that operates over about 5TB of data.

It builds it out into a ranked category index of about 25000 categories
sorted by rank, descending.

I want to write this to a file but it's not actually writing any data.

If I run myrdd.take(100) ... that works fine and prints data to a file.

If I run

myrdd.write.json(), it takes the same amount of time, and then writes a
local output directory with a SUCCESS file but no actual partition data inside it.
There's only the one small SUCCESS file.

Any advice on how to debug this?

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



DataFrame defined within conditional IF ELSE statement

2016-09-17 Thread Mich Talebzadeh
In Spark 2 this gives me an error in a conditional IF ELSE statement.

I recall seeing the same in standard SQL.

I am doing a test where different sources (a text file, ORC or Parquet) are
read in depending on the value of the var option.

I wrote this

import org.apache.spark.sql.functions._
import java.util.Calendar
import org.joda.time._
var option = 1
val today = new DateTime()
val minutes = -15
val  minutesago =
today.plusMinutes(minutes).toString.toString.substring(11,19)
val date = java.time.LocalDate.now.toString
val hour = java.time.LocalTime.now.toString
case class columns(INDEX: Int, TIMECREATED: String, SECURITY: String,
PRICE: String)

if (option == 1) {
  println("option = 1")
  val df = spark.read.option("header", false).csv("hdfs://rhes564:9000/data/prices/prices.*")
  val df2 = df.map(p => columns(p(0).toString.toInt, p(1).toString, p(2).toString, p(3).toString))
  df2.printSchema
} else if (option == 2) {
  val df2 = spark.table("test.marketData").select('TIMECREATED, 'SECURITY, 'PRICE)
} else if (option == 3) {
  val df2 = spark.table("test.marketDataParquet").select('TIMECREATED, 'SECURITY, 'PRICE)
} else {
  println("no valid option provided")
  sys.exit(0)
}

With option 1 selected it goes through and shows this

option = 1
root
 |-- INDEX: integer (nullable = true)
 |-- TIMECREATED: string (nullable = true)
 |-- SECURITY: string (nullable = true)
 |-- PRICE: string (nullable = true)

But when I try to do df2.printSchema OUTSIDE of the IF ELSE block, it comes back
with an error:

scala> df2.printSchema
:31: error: not found: value df2
   df2.printSchema
   ^

I can define a stub df2 before the IF ELSE statement. Is that the best way of
dealing with it?
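
For reference, the alternative I can think of (just a sketch, reusing the case
class, var option and paths from above) is to make the whole IF ELSE an
expression and assign its value to df2, so the name stays in scope afterwards:

import org.apache.spark.sql.DataFrame

val df2: DataFrame = if (option == 1) {
  println("option = 1")
  val df = spark.read.option("header", false).csv("hdfs://rhes564:9000/data/prices/prices.*")
  df.map(p => columns(p(0).toString.toInt, p(1).toString, p(2).toString, p(3).toString)).toDF()
} else if (option == 2) {
  spark.table("test.marketData").select('TIMECREATED, 'SECURITY, 'PRICE)
} else if (option == 3) {
  spark.table("test.marketDataParquet").select('TIMECREATED, 'SECURITY, 'PRICE)
} else {
  println("no valid option provided")
  sys.exit(0)   // Nothing conforms to DataFrame, so this branch still type-checks
}

df2.printSchema   // df2 is now visible outside the conditional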

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Thanks Todd.

I will have a look.

Regards

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 September 2016 at 20:45, Todd Nist  wrote:

> Hi Mich,
>
> Have you looked at Apache Ignite?  https://apacheignite-fs.readme.io/docs.
>
>
> This looks like something that may be what your looking for:
>
> http://apacheignite.gridgain.org/docs/data-analysis-with-apache-zeppelin
>
> HTH.
>
> -Todd
>
>
> On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I am seeing similar issues when I was working on Oracle with Tableau as
>> the dashboard.
>>
>> Currently I have a batch layer that gets streaming data from
>>
>> source -> Kafka -> Flume -> HDFS
>>
>> It stored on HDFS as text files and a cron process sinks Hive table with
>> the the external table build on the directory. I tried both ORC and Parquet
>> but I don't think the query itself is the issue.
>>
>> Meaning it does not matter how clever your execution engine is, the fact
>> you still have to do  considerable amount of Physical IO (PIO) as opposed
>> to Logical IO (LIO) to get the data to Zeppelin is on the critical path.
>>
>> One option is to limit the amount of data in Zeppelin to certain number
>> of rows or something similar. However, you cannot tell a user he/she cannot
>> see the full data.
>>
>> We resolved this with Oracle by using Oracle TimesTen
>> IMDB
>> to cache certain tables in memory and get them refreshed (depending on
>> refresh frequency) from the underlying table in Oracle when data is
>> updated). That is done through cache fusion.
>>
>> I was looking around and came across Alluxio .
>> Ideally I like to utilise such concept like TimesTen. Can one distribute
>> Hive table data (or any table data) across the nodes cached. In that case
>> we will be doing Logical IO which is about 20 times or more lightweight
>> compared to Physical IO.
>>
>> Anyway this is the concept.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Todd Nist
Hi Mich,

Have you looked at Apache Ignite?  https://apacheignite-fs.readme.io/docs.

This looks like something that may be what you're looking for:

http://apacheignite.gridgain.org/docs/data-analysis-with-apache-zeppelin

HTH.

-Todd


On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh  wrote:

> Hi,
>
> I am seeing similar issues when I was working on Oracle with Tableau as
> the dashboard.
>
> Currently I have a batch layer that gets streaming data from
>
> source -> Kafka -> Flume -> HDFS
>
> It stored on HDFS as text files and a cron process sinks Hive table with
> the the external table build on the directory. I tried both ORC and Parquet
> but I don't think the query itself is the issue.
>
> Meaning it does not matter how clever your execution engine is, the fact
> you still have to do  considerable amount of Physical IO (PIO) as opposed
> to Logical IO (LIO) to get the data to Zeppelin is on the critical path.
>
> One option is to limit the amount of data in Zeppelin to certain number of
> rows or something similar. However, you cannot tell a user he/she cannot
> see the full data.
>
> We resolved this with Oracle by using Oracle TimesTen
> IMDB
> to cache certain tables in memory and get them refreshed (depending on
> refresh frequency) from the underlying table in Oracle when data is
> updated). That is done through cache fusion.
>
> I was looking around and came across Alluxio .
> Ideally I like to utilise such concept like TimesTen. Can one distribute
> Hive table data (or any table data) across the nodes cached. In that case
> we will be doing Logical IO which is about 20 times or more lightweight
> compared to Physical IO.
>
> Anyway this is the concept.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Hi,

I saw similar issues when I was working on Oracle with Tableau as the
dashboard.

Currently I have a batch layer that gets streaming data from

source -> Kafka -> Flume -> HDFS

It is stored on HDFS as text files, and a cron process sinks it into a Hive table,
with the external table built on the directory. I tried both ORC and Parquet,
but I don't think the query itself is the issue.

Meaning it does not matter how clever your execution engine is; the fact that
you still have to do a considerable amount of Physical IO (PIO), as opposed
to Logical IO (LIO), to get the data to Zeppelin is on the critical path.

One option is to limit the amount of data in Zeppelin to a certain number of
rows or something similar. However, you cannot tell a user that he/she cannot
see the full data.

We resolved this with Oracle by using the Oracle TimesTen
IMDB
to cache certain tables in memory and get them refreshed (depending on
refresh frequency) from the underlying table in Oracle when data is
updated. That is done through cache fusion.

I was looking around and came across Alluxio.
Ideally I would like to utilise a concept like TimesTen: can one distribute
Hive table data (or any table data) cached across the nodes? In that case
we would be doing Logical IO, which is about 20 times or more lighter weight
than Physical IO.
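
Within Spark itself the closest thing I can think of, purely as a sketch (the
table name is a placeholder), is pinning the table in executor memory so that
subsequent reads are LIO rather than PIO:

spark.sql("CACHE TABLE test.marketData")                      // pin the Hive table across the executors
spark.sql("SELECT * FROM test.marketData LIMIT 10").show()    // served from the in-memory copy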

Anyway this is the concept.

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread Mich Talebzadeh
Hi

CLOSE_WAIT!

According to this link:

- CLOSE_WAIT indicates that the server has received the first FIN signal
from the client and the connection is in the process of being closed. So
this essentially means that this is a state where the socket is waiting for the
application to execute close(). A socket can stay in the CLOSE_WAIT state
indefinitely until the application closes it. Faulty scenarios would be
things like a file descriptor leak, or the server not executing close() on the
socket, leading to a pile-up of CLOSE_WAIT sockets.
- The CLOSE_WAIT status means that the other side has initiated a
connection close, but the application on the local side has not yet closed
the socket.

Normally it should be LISTEN or ESTABLISHED.
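
As a quick sanity check (just a sketch), running the same command on the
HiveServer2 host itself, e.g. netstat -alnp | grep 10001 | grep LISTEN,
should show whether the server is actually listening on that port at all.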

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 September 2016 at 16:14,  wrote:

> Hi,
>
>
>
> Yes. I am able to connect to Hive from simple Java program running in the
> cluster. When using spark-submit I faced the issue.
>
> The output of command is given below
>
>
>
> $> netstat -alnp |grep 10001
>
> (Not all processes could be identified, non-owned process info
>
> will not be shown, you would have to be root to see it all.)
>
> tcp    1    0    53.244.194.223:25612    53.244.194.221:10001    CLOSE_WAIT    -
>
>
>
> Thanks
>
> Anupama
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Saturday, September 17, 2016 12:36 AM
> *To:* Gangadhar, Anupama (623)
> *Cc:* user @spark
> *Subject:* Re: Error trying to connect to Hive from Spark (Yarn-Cluster
> Mode)
>
>
>
> Is your Hive Thrift Server up and running on port
> jdbc:hive2://10001?
>
>
>
> Do  the following
>
>
>
>  netstat -alnp |grep 10001
>
> and see whether it is actually running
>
>
>
> HTH
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 16 September 2016 at 19:53,  wrote:
>
> Hi,
>
>
>
> I am trying to connect to Hive from Spark application in Kerborized
> cluster and get the following exception.  Spark version is 1.4.1 and Hive
> is 1.2.1. Outside of spark the connection goes through fine.
>
> Am I missing any configuration parameters?
>
>
>
> java.sql.SQLException: Could not open connection to
> jdbc:hive2://10001/default;principal=hive/ server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
>
>at org.apache.hive.jdbc.HiveConnection.openTransport(
> HiveConnection.java:206)
>
>at org.apache.hive.jdbc.HiveConnection.(
> HiveConnection.java:178)
>
>at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.
> java:105)
>
>at java.sql.DriverManager.getConnection(DriverManager.
> java:571)
>
>at java.sql.DriverManager.getConnection(DriverManager.
> java:215)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
>
>at org.apache.spark.api.java.JavaPairRDD$$anonfun$
> toScalaFunction$1.apply(JavaPairRDD.scala:1027)
>
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:
> 328)
>
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:
> 328)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.
> apply$mcV$sp(PairRDDFunctions.scala:1109)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.
> apply(PairRDDFunctions.scala:1108)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.
> apply(PairRDDFunctions.scala:1108)
>
>at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.
> scala:1285)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> 

RE: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread anupama . gangadhar
Hi,

Yes. I am able to connect to Hive from a simple Java program running in the
cluster. When using spark-submit I faced the issue.
The output of command is given below

$> netstat -alnp |grep 10001
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp    1    0    53.244.194.223:25612    53.244.194.221:10001    CLOSE_WAIT    -

Thanks
Anupama

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Saturday, September 17, 2016 12:36 AM
To: Gangadhar, Anupama (623)
Cc: user @spark
Subject: Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

Is your Hive Thrift Server up and running on port  jdbc:hive2://10001?

Do  the following

 netstat -alnp |grep 10001

and see whether it is actually running

HTH







Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 16 September 2016 at 19:53, 
> wrote:
Hi,

I am trying to connect to Hive from Spark application in Kerborized cluster and 
get the following exception.  Spark version is 1.4.1 and Hive is 1.2.1. Outside 
of spark the connection goes through fine.
Am I missing any configuration parameters?

java.sql.SQLException: Could not open connection to jdbc:hive2://10001/default;principal=hive/;ssl=false;transportMode=http;httpPath=cliservice: null
   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
   at 
org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:178)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:571)
   at java.sql.DriverManager.getConnection(DriverManager.java:215)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
   at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
   at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException
   at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
   at 
org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
   at 
org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
   at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258)
   at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
   

RE: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread anupama . gangadhar
Hi,
@Deepak
I have used a separate user keytab (not the Hadoop services keytab) and am able to
connect to Hive via a simple Java program.
I am able to connect to Hive from spark-shell as well. However, when I submit a
Spark job using this same keytab, I see the issue.
Does the cache have a role to play here? In the cluster, the transport mode is http and
SSL is disabled.
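
Would passing the user keytab straight to spark-submit make a difference in
yarn-cluster mode? Roughly like this (only a sketch; the paths, principal and
class are placeholders, and I have not verified these flags against 1.4.1):

spark-submit --master yarn-cluster \
  --principal myuser@MY.REALM.COM \
  --keytab /path/to/myuser.keytab \
  --files /etc/hive/conf/hive-site.xml \
  --class MySparkHiveJob myapp.jar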

Thanks

Anupama



From: Deepak Sharma [mailto:deepakmc...@gmail.com]
Sent: Saturday, September 17, 2016 8:35 AM
To: Gangadhar, Anupama (623)
Cc: spark users
Subject: Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)


Hi Anupama

To me it looks like an issue with the SPN with which you are trying to connect to
hive2, i.e. hive@hostname.

Are you able to connect to Hive from spark-shell?

Try getting the ticket using any other user keytab (but not the Hadoop services keytab)
and then try running the spark-submit.



Thanks

Deepak

On 17 Sep 2016 12:23 am, 
> wrote:
Hi,

I am trying to connect to Hive from Spark application in Kerborized cluster and 
get the following exception.  Spark version is 1.4.1 and Hive is 1.2.1. Outside 
of spark the connection goes through fine.
Am I missing any configuration parameters?

java.sql.SQLException: Could not open connection to jdbc:hive2://10001/default;principal=hive/;ssl=false;transportMode=http;httpPath=cliservice: null
   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
   at 
org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:178)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:571)
   at java.sql.DriverManager.getConnection(DriverManager.java:215)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
   at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
   at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.thrift.transport.TTransportException
   at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
   at 
org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
   at 
org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
   at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258)
   at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
   at java.security.AccessController.doPrivileged(Native Method)
   at 
javax.security.auth.Subject.doAs(Subject.java:415)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
   ... 21 more

In spark conf directory hive-site.xml has the following properties



 

Re: Can not control bucket files number if it was speficed

2016-09-17 Thread Mich Talebzadeh
Ok

You have an external table in Hive  on S3 with partition and bucket. say

..
PARTITIONED BY (year int, month string)
CLUSTERED BY (prod_id) INTO 256 BUCKETS
STORED AS ORC.

With that you have, within each partition, buckets on prod_id equally spread across 256
hash partitions/buckets; a bucket is the hash partitioning within a Hive table
partition.

Now my question is: how do you force data for a given partition p into
bucket n? Since you have already specified, say, 256 buckets, then whatever
prod_id is, it still has to go to one of the 256 buckets.

Within Spark, the number of files is actually the number of underlying RDD
partitions. You can find this out by invoking toJavaRDD.partitions.size() and
force a certain number of partitions by using coalesce(n) or
something like that. However, I am not sure the output will be what you
expect it to be.

Worth trying to sort it out the way you want with 8 partitions:

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val s = spark.read.parquet("oraclehadoop.sales2")
s.coalesce(8).registerTempTable("tmp")
HiveContext.sql("SELECT * FROM tmp SORT BY
prod_id").write.mode("overwrite").parquet("test.sales6")


It may work.
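
Another thing that might be worth a try (again only a sketch, reusing the column
and table names from your snippet) is to repartition on the bucketing column
before the write, so that each bucket within a partition is produced by a single
task and the file count stays close to the declared bucket count:

import org.apache.spark.sql.functions.col

res.repartition(8, col("xxx_id"))
  .write.mode("append").format("parquet")
  .partitionBy("year", "month", "day")
  .bucketBy(8, "xxx_id").sortBy("xxx_id")
  .options(options)
  .saveAsTable("result_bucket_name")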

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 September 2016 at 15:00, Qiang Li  wrote:

> I want to run job to load existing data from one S3 bucket, process it,
> then store to another bucket with Partition, and Bucket (data format
> conversion from tsv to parquet with gzip). So source data and results both
> are in S3, different are the tools which I used to process data.
>
> First I process data with Hive, create external tables with s3  location
> with partition and bucket number, jobs will generate files under each
> partition directory, and it was equal bucket number.
> then everything is ok, I also can use hive/presto/spark to run other jobs
> on results data in S3.
>
> But if I run spark job with partition and bucket, sort feature, spark job
> will generate far more files than bucket number under each partition
> directory, so presto or hive can not recongnize  the bucket because wrong
> files number is not equal bucket number in spark job.
>
> for example:
> ...
> val options = Map("path" -> "result_bucket_path", "compression" -> "gzip")
> res.write.mode("append").format("parquet").partitionBy("year", "month",
> "day").bucketBy(8, "xxx_id").sortBy("xxx_id").
> options(options).saveAsTable("result_bucket_name")
> ...
>
> The results bucket files under each partition is far more than 8.
>
>
> On Sat, Sep 17, 2016 at 9:27 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> It is difficult to guess what is happening with your data.
>>
>> First when you say you use Spark to generate test data are these selected
>> randomly and then stored in Hive/etc table?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 17 September 2016 at 13:59, Qiang Li  wrote:
>>
>>> Hi,
>>>
>>> I use spark to generate data , then we use hive/pig/presto/spark to
>>> analyze data, but I found even I add used bucketBy and sortBy with bucket
>>> number in Spark, the results files was generate by Spark is always far more
>>> than bucket number under each partition, then Presto can not recognize the
>>> bucket, how can I control that in Spark ?
>>>
>>> Unfortunately, I did not find any way to do that.
>>>
>>> Thank you.
>>>
>>> --
>>> Adam - App Annie Ops
>>> Phone: +86 18610024053
>>> Email: q...@appannie.com
>>>
>>> *This email may contain or reference confidential information and is
>>> intended only for the individual to whom it is addressed.  Please refrain
>>> from distributing, disclosing or copying this email and the information
>>> contained within unless you are the intended recipient.  If you received
>>> this email in error, please notify us at le...@appannie.com
>>> 

Re: Can not control bucket files number if it was speficed

2016-09-17 Thread Qiang Li
I want to run a job to load existing data from one S3 bucket, process it,
then store it to another bucket with partitioning and bucketing (data format
conversion from TSV to Parquet with gzip). So the source data and results are both
in S3; what differs are the tools I use to process the data.

First I process the data with Hive: I create external tables with an S3 location,
with partitioning and a bucket number; the jobs generate files under each
partition directory, and the file count equals the bucket number.
Then everything is OK, and I can also use Hive/Presto/Spark to run other jobs
on the result data in S3.

But if I run a Spark job with the partition, bucket and sort features, the Spark job
generates far more files than the bucket number under each partition
directory, so Presto or Hive cannot recognize the buckets, because the
file count does not equal the bucket number declared in the Spark job.

for example:
...
val options = Map("path" -> "result_bucket_path", "compression" -> "gzip")
res.write.mode("append").format("parquet").partitionBy("year", "month",
"day").bucketBy(8,
"xxx_id").sortBy("xxx_id").options(options).saveAsTable("result_bucket_name")
...

The resulting bucket files under each partition number far more than 8.


On Sat, Sep 17, 2016 at 9:27 PM, Mich Talebzadeh 
wrote:

> It is difficult to guess what is happening with your data.
>
> First when you say you use Spark to generate test data are these selected
> randomly and then stored in Hive/etc table?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 17 September 2016 at 13:59, Qiang Li  wrote:
>
>> Hi,
>>
>> I use spark to generate data , then we use hive/pig/presto/spark to
>> analyze data, but I found even I add used bucketBy and sortBy with bucket
>> number in Spark, the results files was generate by Spark is always far more
>> than bucket number under each partition, then Presto can not recognize the
>> bucket, how can I control that in Spark ?
>>
>> Unfortunately, I did not find any way to do that.
>>
>> Thank you.
>>
>> --
>> Adam - App Annie Ops
>> Phone: +86 18610024053
>> Email: q...@appannie.com
>>
>> *This email may contain or reference confidential information and is
>> intended only for the individual to whom it is addressed.  Please refrain
>> from distributing, disclosing or copying this email and the information
>> contained within unless you are the intended recipient.  If you received
>> this email in error, please notify us at le...@appannie.com
>> ** immediately and remove it from your system.*
>
>
>


-- 
Adam - App Annie Ops
Phone: +86 18610024053
Email: q...@appannie.com

-- 
*This email may contain or reference confidential information and is 
intended only for the individual to whom it is addressed.  Please refrain 
from distributing, disclosing or copying this email and the information 
contained within unless you are the intended recipient.  If you received 
this email in error, please notify us at le...@appannie.com 
** immediately and remove it from your system.*


Re: Can not control bucket files number if it was speficed

2016-09-17 Thread Mich Talebzadeh
It is difficult to guess what is happening with your data.

First, when you say you use Spark to generate test data, are these selected
randomly and then stored in a Hive/etc. table?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 September 2016 at 13:59, Qiang Li  wrote:

> Hi,
>
> I use spark to generate data , then we use hive/pig/presto/spark to
> analyze data, but I found even I add used bucketBy and sortBy with bucket
> number in Spark, the results files was generate by Spark is always far more
> than bucket number under each partition, then Presto can not recognize the
> bucket, how can I control that in Spark ?
>
> Unfortunately, I did not find any way to do that.
>
> Thank you.
>
> --
> Adam - App Annie Ops
> Phone: +86 18610024053
> Email: q...@appannie.com
>
> *This email may contain or reference confidential information and is
> intended only for the individual to whom it is addressed.  Please refrain
> from distributing, disclosing or copying this email and the information
> contained within unless you are the intended recipient.  If you received
> this email in error, please notify us at le...@appannie.com
> ** immediately and remove it from your system.*


Can not control bucket files number if it was speficed

2016-09-17 Thread Qiang Li
Hi,

I use Spark to generate data, then we use Hive/Pig/Presto/Spark to analyze
the data. But I found that even when I use bucketBy and sortBy with a bucket number in
Spark, the number of result files generated by Spark is always far more than
the bucket number under each partition, and then Presto cannot recognize the
buckets. How can I control that in Spark?

Unfortunately, I did not find any way to do that.

Thank you.

-- 
Adam - App Annie Ops
Phone: +86 18610024053
Email: q...@appannie.com

-- 
*This email may contain or reference confidential information and is 
intended only for the individual to whom it is addressed.  Please refrain 
from distributing, disclosing or copying this email and the information 
contained within unless you are the intended recipient.  If you received 
this email in error, please notify us at le...@appannie.com 
** immediately and remove it from your system.*


Re: Spark output data to S3 is very slow

2016-09-17 Thread Qiang Li
I tried several times and it is as slow as before. I will have Spark output
data to HDFS, then sync the data to S3 as a temporary solution.
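
Roughly like this (a sketch; the paths are placeholders):

// Write to HDFS first; the commit-time renames there are cheap metadata operations.
df.write.mode("overwrite").parquet("hdfs:///tmp/myjob/output")

and then copy the finished output to S3 out of band, e.g. with something like
hadoop distcp hdfs:///tmp/myjob/output s3a://my-bucket/myjob/output.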

Thank you.

On Sat, Sep 17, 2016 at 10:43 AM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Have you seen the previous thread?
> https://www.mail-archive.com/user@spark.apache.org/msg56791.html
>
> // maropu
>
>
> On Sat, Sep 17, 2016 at 11:34 AM, Qiang Li  wrote:
>
>> Hi,
>>
>>
>> I ran some jobs with Spark 2.0 on Yarn, I found all tasks finished very
>> quickly, but the last step, spark spend lots of time to rename or move data
>> from s3 temporary directory to real directory, then I try to set
>>
>> spark.hadoop.spark.sql.parquet.output.committer.class=org.
>> apache.spark.sql.execution.datasources.parquet.DirectParq
>> uetOutputCommitter
>> or
>> spark.sql.parquet.output.committer.class=org.apache.spark.
>> sql.parquet.DirectParquetOutputCommitter
>>
>> But both doesn't work, looks like spark 2.0 removed these configs, how
>> can I let spark output directly without temporary directory ?
>>
>>
>>
>> *This email may contain or reference confidential information and is
>> intended only for the individual to whom it is addressed.  Please refrain
>> from distributing, disclosing or copying this email and the information
>> contained within unless you are the intended recipient.  If you received
>> this email in error, please notify us at le...@appannie.com
>> ** immediately and remove it from your system.*
>
>
>
>
> --
> ---
> Takeshi Yamamuro
>

-- 
*This email may contain or reference confidential information and is 
intended only for the individual to whom it is addressed.  Please refrain 
from distributing, disclosing or copying this email and the information 
contained within unless you are the intended recipient.  If you received 
this email in error, please notify us at le...@appannie.com 
** immediately and remove it from your system.*