Re: Disable queuing of spark job on Mesos cluster if sufficient resources are not found

2017-05-26 Thread Michael Gummelt
Nope, sorry.

On Fri, May 26, 2017 at 4:38 AM, Mevada, Vatsal 
wrote:

> Hello,
>
> I am using Mesos with cluster deployment mode to submit my jobs.
>
> When sufficient resources are not available on Mesos cluster, I can see
> that my jobs are queuing up on Mesos dispatcher UI.
>
> Is it possible to tweak some configuration so that my job submission fails
> gracefully (instead of queuing up) if sufficient resources are not found on
> Mesos cluster?
>
> Regards,
>
> Vatsal
>



-- 
Michael Gummelt
Software Engineer
Mesosphere


[Spark Streaming] DAG Execution Model Clarification

2017-05-26 Thread Nipun Arora
Hi,

I would like some clarification on the execution model for Spark Streaming.

Broadly, I am trying to understand if output operations in a DAG are only
processed after all intermediate operations are finished for all parts of
the DAG.

Let me give an example:

I have a DStream A. I apply map operations on this DStream to create two
different DStreams, B and C, such that:

A ---> B -> (some operations) ---> kafka output 1
  \> C---> ( some operations) --> kafka output 2

I want to understand whether kafka output 1 and kafka output 2 wait for all
operations to finish on both B and C before sending output, or whether each
simply sends its output as soon as the operations on its own branch are done.

What kind of synchronization guarantees are there?

Thanks
Nipun
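
Not the original poster's code, but a self-contained sketch of the branching
described above; the socket source, the transform functions, and the
println-based "Kafka" sinks are all placeholders for the real logic:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object BranchingSketch {
  // Placeholder transforms and sinks, purely illustrative.
  def transformB(s: String): String = s.toUpperCase
  def transformC(s: String): String = s.reverse
  def sendToKafka1(rows: Iterator[String]): Unit = rows.foreach(println) // stands in for Kafka producer 1
  def sendToKafka2(rows: Iterator[String]): Unit = rows.foreach(println) // stands in for Kafka producer 2

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("branching-sketch").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    val a: DStream[String] = ssc.socketTextStream("localhost", 9999) // DStream A (any source)
    val b = a.map(transformB)                                        // branch B
    val c = a.map(transformC)                                        // branch C

    b.foreachRDD(rdd => rdd.foreachPartition(sendToKafka1))          // "kafka output 1"
    c.foreachRDD(rdd => rdd.foreachPartition(sendToKafka2))          // "kafka output 2"

    ssc.start()
    ssc.awaitTermination()
  }
}
```

As far as I know, the output operations of a batch are scheduled one after
another in the order they are registered (spark.streaming.concurrentJobs
defaults to 1), so within a batch "kafka output 2" only starts once the job
for "kafka output 1" has finished.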


user-unsubscr...@spark.apache.org

2017-05-26 Thread williamtellme123
user-unsubscr...@spark.apache.org

 

From: ANEESH .V.V [mailto:aneeshnair.ku...@gmail.com] 
Sent: Friday, May 26, 2017 1:50 AM
To: user@spark.apache.org
Subject: unsubscribe

 

unsubscribe



Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-26 Thread Everett Anderson
Hi,

I need to set a checkpoint directory as I'm starting to use GraphFrames.
(Also, occasionally my regular DataFrame lineages get too long so it'd be
nice to use checkpointing to squash the lineage.)

I don't actually need this checkpointed data to live beyond the life of the
job, however. I'm running jobs on AWS EMR (so on YARN + HDFS) and reading
and writing non-transient data to S3.

Two questions:

1. Is there a Spark --conf option to set the checkpoint directory? Somehow
I couldn't find it, but surely it exists.

2. What's a good checkpoint directory for this use case? I imagine it'd be
on HDFS and presumably in a YARN application-specific temporary path that
gets cleaned up afterwards. Does anyone have a recommendation?

Thanks!

- Everett
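
For what it's worth, a minimal sketch of pointing the checkpoint directory at
a transient HDFS path on EMR; the path layout and the manual cleanup at the
end are just one possible convention, not something Spark does for you:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("graphframes-job").getOrCreate()
val sc = spark.sparkContext

// Keep checkpoint data on cluster-local HDFS rather than S3; the directory name
// below is just a convention keyed to this run, Spark does not clean it up itself.
val checkpointDir = s"hdfs:///tmp/spark-checkpoints/${sc.applicationId}"
sc.setCheckpointDir(checkpointDir)

// ... run the GraphFrames job / checkpoint long DataFrame lineages here ...

// Best-effort cleanup once the job is done.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path(checkpointDir), true)
```

If I remember correctly there is also spark.cleaner.referenceTracking.cleanCheckpoints
(false by default), which lets Spark delete checkpoint files for RDDs that have
gone out of scope, but it does not remove the directory itself.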


Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Jörn Franke
Just load it as you would from any other directory.

> On 26. May 2017, at 17:26, Priya PM  wrote:
> 
> 
> -- Forwarded message --
> From: Priya PM 
> Date: Fri, May 26, 2017 at 8:54 PM
> Subject: Re: Spark checkpoint - nonstreaming
> To: Jörn Franke 
> 
> 
> Oh, how do I do it? I don't see it mentioned anywhere in the documentation.
> 
> I have followed this link 
> https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md
> to understand the checkpoint workflow.
> 
> But it doesn't seem to work the way described below when the second run
> tries to read from the checkpointed RDD.
> 
>  
> 
> Q: How to read a checkpointed RDD?
> 
> runJob() will call finalRDD.partitions() to determine how many tasks there 
> will be. rdd.partitions() checks if the RDD has been checkpointed via 
> RDDCheckpointData which manages checkpointed RDD. If yes, return the 
> partitions of the RDD (Array[Partition]). When rdd.iterator() is called to 
> compute RDD's partition, computeOrReadCheckpoint(split: Partition) is also 
> called to check if the RDD is checkpointed. If yes, the parent RDD's 
> iterator(), a.k.a CheckpointRDD.iterator() will be called. CheckpointRDD 
> reads files on the file system to produce the RDD partition. That's why a parent
> CheckpointRDD is added to the checkpointed RDD in a somewhat tricky way.
> 
> 
>> On Fri, May 26, 2017 at 8:48 PM, Jörn Franke  wrote:
>> Did you explicitly tell the application to read from the checkpoint 
>> directory ?
>> This you have to do in non-streaming scenarios.
>> 
>>> On 26. May 2017, at 16:52, Priya PM  wrote:
>>> 
>>> Yes, I did set the checkpoint directory. I could see the checkpointed RDD 
>>> too. 
>>> 
>>> [root@ rdd-28]# pwd
>>> /root/checkpointDir/9dd1acf0-bef8-4a4f-bf0e-f7624334abc5/rdd-28
>>> 
>>> I am using the MovieLens application to check spark checkpointing feature. 
>>> 
>>> code: MovieLensALS.scala
>>> 
>>> def main(args: Array[String]) { 
>>> ..
>>> ..
>>> sc.setCheckpointDir("/root/checkpointDir")
>>> }
>>> 
>>> 
>>> 
 On Fri, May 26, 2017 at 8:09 PM, Jörn Franke  wrote:
 Do you have some source code?
 Did you set the checkpoint directory ?
 
 > On 26. May 2017, at 16:06, Priya  wrote:
 >
 > Hi,
 >
 > With nonstreaming spark application, did checkpoint the RDD and I could 
 > see
 > the RDD getting checkpointed. I have killed the application after
 > checkpointing the RDD and restarted the same application again 
 > immediately,
 > but it doesn't seem to pick from checkpoint and it again checkpoints the
 > RDD. Could anyone please explain why am I seeing this behavior, why it is
 > not picking from the checkpoint and proceeding further from there on the
 > second run of the same application. Would really help me understand spark
 > checkpoint work flow if I can get some clarity on the behavior. Please 
 > let
 > me know if I am missing something.
 >
 > [root@checkpointDir]# ls
 > 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5  
 > a4f14f43-e7c3-4f64-a980-8483b42bb11d
 >
 > [root@9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# ls -la
 > total 0
 > drwxr-xr-x. 3 root root  20 May 26 16:26 .
 > drwxr-xr-x. 4 root root  94 May 26 16:24 ..
 > drwxr-xr-x. 2 root root 133 May 26 16:26 rdd-28
 >
 > [root@priya-vm 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# cd rdd-28/
 > [root@priya-vm rdd-28]# ls
 > part-0  part-1  _partitioner
 >
 > Thanks
 >
 >
 >
 >
 >
 > --
 > View this message in context: 
 > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-checkpoint-nonstreaming-tp28712.html
 > Sent from the Apache Spark User List mailing list archive at Nabble.com.
 >
 > -
 > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
 >
>>> 
> 
> 
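
To make the quoted Q&A concrete, a minimal single-application sketch
(illustrative paths, not the MovieLensALS code) of when the checkpoint files
are written and read back:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-flow-sketch").getOrCreate()
val sc = spark.sparkContext

sc.setCheckpointDir("/root/checkpointDir")

// Illustrative input, not the MovieLens code.
val ratings = sc.textFile("hdfs:///data/ratings.dat").map(_.split("::"))

ratings.checkpoint()        // only marks the RDD for checkpointing; nothing written yet

ratings.count()             // first action: materializes the RDD and writes the rdd-NN files
ratings.map(_.length).sum() // later actions: computeOrReadCheckpoint() now reads the
                            // rdd-NN files instead of recomputing the lineage
```

A fresh run of the application creates a new SparkContext with new RDD ids, so
it does not go looking for the rdd-28 directory left by the previous run; the
checkpoint files are only reused by later jobs inside the run that wrote them,
which is the behaviour described above.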


Fwd: Spark checkpoint - nonstreaming

2017-05-26 Thread Priya PM
-- Forwarded message --
From: Priya PM 
Date: Fri, May 26, 2017 at 8:54 PM
Subject: Re: Spark checkpoint - nonstreaming
To: Jörn Franke 


Oh, how do I do it? I don't see it mentioned anywhere in the documentation.

I have followed this link
https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md
to understand the checkpoint workflow.

But it doesn't seem to work the way described below when the second run tries
to read from the checkpointed RDD.



Q: How to read a checkpointed RDD?

runJob() will call finalRDD.partitions() to determine how many tasks there
will be. rdd.partitions() checks if the RDD has been checkpointed via
RDDCheckpointData which manages checkpointed RDD. If yes, return the
partitions of the RDD (Array[Partition]). When rdd.iterator() is called to
compute RDD's partition, computeOrReadCheckpoint(split: Partition) is also
called to check if the RDD is checkpointed. If yes, the parent RDD's
iterator(), a.k.a CheckpointRDD.iterator() will be called. CheckpointRDD
reads files on the file system to produce the RDD partition. That's why a parent
CheckpointRDD is added to the checkpointed RDD in a somewhat tricky way.

On Fri, May 26, 2017 at 8:48 PM, Jörn Franke  wrote:

> Did you explicitly tell the application to read from the checkpoint
> directory ?
> This you have to do in non-streaming scenarios.
>
> On 26. May 2017, at 16:52, Priya PM  wrote:
>
> yes, i did set the checkpoint directory. I could see the checkpointed RDD
> too.
>
> [root@ rdd-28]# pwd
> /root/checkpointDir/9dd1acf0-bef8-4a4f-bf0e-f7624334abc5/rdd-28
>
> I am using the MovieLens application to check spark checkpointing feature.
>
> code: MovieLensALS.scala
>
> def main(args: Array[String]) {
> ..
> ..
> sc.setCheckpointDir("/root/checkpointDir")
> }
>
>
>
> On Fri, May 26, 2017 at 8:09 PM, Jörn Franke  wrote:
>
>> Do you have some source code?
>> Did you set the checkpoint directory ?
>>
>> > On 26. May 2017, at 16:06, Priya  wrote:
>> >
>> > Hi,
>> >
>> > With nonstreaming spark application, did checkpoint the RDD and I could
>> see
>> > the RDD getting checkpointed. I have killed the application after
>> > checkpointing the RDD and restarted the same application again
>> immediately,
>> > but it doesn't seem to pick from checkpoint and it again checkpoints the
>> > RDD. Could anyone please explain why am I seeing this behavior, why it
>> is
>> > not picking from the checkpoint and proceeding further from there on the
>> > second run of the same application. Would really help me understand
>> spark
>> > checkpoint work flow if I can get some clarity on the behavior. Please
>> let
>> > me know if I am missing something.
>> >
>> > [root@checkpointDir]# ls
>> > 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5  a4f14f43-e7c3-4f64-a980-8483b42bb11d
>> >
>> > [root@9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# ls -la
>> > total 0
>> > drwxr-xr-x. 3 root root  20 May 26 16:26 .
>> > drwxr-xr-x. 4 root root  94 May 26 16:24 ..
>> > drwxr-xr-x. 2 root root 133 May 26 16:26 rdd-28
>> >
>> > [root@priya-vm 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# cd rdd-28/
>> > [root@priya-vm rdd-28]# ls
>> > part-0  part-1  _partitioner
>> >
>> > Thanks
>> >
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-checkpoint-nonstreaming-tp28712.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>
>


[no subject]

2017-05-26 Thread Anton Kravchenko
df.rdd.foreachPartition(convert_to_sas_single_partition)

def convert_to_sas_single_partition(ipartition: Iterator[Row]): Unit = {

for (irow <- ipartition) {


Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Jörn Franke
Do you have some source code?
Did you set the checkpoint directory ?

> On 26. May 2017, at 16:06, Priya  wrote:
> 
> Hi,
> 
> With nonstreaming spark application, did checkpoint the RDD and I could see
> the RDD getting checkpointed. I have killed the application after
> checkpointing the RDD and restarted the same application again immediately,
> but it doesn't seem to pick from checkpoint and it again checkpoints the
> RDD. Could anyone please explain why am I seeing this behavior, why it is
> not picking from the checkpoint and proceeding further from there on the
> second run of the same application. Would really help me understand spark
> checkpoint work flow if I can get some clarity on the behavior. Please let
> me know if I am missing something. 
> 
> [root@checkpointDir]# ls
> 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5  a4f14f43-e7c3-4f64-a980-8483b42bb11d
> 
> [root@9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# ls -la
> total 0
> drwxr-xr-x. 3 root root  20 May 26 16:26 .
> drwxr-xr-x. 4 root root  94 May 26 16:24 ..
> drwxr-xr-x. 2 root root 133 May 26 16:26 rdd-28
> 
> [root@priya-vm 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# cd rdd-28/
> [root@priya-vm rdd-28]# ls
> part-0  part-1  _partitioner
> 
> Thanks
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-checkpoint-nonstreaming-tp28712.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Holden Karau
In non-streaming Spark, checkpoints aren't for inter-application recovery;
rather, you can think of them as doing a persist, but to HDFS rather than to
each node's local memory/storage.
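
Given that, reusing expensive results across separate runs is usually done with
an explicit save and load rather than checkpoint; a minimal sketch under that
assumption (path and record type are illustrative):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reuse-across-runs").getOrCreate()
val sc = spark.sparkContext

val savePath = "hdfs:///user/priya/ratings-prepared" // illustrative location

val fs = FileSystem.get(sc.hadoopConfiguration)
val ratings =
  if (fs.exists(new Path(savePath))) {
    // Second run: load what the previous run saved instead of recomputing.
    sc.objectFile[(Int, Int, Double)](savePath)
  } else {
    // First run: compute, then save for the next run.
    val computed = sc.textFile("hdfs:///data/ratings.dat")
      .map(_.split("::"))
      .map(a => (a(0).toInt, a(1).toInt, a(2).toDouble))
    computed.saveAsObjectFile(savePath)
    computed
  }

println(ratings.count())
```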


On Fri, May 26, 2017 at 3:06 PM Priya  wrote:

> Hi,
>
> With nonstreaming spark application, did checkpoint the RDD and I could see
> the RDD getting checkpointed. I have killed the application after
> checkpointing the RDD and restarted the same application again immediately,
> but it doesn't seem to pick from checkpoint and it again checkpoints the
> RDD. Could anyone please explain why am I seeing this behavior, why it is
> not picking from the checkpoint and proceeding further from there on the
> second run of the same application. Would really help me understand spark
> checkpoint work flow if I can get some clarity on the behavior. Please let
> me know if I am missing something.
>
> [root@checkpointDir]# ls
> 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5  a4f14f43-e7c3-4f64-a980-8483b42bb11d
>
> [root@9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# ls -la
> total 0
> drwxr-xr-x. 3 root root  20 May 26 16:26 .
> drwxr-xr-x. 4 root root  94 May 26 16:24 ..
> drwxr-xr-x. 2 root root 133 May 26 16:26 rdd-28
>
> [root@priya-vm 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# cd rdd-28/
> [root@priya-vm rdd-28]# ls
> part-0  part-1  _partitioner
>
> Thanks
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-checkpoint-nonstreaming-tp28712.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


using pandas and pyspark to run ETL job - always failing after about 40 minutes

2017-05-26 Thread Zeming Yu
Hi,

I tried running the ETL job a few times. It always fails after 40 minutes
or so. When I relaunch jupyter and rerun the job, it runs without error.
Then it fails again after some time. Just wondering if anyone else has
encountered this before?

Here's the error message:


Exception happened during processing of request from ('127.0.0.1', 40292)

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.5/socketserver.py", line
313, in _handle_request_noblock
self.process_request(request, client_address)
  File "/home/ubuntu/anaconda3/lib/python3.5/socketserver.py", line
341, in process_request
self.finish_request(request, client_address)
  File "/home/ubuntu/anaconda3/lib/python3.5/socketserver.py", line
354, in finish_request
self.RequestHandlerClass(request, client_address, self)
  File "/home/ubuntu/anaconda3/lib/python3.5/socketserver.py", line
681, in __init__
self.handle()
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/accumulators.py",
line 235, in handle
num_updates = read_int(self.rfile)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/serializers.py",
line 577, in read_int
raise EOFError
EOFError


Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-26 Thread Daniel Siegmann
Thanks for the help everyone.

It seems the automatic coalescing doesn't happen when accessing ORC data
through a Hive metastore unless you configure
spark.sql.hive.convertMetastoreOrc to be true (it is false by default). I'm
not sure if this is documented somewhere, or if there's any reason not to
enable it, but I haven't had any problem with it.
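
A minimal sketch of the settings mentioned in this thread; the first two values
shown are, to the best of my knowledge, the defaults in Spark 2.1/2.2, and the
read path is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coalescing-settings")
  // Upper bound on the bytes packed into a single input partition.
  .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024L)
  // Estimated cost (in bytes) of opening a file; makes many tiny files
  // "expensive" so they get packed together.
  .config("spark.sql.files.openCostInBytes", 4 * 1024 * 1024L)
  // Lets ORC tables read through the Hive metastore use the native file-based
  // reader, and therefore the same coalescing logic.
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .getOrCreate()

val df = spark.read.parquet("s3://bucket/path/with/many/small/files") // illustrative path
println(df.rdd.getNumPartitions) // fewer partitions than input files once coalescing kicks in
```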


--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001


On Sat, May 20, 2017 at 9:14 PM, Kabeer Ahmed  wrote:

> Thank you Takeshi.
>
> As far as I see from the code pointed, the default number of bytes to pack
> in a partition is set to 128MB - size of the parquet block size.
>
> Daniel,
>
> It seems you do have a need to modify the number of bytes you want to pack
> per partition. I am curious to know the scenario. Please share if you can.
>
> Thanks,
> Kabeer.
>
> On May 20 2017, at 4:54 pm, Takeshi Yamamuro 
> wrote:
>
>> I think this document points to the logic here:
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L418
>>
>> This logic merges small files into a partition, and you can control the
>> threshold via `spark.sql.files.maxPartitionBytes`.
>>
>> // maropu
>>
>>
>> On Sat, May 20, 2017 at 8:15 AM, ayan guha  wrote:
>>
>> I think, like all other read operations, it is driven by the input format
>> used, and I think some variation of a combine-file input format is used by
>> default. I think you can test it by forcing a particular input format which
>> gets one file per split; then you should end up with the same number of
>> partitions as your data files.
>>
>> On Sat, 20 May 2017 at 5:12 am, Aakash Basu 
>> wrote:
>>
>> Hey all,
>>
>> A reply on this would be great!
>>
>> Thanks,
>> A.B.
>>
>> On 17-May-2017 1:43 AM, "Daniel Siegmann" 
>> wrote:
>>
>> When using spark.read on a large number of small files, these are
>> automatically coalesced into fewer partitions. The only documentation I can
>> find on this is in the Spark 2.0.0 release notes, where it simply says (
>> http://spark.apache.org/releases/spark-release-2-0-0.html
>> 
>> ):
>>
>> "Automatic file coalescing for native data sources"
>>
>> Can anyone point me to documentation explaining what triggers this
>> feature, how it decides how many partitions to coalesce to, and what counts
>> as a "native data source"? I couldn't find any mention of this feature in
>> the SQL Programming Guide and Google was not helpful.
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Spark checkpoint - nonstreaming

2017-05-26 Thread Priya
Hi,

In a non-streaming Spark application, I checkpointed an RDD and could see
the RDD getting checkpointed. I killed the application after
checkpointing the RDD and immediately restarted the same application,
but it doesn't seem to pick up from the checkpoint; it checkpoints the
RDD again. Could anyone please explain why I am seeing this behavior, and why it is
not picking up from the checkpoint and proceeding from there on the
second run of the same application? It would really help me understand the Spark
checkpoint workflow if I could get some clarity on this behavior. Please let
me know if I am missing something.

[root@checkpointDir]# ls
9dd1acf0-bef8-4a4f-bf0e-f7624334abc5  a4f14f43-e7c3-4f64-a980-8483b42bb11d

[root@9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# ls -la
total 0
drwxr-xr-x. 3 root root  20 May 26 16:26 .
drwxr-xr-x. 4 root root  94 May 26 16:24 ..
drwxr-xr-x. 2 root root 133 May 26 16:26 rdd-28

[root@priya-vm 9dd1acf0-bef8-4a4f-bf0e-f7624334abc5]# cd rdd-28/
[root@priya-vm rdd-28]# ls
part-0  part-1  _partitioner

Thanks





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-checkpoint-nonstreaming-tp28712.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



convert ps to jpg file

2017-05-26 Thread Selvam Raman
Hi,

Is there any good open-source tool to convert PS to JPG?

I am running a Spark job within which I am using ImageMagick/GraphicsMagick
with Ghostscript to convert/resize images.

IM/GM takes a lot of memory/mapped memory/disk to convert even a KB-sized image
file, and it takes a lot of time. Because of this issue, I frequently get OOM
and disk-full errors.

Could you please share your thoughts?
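
One option, sketched below, is to shell out to Ghostscript directly from each
partition and skip the ImageMagick layer entirely; the paths, resolution, and
helper name are illustrative, and Ghostscript has to be installed on every
executor:

```scala
import scala.sys.process._
import java.nio.file.Paths

// Convert one PostScript file to JPEG by calling Ghostscript directly,
// without going through ImageMagick. Returns the gs exit code.
def psToJpg(inPath: String, outDir: String, dpi: Int = 150): Int = {
  val name = Paths.get(inPath).getFileName.toString.stripSuffix(".ps")
  val outPath = s"$outDir/$name.jpg"
  Seq(
    "gs", "-q", "-dBATCH", "-dNOPAUSE", "-dSAFER",
    "-sDEVICE=jpeg", s"-r$dpi",
    s"-sOutputFile=$outPath",
    inPath
  ).! // runs the command and waits for the exit code
}

// Hypothetical usage inside a Spark job: one gs process per file, per partition.
// psFiles.foreachPartition(paths => paths.foreach(p => psToJpg(p, "/tmp/jpg")))
```

Driving gs directly may avoid ImageMagick's intermediate raster and its
memory-mapped temp files, which is often where the disk and memory blow-ups
come from, but whether it helps here depends on the inputs.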

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Disable queuing of spark job on Mesos cluster if sufficient resources are not found

2017-05-26 Thread Mevada, Vatsal
Hello,

I am using Mesos with cluster deployment mode to submit my jobs.

When sufficient resources are not available on Mesos cluster, I can see that my 
jobs are queuing up on Mesos dispatcher UI.

Is it possible to tweak some configuration so that my job submission fails
gracefully (instead of queuing up) if sufficient resources are not found on the
Mesos cluster?

Regards,
Vatsal


unsubscribe

2017-05-26 Thread ANEESH .V.V
unsubscribe


Re: Running into the same problem as JIRA SPARK-19268

2017-05-26 Thread kant kodali
https://issues.apache.org/jira/browse/SPARK-20894

On Thu, May 25, 2017 at 4:31 PM, Shixiong(Ryan) Zhu  wrote:

> I don't know what happened in your case, so I cannot provide any workaround.
> It would be great if you can provide logs output
> by HDFSBackedStateStoreProvider.
>
> On Thu, May 25, 2017 at 4:05 PM, kant kodali  wrote:
>
>>
>> On Thu, May 25, 2017 at 3:41 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> bin/hadoop fs -ls /usr/local/hadoop/checkpoint/state/0/*
>>>
>>
>> Hi,
>>
>> There are no files under bin/hadoop fs -ls
>> /usr/local/hadoop/checkpoint/state/0/*,
>> but all the directories up to /usr/local/hadoop/checkpoint/state/0 do
>> exist (they were created by Spark).
>>
>> yes I can attach the log but pretty much it looks like same as I sent on
>> this thread.
>>
>> Is there any workaround for this for now? I will create a ticket shortly.
>>
>> Thanks!
>>
>
>
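
For context, assuming the layout hasn't changed, the files in question live
under the query's checkpoint location at state/<operatorId>/<partitionId>/,
which a stateful query writes roughly like this (paths and source are
illustrative):

```scala
import org.apache.spark.sql.SparkSession

object StateStoreSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("state-store-sketch").getOrCreate()
    import spark.implicits._

    // Any stateful aggregation keeps its state under <checkpointLocation>/state/...
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()
      .as[String]

    val counts = lines.groupBy($"value").count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      // State files end up under /usr/local/hadoop/checkpoint/state/0/<partition>/
      // (illustrative path, matching the directory being listed above).
      .option("checkpointLocation", "/usr/local/hadoop/checkpoint")
      .start()

    query.awaitTermination()
  }
}
```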