Spark Streaming - Custom ReceiverInputDStream (Custom Source) in Java

2016-01-21 Thread Nagu Kothapalli
Hi All

I am facing an issue with a CustomInputDStream object in Java.



public CustomInputDStream(StreamingContext ssc_, ClassTag classTag) {
    super(ssc_, classTag);
}
Can you help me create an instance of the above class with a ClassTag in
Java?
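One way to do this, as a minimal sketch (assuming the stream emits String records - substitute your own record class), is to build the scala.reflect.ClassTag from a plain Java Class object:

import org.apache.spark.streaming.StreamingContext;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

public class CustomStreamFactory {
    // Build a ClassTag from a Java Class object and pass it to the constructor shown above.
    public static CustomInputDStream create(StreamingContext ssc) {
        ClassTag<String> tag = ClassTag$.MODULE$.apply(String.class);
        return new CustomInputDStream(ssc, tag);
    }
}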


avg(df$column) not returning a value but just the text "Column avg"

2016-01-21 Thread Devesh Raj Singh
Hi,

I want to compute the average of the numerical columns in the iris dataset using SparkR.

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages"
"com.databricks:spark-csv_2.10:1.3.0" "sparkr-shell"')

library(SparkR)
sc=sparkR.init(master="local",sparkHome =
"/Users/devesh/Downloads/spark-1.4.1-bin-hadoop2.6",sparkPackages =
c("com.databricks:spark-csv_2.10:1.3.0"))
# To read csv files

#initiating the sql context
sqlContext <- sparkRSQL.init(sc)
# example of sparkR
df <- createDataFrame(sqlContext, iris)



avg(df$Sepal_Length)

gives me the output:
Column avg(Sepal_Length)

but not a number.
-- 
Warm regards,
Devesh.
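For reference, a minimal SparkR sketch of one way to materialize the number (not verified against 1.4.1): avg(df$Sepal_Length) only builds a Column expression, so it needs to be wrapped in agg() and collected back to R:

avgDF <- agg(df, avg(df$Sepal_Length))  # still a distributed DataFrame with a single row
collect(avgDF)                          # local R data.frame holding the numeric average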


Re: General Question (Spark Hive integration )

2016-01-21 Thread Bala
Thanks for the response, Silvio. My table is not partitioned because my filter
column is the primary key, and I guess we can't partition on the primary key column. My
table has 600 million rows; if I query a single record, it seems that by default it is
loading the whole table and taking some time just to return that one record. Please
suggest anything I can tune here; mine is a 5-node Spark cluster.

Bala

> On Jan 21, 2016, at 7:07 PM, Silvio Fiorito  
> wrote:
> 
> Also, just to clarify it doesn’t read the whole table into memory unless you 
> specifically cache it.
> 
> From: Silvio Fiorito 
> Date: Thursday, January 21, 2016 at 10:02 PM
> To: "Balaraju.Kagidala Kagidala" , 
> "user@spark.apache.org" 
> Subject: Re: General Question (Spark Hive integration )
> 
> Hi Bala,
> 
> It depends on how your Hive table is configured. If you used partitioning and 
> you are filtering on a partition column then it will only load the relevant 
> partitions. If, however, you’re filtering on a non-partitioned column then it 
> will have to read all the data and then filter as part of the Spark job.
> 
> Thanks,
> Silvio
> 
> From: "Balaraju.Kagidala Kagidala" 
> Date: Thursday, January 21, 2016 at 9:37 PM
> To: "user@spark.apache.org" 
> Subject: General Question (Spark Hive integration )
> 
> Hi ,
> 
> 
>   I have a simple question regarding Spark Hive integration with DataFrames.
> 
> When we query a table, does Spark load the whole table into memory and 
> apply the filter on top of it, or does it only load the data with the filter applied?
> 
> For example, for the query 'select * from employee where deptno=10', does my 
> RDD load the whole employee table into memory and apply the filter, or will it load 
> only the dept number 10 data?
> 
> 
> Thanks
> Bala
> 
> 
> 
> 
> 


Re: 10hrs of Scheduler Delay

2016-01-21 Thread Sanders, Isaac B
I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.

- Isaac

On Jan 21, 2016, at 11:06 PM, Ted Yu 
mailto:yuzhih...@gmail.com>> wrote:

You may have seen the following on github page:

Latest commit 50fdf0e  on Feb 22, 2015

That was 11 months ago.

Can you search for similar algorithm which runs on Spark and is newer ?

If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.

Cheers

On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
mailto:sande...@rose-hulman.edu>> wrote:
That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.

This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.

I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.

https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala

- Isaac

On Jan 21, 2016, at 10:08 PM, Ted Yu 
mailto:yuzhih...@gmail.com>> wrote:

You may have noticed the following - did this indicate prolonged computation in 
your code ?

org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)

On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B 
mailto:sande...@rose-hulman.edu>> wrote:
Hadoop is: HDP 2.3.2.0-2950

Here is a gist (pastebin) of my versions en masse and a stacktrace: 
https://gist.github.com/isaacsanders/2e59131758469097651b

Thanks

On Jan 21, 2016, at 7:44 PM, Ted Yu 
mailto:yuzhih...@gmail.com>> wrote:

Looks like you were running on YARN.

What hadoop version are you using ?

Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?

Thanks

On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
mailto:sande...@rose-hulman.edu>> wrote:
The Spark Version is 1.4.1

The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.

Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2

Thanks
Isaac

On Jan 21, 2016, at 11:03 AM, Ted Yu 
mailto:yuzhih...@gmail.com>> wrote:

Can you provide a bit more information ?

command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?

Thanks

On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
mailto:sande...@rose-hulman.edu>> wrote:
Hey all,

I am a CS student in the United States working on my senior thesis.

My thesis uses Spark, and I am encountering some trouble.

I am using https://github.com/alitouka/spark_dbscan, and to determine 
parameters, I am using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.

I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.

I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.

It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.

I have tried:
- Increasing heap sizes and numbers of cores
- More/less executors with different amounts of resources.
- Kryo Serialization
- FAIR Scheduling

It doesn’t seem like it should require this much. Any ideas?

- Isaac










Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
You may have seen the following on github page:

Latest commit 50fdf0e  on Feb 22, 2015

That was 11 months ago.

Can you search for similar algorithm which runs on Spark and is newer ?

If nothing found, consider running the tests coming from the project to
determine whether the delay is intrinsic.

Cheers

On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
wrote:

> That thread seems to be moving, it oscillates between a few different
> traces… Maybe it is working. It seems odd that it would take that long.
>
> This is 3rd party code, and after looking at some of it, I think it might
> not be as Spark-y as it could be.
>
> I linked it below. I don’t know a lot about spark, so it might be fine,
> but I have my suspicions.
>
>
> https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala
>
> - Isaac
>
> On Jan 21, 2016, at 10:08 PM, Ted Yu  wrote:
>
> You may have noticed the following - did this indicate prolonged
> computation in your code ?
>
> org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
> org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
> org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)
>
>
> On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B <
> sande...@rose-hulman.edu> wrote:
>
>> Hadoop is: HDP 2.3.2.0-2950
>>
>> Here is a gist (pastebin) of my versions en masse and a stacktrace:
>> https://gist.github.com/isaacsanders/2e59131758469097651b
>>
>> Thanks
>>
>> On Jan 21, 2016, at 7:44 PM, Ted Yu  wrote:
>>
>> Looks like you were running on YARN.
>>
>> What hadoop version are you using ?
>>
>> Can you capture a few stack traces of the AppMaster during the delay and
>> pastebin them ?
>>
>> Thanks
>>
>> On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B <
>> sande...@rose-hulman.edu> wrote:
>>
>>> The Spark Version is 1.4.1
>>>
>>> The logs are full of standard fare, nothing like an exception or even
>>> interesting [INFO] lines.
>>>
>>> Here is the script I am using:
>>> https://gist.github.com/isaacsanders/660f480810fbc07d4df2
>>>
>>> Thanks
>>> Isaac
>>>
>>> On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:
>>>
>>> Can you provide a bit more information ?
>>>
>>> command line for submitting Spark job
>>> version of Spark
>>> anything interesting from driver / executor logs ?
>>>
>>> Thanks
>>>
>>> On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B <
>>> sande...@rose-hulman.edu> wrote:
>>>
 Hey all,

 I am a CS student in the United States working on my senior thesis.

 My thesis uses Spark, and I am encountering some trouble.

 I am using https://github.com/alitouka/spark_dbscan, and to determine
 parameters, I am using the utility class they supply,
 org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.

 I am on a 10 node cluster with one machine with 8 cores and 32G of
 memory and nine machines with 6 cores and 16G of memory.

 I have 442M of data, which seems like it would be a joke, but the job
 stalls at the last stage.

 It was stuck in Scheduler Delay for 10 hours overnight, and I have
 tried a number of things for the last couple days, but nothing seems to be
 helping.

 I have tried:
 - Increasing heap sizes and numbers of cores
 - More/less executors with different amounts of resources.
 - Kryo Serialization
 - FAIR Scheduling

 It doesn’t seem like it should require this much. Any ideas?

 - Isaac
>>>
>>>
>>>
>>>
>>
>>
>
>


Spark partition size tuning

2016-01-21 Thread Jia Zou
Dear all!

When using Spark to read from the local file system, the default partition size
is 32MB. How can I increase the partition size to 128MB, to reduce the
number of tasks?

Thank you very much!

Best Regards,
Jia
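A hedged sketch of two common approaches (the input path is a placeholder, and the split-size property name should be checked against your Hadoop version):

// Option 1: raise the minimum input split size so each partition covers
// roughly 128 MB instead of the 32 MB local-filesystem block size.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)
val rdd = sc.textFile("file:///path/to/input")

// Option 2: read with the default splits, then reduce the partition count.
val fewerPartitions = sc.textFile("file:///path/to/input").coalesce(64)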


Re: 10hrs of Scheduler Delay

2016-01-21 Thread Sanders, Isaac B
That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.

This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.

I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.

https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala

- Isaac

On Jan 21, 2016, at 10:08 PM, Ted Yu 
mailto:yuzhih...@gmail.com>> wrote:

You may have noticed the following - did this indicate prolonged computation in 
your code ?

org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)

On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B 
mailto:sande...@rose-hulman.edu>> wrote:
Hadoop is: HDP 2.3.2.0-2950

Here is a gist (pastebin) of my versions en masse and a stacktrace: 
https://gist.github.com/isaacsanders/2e59131758469097651b

Thanks

On Jan 21, 2016, at 7:44 PM, Ted Yu 
mailto:yuzhih...@gmail.com>> wrote:

Looks like you were running on YARN.

What hadoop version are you using ?

Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?

Thanks

On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
mailto:sande...@rose-hulman.edu>> wrote:
The Spark Version is 1.4.1

The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.

Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2

Thanks
Isaac

On Jan 21, 2016, at 11:03 AM, Ted Yu 
mailto:yuzhih...@gmail.com>> wrote:

Can you provide a bit more information ?

command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?

Thanks

On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
mailto:sande...@rose-hulman.edu>> wrote:
Hey all,

I am a CS student in the United States working on my senior thesis.

My thesis uses Spark, and I am encountering some trouble.

I am using https://github.com/alitouka/spark_dbscan, and to determine 
parameters, I am using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.

I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.

I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.

It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.

I have tried:
- Increasing heap sizes and numbers of cores
- More/less executors with different amounts of resources.
- Kryo Serialization
- FAIR Scheduling

It doesn’t seem like it should require this much. Any ideas?

- Isaac








Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
You may have noticed the following - did this indicate prolonged
computation in your code ?

org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)


On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B 
wrote:

> Hadoop is: HDP 2.3.2.0-2950
>
> Here is a gist (pastebin) of my versions en masse and a stacktrace:
> https://gist.github.com/isaacsanders/2e59131758469097651b
>
> Thanks
>
> On Jan 21, 2016, at 7:44 PM, Ted Yu  wrote:
>
> Looks like you were running on YARN.
>
> What hadoop version are you using ?
>
> Can you capture a few stack traces of the AppMaster during the delay and
> pastebin them ?
>
> Thanks
>
> On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B <
> sande...@rose-hulman.edu> wrote:
>
>> The Spark Version is 1.4.1
>>
>> The logs are full of standard fare, nothing like an exception or even
>> interesting [INFO] lines.
>>
>> Here is the script I am using:
>> https://gist.github.com/isaacsanders/660f480810fbc07d4df2
>>
>> Thanks
>> Isaac
>>
>> On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:
>>
>> Can you provide a bit more information ?
>>
>> command line for submitting Spark job
>> version of Spark
>> anything interesting from driver / executor logs ?
>>
>> Thanks
>>
>> On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B <
>> sande...@rose-hulman.edu> wrote:
>>
>>> Hey all,
>>>
>>> I am a CS student in the United States working on my senior thesis.
>>>
>>> My thesis uses Spark, and I am encountering some trouble.
>>>
>>> I am using https://github.com/alitouka/spark_dbscan, and to determine
>>> parameters, I am using the utility class they supply,
>>> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.
>>>
>>> I am on a 10 node cluster with one machine with 8 cores and 32G of
>>> memory and nine machines with 6 cores and 16G of memory.
>>>
>>> I have 442M of data, which seems like it would be a joke, but the job
>>> stalls at the last stage.
>>>
>>> It was stuck in Scheduler Delay for 10 hours overnight, and I have tried
>>> a number of things for the last couple days, but nothing seems to be
>>> helping.
>>>
>>> I have tried:
>>> - Increasing heap sizes and numbers of cores
>>> - More/less executors with different amounts of resources.
>>> - Kryo Serialization
>>> - FAIR Scheduling
>>>
>>> It doesn’t seem like it should require this much. Any ideas?
>>>
>>> - Isaac
>>
>>
>>
>>
>
>


Re: General Question (Spark Hive integration )

2016-01-21 Thread Silvio Fiorito
Also, just to clarify it doesn’t read the whole table into memory unless you 
specifically cache it.

From: Silvio Fiorito 
mailto:silvio.fior...@granturing.com>>
Date: Thursday, January 21, 2016 at 10:02 PM
To: "Balaraju.Kagidala Kagidala" 
mailto:balaraju.kagid...@gmail.com>>, 
"user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: Re: General Question (Spark Hive integration )

Hi Bala,

It depends on how your Hive table is configured. If you used partitioning and 
you are filtering on a partition column then it will only load the relevant 
partitions. If, however, you’re filtering on a non-partitioned column then it 
will have to read all the data and then filter as part of the Spark job.

Thanks,
Silvio

From: "Balaraju.Kagidala Kagidala" 
mailto:balaraju.kagid...@gmail.com>>
Date: Thursday, January 21, 2016 at 9:37 PM
To: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: General Question (Spark Hive integration )

Hi ,


  I have a simple question regarding Spark Hive integration with DataFrames.

When we query a table, does Spark load the whole table into memory and
apply the filter on top of it, or does it only load the data with the filter applied?

For example, for the query 'select * from employee where deptno=10', does my
RDD load the whole employee table into memory and apply the filter, or will it load
only the dept number 10 data?


Thanks
Bala







Re: General Question (Spark Hive integration )

2016-01-21 Thread Silvio Fiorito
Hi Bala,

It depends on how your Hive table is configured. If you used partitioning and 
you are filtering on a partition column then it will only load the relevant 
partitions. If, however, you’re filtering on a non-partitioned column then it 
will have to read all the data and then filter as part of the Spark job.
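For illustration, a minimal sketch of that pattern (the table name and columns are hypothetical, using the Spark 1.x HiveContext):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("""CREATE TABLE IF NOT EXISTS employee (empno INT, name STRING)
          PARTITIONED BY (deptno INT) STORED AS PARQUET""")
// Filtering on the partition column lets Spark read only the matching partition
// directories rather than scanning the whole table.
val dept10 = hc.sql("SELECT * FROM employee WHERE deptno = 10")
dept10.show()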

Thanks,
Silvio

From: "Balaraju.Kagidala Kagidala" 
mailto:balaraju.kagid...@gmail.com>>
Date: Thursday, January 21, 2016 at 9:37 PM
To: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: General Question (Spark Hive integration )

Hi ,


  I have a simple question regarding Spark Hive integration with DataFrames.

When we query a table, does Spark load the whole table into memory and
apply the filter on top of it, or does it only load the data with the filter applied?

For example, for the query 'select * from employee where deptno=10', does my
RDD load the whole employee table into memory and apply the filter, or will it load
only the dept number 10 data?


Thanks
Bala







General Question (Spark Hive integration )

2016-01-21 Thread Balaraju.Kagidala Kagidala
Hi ,


  I have a simple question regarding Spark Hive integration with DataFrames.

When we query a table, does Spark load the whole table into memory and
apply the filter on top of it, or does it only load the data with the filter
applied?

For example, for the query 'select * from employee where deptno=10', does
my RDD load the whole employee table into memory and apply the filter, or will it
load only the dept number 10 data?


Thanks
Bala


Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
Two changes I made that appear to be keeping various errors at bay:

1) bumped up spark.yarn.executor.memoryOverhead to 2000 in the spirit of
https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3ccacbyxkld8qasymj2ghk__vttzv4gejczcqfaw++s1d5te1d...@mail.gmail.com%3E
, even though I couldn't find the same error in my YARN log.

2) very important: I ran coalesce(1000) on the RDD at the start of the DAG.
I know keeping the # of partitions lower is helpful, based on past
experience with groupByKey. I haven't run this pipeline in a bit so that
rule of thumb was not forefront in my mind.
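For reference, those two changes look roughly like this (a sketch; the app name and input path are placeholders):

val conf = new org.apache.spark.SparkConf()
  .setAppName("my-pipeline")
  .set("spark.yarn.executor.memoryOverhead", "2000")   // change 1
val sc = new org.apache.spark.SparkContext(conf)

val input = sc.textFile("hdfs:///path/to/input")
  .coalesce(1000)                                       // change 2: cap the partition count at the start of the DAG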

On Thu, Jan 21, 2016 at 5:35 PM, Arun Luthra  wrote:

> Looking into the yarn logs for a similar job where an executor was
> associated with the same error, I find:
>
> ...
> 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive
> connection to (SERVER), creating a new one.
> 16/01/22 01:17:18 *ERROR shuffle.RetryingBlockFetcher: Exception while
> beginning fetch of 46 outstanding blocks*
> *java.io.IOException: Failed to connect to (SERVER)*
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
> at
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:265)
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:112)
> at
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:43)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> *Caused by: java.net.ConnectException: Connection refused:* (SERVER)
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> at
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
> at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> ... 1 more
>
> ...
>
>
> Not sure if this reveals anything at all.
>
>
> On Thu, Jan 21, 2016 at 2:58 PM, Holden Karau 
> wrote:
>
>> My hunch is that the TaskCommitDenied is perhaps a red herring and the
>> problem is groupByKey - but I've also just seen a lot of people be bitten
>> by it, so that might not be the issue. If you just do a count at the point of
>> the groupByKey, does the pipeline succeed?
>>
>> On Thu, Jan 21, 2016 at 

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
Looking into the yarn logs for a similar job where an executor was
associated with the same error, I find:

...
16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive
connection to (SERVER), creating a new one.
16/01/22 01:17:18 *ERROR shuffle.RetryingBlockFetcher: Exception while
beginning fetch of 46 outstanding blocks*
*java.io.IOException: Failed to connect to (SERVER)*
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
at
org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
at
org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:265)
at
org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:112)
at
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:43)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
*Caused by: java.net.ConnectException: Connection refused:* (SERVER)
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more

...


Not sure if this reveals anything at all.


On Thu, Jan 21, 2016 at 2:58 PM, Holden Karau  wrote:

> My hunch is that the TaskCommitDenied is perhaps a red herring and the
> problem is groupByKey - but I've also just seen a lot of people be bitten
> by it, so that might not be the issue. If you just do a count at the point of
> the groupByKey, does the pipeline succeed?
>
> On Thu, Jan 21, 2016 at 2:56 PM, Arun Luthra 
> wrote:
>
>> Usually the pipeline works, it just failed on this particular input data.
>> The other data it has run on is of similar size.
>>
>> Speculation is enabled.
>>
>> I'm using Spark 1.5.0.
>>
>> Here is the config. Many of these may not be needed anymore, they are
>> from trying to get things working in Spark 1.2 and 1.3.
>>
>> .set("spark.storage.memoryFraction","0.2") // default 0.6
>> .set("spark.shuffle.memoryFraction","0.2") // default 0.2
>> .set("spark.shuffle.manager","SORT") // preferred setting for
>> optimized joins
>> .set("spark.shuffle.consolidateFiles","true") // helpful for "too
>> many files open"
>> .set("spark.mesos.coarse", "true") // helpful for
>> MapOutputTracker errors?
>> .set("spark.akka.frameSize","300") // he

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Darren Govoni


I've experienced this same problem. Always the last stage hangs. Indeterminate. 
No errors in logs. I run spark 1.5.2. Can't find an explanation. But it's 
definitely a showstopper.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Ted Yu  
Date: 01/21/2016  7:44 PM  (GMT-05:00) 
To: "Sanders, Isaac B"  
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 

Looks like you were running on YARN.
What hadoop version are you using ?
Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?
Thanks
On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B  
wrote:





The Spark Version is 1.4.1



The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.



Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2



Thanks
Isaac




On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:



Can you provide a bit more information ?



command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?



Thanks 





On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
 wrote:


Hey all,



I am a CS student in the United States working on my senior thesis.



My thesis uses Spark, and I am encountering some trouble.



I am using 
https://github.com/alitouka/spark_dbscan, and to determine parameters, I am 
using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.



I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.



I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.



It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.



I have tried:

- Increasing heap sizes and numbers of cores

- More/less executors with different amounts of resources.

- Kryo Serialization

- FAIR Scheduling



It doesn’t seem like it should require this much. Any ideas?



- Isaac















Re: 10hrs of Scheduler Delay

2016-01-21 Thread Sanders, Isaac B
Hadoop is: HDP 2.3.2.0-2950

Here is a gist (pastebin) of my versions en masse and a stacktrace: 
https://gist.github.com/isaacsanders/2e59131758469097651b

Thanks

On Jan 21, 2016, at 7:44 PM, Ted Yu 
mailto:yuzhih...@gmail.com>> wrote:

Looks like you were running on YARN.

What hadoop version are you using ?

Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?

Thanks

On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
mailto:sande...@rose-hulman.edu>> wrote:
The Spark Version is 1.4.1

The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.

Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2

Thanks
Isaac

On Jan 21, 2016, at 11:03 AM, Ted Yu 
mailto:yuzhih...@gmail.com>> wrote:

Can you provide a bit more information ?

command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?

Thanks

On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
mailto:sande...@rose-hulman.edu>> wrote:
Hey all,

I am a CS student in the United States working on my senior thesis.

My thesis uses Spark, and I am encountering some trouble.

I am using https://github.com/alitouka/spark_dbscan, and to determine 
parameters, I am using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.

I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.

I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.

It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.

I have tried:
- Increasing heap sizes and numbers of cores
- More/less executors with different amounts of resources.
- Kryo Serialization
- FAIR Scheduling

It doesn’t seem like it should require this much. Any ideas?

- Isaac






Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
Looks like you were running on YARN.

What hadoop version are you using ?

Can you capture a few stack traces of the AppMaster during the delay and
pastebin them ?

Thanks

On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
wrote:

> The Spark Version is 1.4.1
>
> The logs are full of standard fare, nothing like an exception or even
> interesting [INFO] lines.
>
> Here is the script I am using:
> https://gist.github.com/isaacsanders/660f480810fbc07d4df2
>
> Thanks
> Isaac
>
> On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:
>
> Can you provide a bit more information ?
>
> command line for submitting Spark job
> version of Spark
> anything interesting from driver / executor logs ?
>
> Thanks
>
> On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B <
> sande...@rose-hulman.edu> wrote:
>
>> Hey all,
>>
>> I am a CS student in the United States working on my senior thesis.
>>
>> My thesis uses Spark, and I am encountering some trouble.
>>
>> I am using https://github.com/alitouka/spark_dbscan, and to determine
>> parameters, I am using the utility class they supply,
>> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.
>>
>> I am on a 10 node cluster with one machine with 8 cores and 32G of memory
>> and nine machines with 6 cores and 16G of memory.
>>
>> I have 442M of data, which seems like it would be a joke, but the job
>> stalls at the last stage.
>>
>> It was stuck in Scheduler Delay for 10 hours overnight, and I have tried
>> a number of things for the last couple days, but nothing seems to be
>> helping.
>>
>> I have tried:
>> - Increasing heap sizes and numbers of cores
>> - More/less executors with different amounts of resources.
>> - Kryo Serialization
>> - FAIR Scheduling
>>
>> It doesn’t seem like it should require this much. Any ideas?
>>
>> - Isaac
>
>
>
>


Re: Job History Logs for spark jobs submitted on YARN

2016-01-21 Thread nsalian
Hello,

Thanks for the question.
1) Typically the Resource Manager in YARN would print out the Aggregate
Resource Allocation for the application after you have found the specific
application using the application id.

2) As with MapReduce, there is a parameter that is part of either
spark-defaults.conf or the application-specific configuration (see the sketch after this list).
spark.eventLog.dir=hdfs://:8020/user/spark/applicationHistory
This is where the Spark History Server gets the information after the
application is completed.

3) In the Spark History Server there are tabs that let you look
at the information you need:
Jobs
Stages
Storage
Environment
Executors

The Executors tab in particular will give a bit more detailed information:
Storage Memory  
Disk Used

Hopefully that helps.
Thank you.




-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-History-Logs-for-spark-jobs-submitted-on-YARN-tp25946p26043.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
My hunch is that the TaskCommitDenied is perhaps a red herring and the
problem is groupByKey - but I've also just seen a lot of people be bitten
by it, so that might not be the issue. If you just do a count at the point of
the groupByKey, does the pipeline succeed?

On Thu, Jan 21, 2016 at 2:56 PM, Arun Luthra  wrote:

> Usually the pipeline works, it just failed on this particular input data.
> The other data it has run on is of similar size.
>
> Speculation is enabled.
>
> I'm using Spark 1.5.0.
>
> Here is the config. Many of these may not be needed anymore, they are from
> trying to get things working in Spark 1.2 and 1.3.
>
> .set("spark.storage.memoryFraction","0.2") // default 0.6
> .set("spark.shuffle.memoryFraction","0.2") // default 0.2
> .set("spark.shuffle.manager","SORT") // preferred setting for
> optimized joins
> .set("spark.shuffle.consolidateFiles","true") // helpful for "too
> many files open"
> .set("spark.mesos.coarse", "true") // helpful for MapOutputTracker
> errors?
> .set("spark.akka.frameSize","300") // helpful when using
> consolidateFiles=true
> .set("spark.shuffle.compress","false") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
> .set("spark.file.transferTo","false") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
> .set("spark.core.connection.ack.wait.timeout","600") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
> .set("spark.speculation","true")
> .set("spark.worker.timeout","600") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
> .set("spark.akka.timeout","300") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
> .set("spark.storage.blockManagerSlaveTimeoutMs","12")
> .set("spark.driver.maxResultSize","2048") // in response to error:
> Total size of serialized results of 39901 tasks (1024.0 MB) is bigger than
> spark.driver.maxResultSize (1024.0 MB)
> .set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer")
> .set("spark.kryo.registrator","--.MyRegistrator")
> .set("spark.kryo.registrationRequired", "true")
> .set("spark.yarn.executor.memoryOverhead","600")
>
> On Thu, Jan 21, 2016 at 2:50 PM, Josh Rosen 
> wrote:
>
>> Is speculation enabled? This TaskCommitDenied by driver error is thrown
>> by writers who lost the race to commit an output partition. I don't think
>> this had anything to do with key skew etc. Replacing the groupbykey with a
>> count will mask this exception because the coordination does not get
>> triggered in non save/write operations.
>>
>> On Thu, Jan 21, 2016 at 2:46 PM Holden Karau 
>> wrote:
>>
>>> Before we dig too far into this, the thing which most quickly jumps out
>>> to me is groupByKey which could be causing some problems - whats the
>>> distribution of keys like? Try replacing the groupByKey with a count() and
>>> see if the pipeline works up until that stage. Also 1G of driver memory is
>>> also a bit small for something with 90 executors...
>>>
>>> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra 
>>> wrote:
>>>


 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable

 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler
 for source because spark.app.id is not set.

 spark.yarn.driver.memoryOverhead is set but does not apply in client
 mode.

 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local
 reads feature cannot be used because libhadoop cannot be loaded.

 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache
 broadcast_4 in memory! (computed 60.2 MB so far)

 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to
 disk instead.

 [Stage 1:>(2260 +
 7) / 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage
 1.0 (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job:
 1, partition: 1440, attempt: 4530

 [Stage 1:>(2260 +
 6) / 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage
 1.0 (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job:
 1, partition: 1488, attempt: 4531

 [Stage 1:>(2261 +
 4) / 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage
 1.0 (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job:
 1, partition: 1982, attempt: 4532

 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0
 (TID 4482, --): TaskCommitDenied (D

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
Usually the pipeline works, it just failed on this particular input data.
The other data it has run on is of similar size.

Speculation is enabled.

I'm using Spark 1.5.0.

Here is the config. Many of these may not be needed anymore, they are from
trying to get things working in Spark 1.2 and 1.3.

.set("spark.storage.memoryFraction","0.2") // default 0.6
.set("spark.shuffle.memoryFraction","0.2") // default 0.2
.set("spark.shuffle.manager","SORT") // preferred setting for
optimized joins
.set("spark.shuffle.consolidateFiles","true") // helpful for "too
many files open"
.set("spark.mesos.coarse", "true") // helpful for MapOutputTracker
errors?
.set("spark.akka.frameSize","300") // helpful when using
consolidateFiles=true
.set("spark.shuffle.compress","false") //
http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
.set("spark.file.transferTo","false") //
http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
.set("spark.core.connection.ack.wait.timeout","600") //
http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
.set("spark.speculation","true")
.set("spark.worker.timeout","600") //
http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
.set("spark.akka.timeout","300") //
http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
.set("spark.storage.blockManagerSlaveTimeoutMs","12")
.set("spark.driver.maxResultSize","2048") // in response to error:
Total size of serialized results of 39901 tasks (1024.0 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB)
.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator","--.MyRegistrator")
.set("spark.kryo.registrationRequired", "true")
.set("spark.yarn.executor.memoryOverhead","600")

On Thu, Jan 21, 2016 at 2:50 PM, Josh Rosen 
wrote:

> Is speculation enabled? This TaskCommitDenied by driver error is thrown by
> writers who lost the race to commit an output partition. I don't think this
> had anything to do with key skew etc. Replacing the groupbykey with a count
> will mask this exception because the coordination does not get triggered in
> non save/write operations.
>
> On Thu, Jan 21, 2016 at 2:46 PM Holden Karau  wrote:
>
>> Before we dig too far into this, the thing which most quickly jumps out
>> to me is groupByKey which could be causing some problems - whats the
>> distribution of keys like? Try replacing the groupByKey with a count() and
>> see if the pipeline works up until that stage. Also 1G of driver memory is
>> also a bit small for something with 90 executors...
>>
>> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra 
>> wrote:
>>
>>>
>>>
>>> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>>
>>> 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler
>>> for source because spark.app.id is not set.
>>>
>>> spark.yarn.driver.memoryOverhead is set but does not apply in client
>>> mode.
>>>
>>> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local
>>> reads feature cannot be used because libhadoop cannot be loaded.
>>>
>>> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache
>>> broadcast_4 in memory! (computed 60.2 MB so far)
>>>
>>> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to disk
>>> instead.
>>>
>>> [Stage 1:>(2260 + 7)
>>> / 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage 1.0
>>> (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>> partition: 1440, attempt: 4530
>>>
>>> [Stage 1:>(2260 + 6)
>>> / 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage 1.0
>>> (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>> partition: 1488, attempt: 4531
>>>
>>> [Stage 1:>(2261 + 4)
>>> / 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage 1.0
>>> (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>> partition: 1982, attempt: 4532
>>>
>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0
>>> (TID 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>> partition: 2214, attempt: 4482
>>>
>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>> (TID 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>> partition: 2168, attempt: 4436
>>>
>>>
>>> I am running with:
>>>
>>> spark-submit --class "myclass" \
>>>
>>>   --num-executors 90 \
>>>
>>>   --driver-memory 1g \
>>>
>>>   --executor-memory 60g \
>>>
>>> 

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Josh Rosen
Is speculation enabled? This TaskCommitDenied by driver error is thrown by
writers who lost the race to commit an output partition. I don't think this
had anything to do with key skew etc. Replacing the groupbykey with a count
will mask this exception because the coordination does not get triggered in
non save/write operations.
On Thu, Jan 21, 2016 at 2:46 PM Holden Karau  wrote:

> Before we dig too far into this, the thing which most quickly jumps out to
> me is groupByKey which could be causing some problems - whats the
> distribution of keys like? Try replacing the groupByKey with a count() and
> see if the pipeline works up until that stage. Also 1G of driver memory is
> also a bit small for something with 90 executors...
>
> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra 
> wrote:
>
>>
>>
>> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>>
>> 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler for
>> source because spark.app.id is not set.
>>
>> spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
>>
>> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local reads
>> feature cannot be used because libhadoop cannot be loaded.
>>
>> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache broadcast_4
>> in memory! (computed 60.2 MB so far)
>>
>> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to disk
>> instead.
>>
>> [Stage 1:>(2260 + 7)
>> / 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage 1.0
>> (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 1440, attempt: 4530
>>
>> [Stage 1:>(2260 + 6)
>> / 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage 1.0
>> (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 1488, attempt: 4531
>>
>> [Stage 1:>(2261 + 4)
>> / 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage 1.0
>> (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 1982, attempt: 4532
>>
>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0 (TID
>> 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 2214, attempt: 4482
>>
>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
>> 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 2168, attempt: 4436
>>
>>
>> I am running with:
>>
>> spark-submit --class "myclass" \
>>
>>   --num-executors 90 \
>>
>>   --driver-memory 1g \
>>
>>   --executor-memory 60g \
>>
>>   --executor-cores 8 \
>>
>>   --master yarn-client \
>>
>>   --conf "spark.executor.extraJavaOptions=-verbose:gc
>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>
>>   my.jar
>>
>>
>> There are 2262 input files totaling just 98.6G. The DAG is basically
>> textFile().map().filter().groupByKey().saveAsTextFile().
>>
>> On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau 
>> wrote:
>>
>>> Can you post more of your log? How big are the partitions? What is the
>>> action you are performing?
>>>
>>> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra 
>>> wrote:
>>>
 Example warning:

 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
 (TID 4436, XXX): TaskCommitDenied (Driver denied task commit) for job:
 1, partition: 2168, attempt: 4436


 Is there a solution for this? Increase driver memory? I'm using just 1G
 driver memory but ideally I won't have to increase it.

 The RDD being processed has 2262 partitions.

 Arun

>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
Before we dig too far into this, the thing which most quickly jumps out to
me is groupByKey, which could be causing some problems - what's the
distribution of keys like? Try replacing the groupByKey with a count() and
see if the pipeline works up until that stage. Also, 1G of driver memory is
a bit small for something with 90 executors...
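Concretely, the check could look something like this (a sketch with placeholder parse and filter logic, based on the textFile().map().filter().groupByKey().saveAsTextFile() DAG described below):

val records = sc.textFile("hdfs:///path/to/input")
  .map(line => { val fields = line.split(","); (fields(0), fields(1)) })   // placeholder parse
  .filter { case (_, value) => value.nonEmpty }                            // placeholder filter

records.count()   // if this completes, the trouble is in the groupByKey/save stages
// records.groupByKey().saveAsTextFile("hdfs:///path/to/output")           // original tail of the DAG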

On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra  wrote:

>
>
> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler for
> source because spark.app.id is not set.
>
> spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
>
> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local reads
> feature cannot be used because libhadoop cannot be loaded.
>
> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache broadcast_4
> in memory! (computed 60.2 MB so far)
>
> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to disk
> instead.
>
> [Stage 1:>(2260 + 7) /
> 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage 1.0
> (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 1440, attempt: 4530
>
> [Stage 1:>(2260 + 6) /
> 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage 1.0
> (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 1488, attempt: 4531
>
> [Stage 1:>(2261 + 4) /
> 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage 1.0
> (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 1982, attempt: 4532
>
> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0 (TID
> 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 2214, attempt: 4482
>
> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
> 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 2168, attempt: 4436
>
>
> I am running with:
>
> spark-submit --class "myclass" \
>
>   --num-executors 90 \
>
>   --driver-memory 1g \
>
>   --executor-memory 60g \
>
>   --executor-cores 8 \
>
>   --master yarn-client \
>
>   --conf "spark.executor.extraJavaOptions=-verbose:gc
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>
>   my.jar
>
>
> There are 2262 input files totaling just 98.6G. The DAG is basically
> textFile().map().filter().groupByKey().saveAsTextFile().
>
> On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau 
> wrote:
>
>> Can you post more of your log? How big are the partitions? What is the
>> action you are performing?
>>
>> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra 
>> wrote:
>>
>>> Example warning:
>>>
>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>> (TID 4436, XXX): TaskCommitDenied (Driver denied task commit) for job:
>>> 1, partition: 2168, attempt: 4436
>>>
>>>
>>> Is there a solution for this? Increase driver memory? I'm using just 1G
>>> driver memory but ideally I won't have to increase it.
>>>
>>> The RDD being processed has 2262 partitions.
>>>
>>> Arun
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler for
source because spark.app.id is not set.

spark.yarn.driver.memoryOverhead is set but does not apply in client mode.

16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.

16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache broadcast_4
in memory! (computed 60.2 MB so far)

16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to disk
instead.

[Stage 1:>(2260 + 7) /
2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage 1.0
(TID 4530, --): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 1440, attempt: 4530

[Stage 1:>(2260 + 6) /
2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage 1.0
(TID 4531, --): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 1488, attempt: 4531

[Stage 1:>(2261 + 4) /
2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage 1.0
(TID 4532, --): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 1982, attempt: 4532

16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0 (TID
4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 2214, attempt: 4482

16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 2168, attempt: 4436


I am running with:

spark-submit --class "myclass" \

  --num-executors 90 \

  --driver-memory 1g \

  --executor-memory 60g \

  --executor-cores 8 \

  --master yarn-client \

  --conf "spark.executor.extraJavaOptions=-verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \

  my.jar


There are 2262 input files totaling just 98.6G. The DAG is basically
textFile().map().filter().groupByKey().saveAsTextFile().

On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau  wrote:

> Can you post more of your log? How big are the partitions? What is the
> action you are performing?
>
> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra 
> wrote:
>
>> Example warning:
>>
>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
>> 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 2168, attempt: 4436
>>
>>
>> Is there a solution for this? Increase driver memory? I'm using just 1G
>> driver memory but ideally I won't have to increase it.
>>
>> The RDD being processed has 2262 partitions.
>>
>> Arun
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
Can you post more of your log? How big are the partitions? What is the
action you are performing?

On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra  wrote:

> Example warning:
>
> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
> 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 2168, attempt: 4436
>
>
> Is there a solution for this? Increase driver memory? I'm using just 1G
> driver memory but ideally I won't have to increase it.
>
> The RDD being processed has 2262 partitions.
>
> Arun
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


MemoryStore: Not enough space to cache broadcast_N in memory

2016-01-21 Thread Arun Luthra
WARN MemoryStore: Not enough space to cache broadcast_4 in memory!
(computed 60.2 MB so far)
WARN MemoryStore: Persisting block broadcast_4 to disk instead.


Can I increase the memory allocation for broadcast variables?

I have a few broadcast variables that I create with sc.broadcast() . Are
these labeled starting from 0 or from 1 (in reference to "broadcast_N")? I
want to debug/track down which one is offending.

As a feature request, it would be good if there were an optional argument
(or perhaps a required argument) added to sc.broadcast() so that we could
give it an internal label. Then it would work the same as the
sc.accumulator() "name" argument. It would enable more useful warn/error
messages.

Arun


Re: Getting Co-oefficients of a logistic regression model for a pipelinemodel Spark ML library

2016-01-21 Thread Holden Karau
Hi Vinayaka,

You can access the different stages in your pipeline through the stages
array on your pipeline model (
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.PipelineModel
) and then cast it to the correct stage type (if working in Scala; in
Python you can just access the coefficients at that point). What
programming language are you working in?
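
For example, in Scala something like this works (a rough sketch; it assumes
the fitted logistic regression is the last stage of your PipelineModel, and
that you are on Spark 1.6, where the field is called coefficients; on 1.5 it
is weights):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel

// pipelineModel is the PipelineModel returned by pipeline.fit(training)
val lrModel = pipelineModel.stages.last.asInstanceOf[LogisticRegressionModel]
println(lrModel.coefficients)  // coefficient vector of the logistic regression stage
println(lrModel.intercept)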

Cheers,

Holden :)

On Thu, Jan 21, 2016 at 2:05 PM, Vinayak Agrawal  wrote:

> Hi All,
> I am working with Spark ML package, NOT mllib.
> I have a working pipelinemodel. I am trying to get the co-efficients of my
> model but I can't find a method to do so.
> Documentation here
>
> http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression
>
> shows how to get co-efficient for a logistic regression model but not
> pipeline model.
> I found an old archive mail asking the same but it does not have a
> solution.
>
> https://www.mail-archive.com/search?l=iss...@spark.apache.org&q=subject:%22\[jira\]+\[Commented\]+\%28SPARK\-9492\%29+LogisticRegression+in+R+should+provide+model+statistics%22&o=newest&f=1
> 
>
> Any suggestions?
> --
> Vinayak Agrawal
>
> "To Strive, To Seek, To Find and Not to Yield!"
> ~Lord Alfred Tennyson
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Getting Co-oefficients of a logistic regression model for a pipelinemodel Spark ML library

2016-01-21 Thread Vinayak Agrawal
Hi All,
I am working with Spark ML package, NOT mllib.
I have a working pipelinemodel. I am trying to get the co-efficients of my
model but I can't find a method to do so.
Documentation here
http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression

shows how to get co-efficient for a logistic regression model but not
pipeline model.
I found an old archive mail asking the same but it does not have a
solution.
https://www.mail-archive.com/search?l=iss...@spark.apache.org&q=subject:%22\[jira\]+\[Commented\]+\%28SPARK\-9492\%29+LogisticRegression+in+R+should+provide+model+statistics%22&o=newest&f=1

Any suggestions?
-- 
Vinayak Agrawal

"To Strive, To Seek, To Find and Not to Yield!"
~Lord Alfred Tennyson


TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
Example warning:

16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 2168, attempt: 4436


Is there a solution for this? Increase driver memory? I'm using just 1G
driver memory but ideally I won't have to increase it.

The RDD being processed has 2262 partitions.

Arun


Re: process of executing a program in a distributed environment without hadoop

2016-01-21 Thread nsalian
Thanks for the question.

The documentation here:
https://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
lists a variety of submission techniques.
You can vary the master URL to suit your needs, whether it be local, YARN or
Mesos.





-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/process-of-executing-a-program-in-a-distributed-environment-without-hadoop-tp26015p26039.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Date / time stuff with spark.

2016-01-21 Thread Andrew Holway
P.S. We are working with Python.

On Thu, Jan 21, 2016 at 8:24 PM, Andrew Holway
 wrote:
> Hello,
>
> I am importing this data from HDFS into a data frame with
> sqlContext.read.json().
>
> {"a": 42, "a": 56, "Id": "621368e2f829f230", "smunkId":
> "CKm26sDMucoCFReRGwodbHAAgw", "popsicleRange": "17610", "time":
> "2016-01-20T23:59:53+00:00"}
>
> I want to do some date/time operations on this json data but I cannot
> find clear documentation on how to
>
> A) specify the "time" field as a date/time in the schema.
> B) what format the date should be in, in the raw data, so that it imports
> correctly and easily.
>
> Cheers,
>
> Andrew



-- 
Otter Networks UG
http://otternetworks.de
fon: +49 30 54 88 5197
Gotenstraße 17
10829 Berlin

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Date / time stuff with spark.

2016-01-21 Thread Andrew Holway
Hello,

I am importing this data from HDFS into a data frame with
sqlContext.read.json().

{"a": 42, "a": 56, "Id": "621368e2f829f230", "smunkId":
"CKm26sDMucoCFReRGwodbHAAgw", "popsicleRange": "17610", "time":
"2016-01-20T23:59:53+00:00"}

I want to do some date/time operations on this json data but I cannot
find clear documentation on how to

A) specify the "time" field as a date/time in the schema.
B) what format the date should be in, in the raw data, so that it imports
correctly and easily.

Cheers,

Andrew

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No plan for BroadcastHint when attempting broadcastjoin

2016-01-21 Thread Sebastian Piu
I made some modifications to my code where I broadcast the DataFrame I want
to join directly through the SparkContext, and that seems to work as
expected. Still don't understand what is going wrong with the missing Plan.
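
For reference, a minimal sketch of that kind of work-around (smallDF, bigDF
and the join key are placeholder names; the small side has to fit
comfortably in driver memory):

// collect the small side on the driver and broadcast it explicitly
val smallRows = smallDF.collect()
val lookup = sc.broadcast(smallRows.map(r => r.getString(0) -> r).toMap)

// map-side "join" of the big side against the broadcast map
val joined = bigDF.rdd.flatMap { row =>
  lookup.value.get(row.getString(0)).map(small => (row, small))
}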

On Thu, Jan 21, 2016 at 3:36 PM, Ted Yu  wrote:

> Modified subject to reflect new error encountered.
>
> Interesting - SPARK-12275 is marked fixed against 1.6.0
>
> On Thu, Jan 21, 2016 at 7:30 AM, Sebastian Piu 
> wrote:
>
>> I'm using Spark 1.6.0.
>>
>> I tried removing Kryo and reverting back to Java Serialisation, and get a
>> different error which maybe points in the right direction...
>>
>> java.lang.AssertionError: assertion failed: No plan for BroadcastHint
>> +- InMemoryRelation
>> [tradeId#30,tradeVersion#31,agreement#49,counterParty#38], true, 1,
>> StorageLevel(true, true, false, true, 1), Union,
>> Some(ingest_all_union_trades)
>>
>> at scala.Predef$.assert(Predef.scala:179)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>> at
>> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>> at
>> org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:105)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>> at
>> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>> at
>> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:217)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>> at
>> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:47)
>> at
>> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:45)
>> at
>> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:52)
>> at
>> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:52)
>> at
>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>> at
>> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>> at
>> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>> at
>> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:127)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:125)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>> at
>> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:125)
>> at
>> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>> at
>> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala

Re: Re: --driver-java-options not support multiple JVM configuration ?

2016-01-21 Thread Marcelo Vanzin
That's because you should have used the other, correct version,
separated by spaces instead of commas.
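
In other words, something like this (one quoted string, options separated by
spaces; only a subset of your flags is shown here):

--driver-java-options "-XX:MaxPermSize=512m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC"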

On Wed, Jan 20, 2016 at 9:55 PM, our...@cnsuning.com
 wrote:
> Marcelo,
> error also exists  with quotes around "$sparkdriverextraJavaOptions":
>
> Unrecognized VM option
> 'newsize=2096m,-XX:MaxPermSize=512m,-XX:+PrintGCDetails,-XX:+PrintGCTimeStamps,-XX:+UseParNewGC,-XX:+UseConcMarkSweepGC,-XX:CMSInitiatingOccupancyFraction=80,-XX:GCTimeLimit=5,-XX:GCHeapFreeLimit=95'
>
>
>
>
> From: Marcelo Vanzin
> Date: 2016-01-21 12:09
> To: our...@cnsuning.com
> CC: user
> Subject: Re: --driver-java-options not support multiple JVM configuration ?
> On Wed, Jan 20, 2016 at 7:38 PM, our...@cnsuning.com
>  wrote:
>> --driver-java-options $sparkdriverextraJavaOptions \
>
> You need quotes around "$sparkdriverextraJavaOptions".
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Yarn executor memory overhead content

2016-01-21 Thread Marcelo Vanzin
On Thu, Jan 21, 2016 at 5:42 AM, Olivier Devoisin
 wrote:
> The documentation states that it contains VM overheads, interned strings and 
> other native overheads. However it's really vague.

It's intentionally vague, because it's "everything that is not Java
objects". That includes JVM internal memory, direct buffers, memory
used by native libraries (e.g. snappy), and all sorts of other things.
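
If you need more of it, the knob is spark.yarn.executor.memoryOverhead (and
its driver counterpart), e.g. (value in MB on 1.x; 2048 is just an
illustration, tune it for your job):

  --conf spark.yarn.executor.memoryOverhead=2048

or the equivalent line in spark-defaults.conf.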

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Recovery for Spark Streaming Kafka Direct with OffsetOutOfRangeException

2016-01-21 Thread Cody Koeninger
Looks like this response did go to the list.

As far as OffsetOutOfRange goes, right now that's an unrecoverable error,
because it breaks the underlying invariants (e.g. that the number of
messages in a partition is deterministic once the RDD is defined)

If you want to do some hacking for your own purposes, the place to start
looking would be in KafkaRDD.scala, in fetchBatch.  Just be aware that's a
situation where data has been lost, so you can't get the "right" answer,
you just have to decide what variety of wrong answer you want to get :)


On Thu, Jan 21, 2016 at 11:11 AM, Dan Dutrow  wrote:

> Hey Cody, I would have responded to the mailing list but it looks like
> this thread got aged off. I have the problem where one of my topics dumps
> more data than my spark job can keep up with. We limit the input rate with
> maxRatePerPartition. Eventually, when the data is aged off, I get the
> OffsetOutOfRangeException from Kafka, as we would expect. As we work
> towards more efficient processing of that topic, or get more resources, I'd
> like to be able to log the error and continue the application without
> failing. Is there a place where I can catch that error before it gets to
> org.apache.spark.util.Utils$.logUncaughtExceptions ? Maybe somewhere in
> DirectKafkaInputDStream::compute?
>
>
> https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAKWX9VUoNd4ATGF+0TkNJ+9=b8r2nr9pt7sbgr-bv4nnttk...@mail.gmail.com%3E
> --
> Dan ✆
>


Re: SparkContext SyntaxError: invalid syntax

2016-01-21 Thread Andrew Weiner
Thanks Felix.  I think I was missing gem install pygments.rb and I also had
to roll back to Python 2.7 but I got it working.  I submitted the PR
submitted with the added explanation in the docs.

Andrew

On Wed, Jan 20, 2016 at 1:44 AM, Felix Cheung 
wrote:

>
> I have to run this to install the pre-req to get jeykyll build to work,
> you do need the python pygments package:
>
> (I’m on ubuntu)
> sudo apt-get install ruby ruby-dev make gcc nodejs
> sudo gem install jekyll --no-rdoc --no-ri
> sudo gem install jekyll-redirect-from
> sudo apt-get install python-Pygments
> sudo apt-get install python-sphinx
> sudo gem install pygments.rb
>
>
> Hope that helps!
> If not, I can try putting together doc change but I’d rather you could
> make progress :)
>
>
>
>
>
> On Mon, Jan 18, 2016 at 6:36 AM -0800, "Andrew Weiner" <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Hi Felix,
>
> Yeah, when I try to build the docs using jekyll build, I get a LoadError
> (cannot load such file -- pygments) and I'm having trouble getting past it
> at the moment.
>
> From what I could tell, this does not apply to YARN in client mode.  I was
> able to submit jobs in client mode and they would run fine without using
> the appMasterEnv property.  I even confirmed that my environment variables
> persisted during the job when run in client mode.  There is something about
> YARN cluster mode that uses a different environment (the YARN Application
> Master environment) and requires the appMasterEnv property for setting
> environment variables.
>
> On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung 
> wrote:
>
> Do you still need help on the PR?
> btw, does this apply to YARN client mode?
>
> --
> From: andrewweiner2...@u.northwestern.edu
> Date: Sun, 17 Jan 2016 17:00:39 -0600
> Subject: Re: SparkContext SyntaxError: invalid syntax
> To: cutl...@gmail.com
> CC: user@spark.apache.org
>
>
> Yeah, I do think it would be worth explicitly stating this in the docs.  I
> was going to try to edit the docs myself and submit a pull request, but I'm
> having trouble building the docs from github.  If anyone else wants to do
> this, here is approximately what I would say:
>
> (To be added to
> http://spark.apache.org/docs/latest/configuration.html#environment-variables
> )
> "Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file.  Environment variables
> that are set in spark-env.sh will not be reflected in the YARN
> Application Master process in cluster mode.  See the YARN-related Spark
> Properties for more information."
>
> I might take another crack at building the docs myself if nobody beats me
> to this.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler  wrote:
>
> Glad you got it going!  It wasn't very obvious what needed to be set,
> maybe it is worth explicitly stating this in the docs since it seems to
> have come up a couple times before too.
>
> Bryan
>
> On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Actually, I just found this [
> https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
> googling and reading leads me to believe that the preferred way to change
> the yarn environment is to edit the spark-defaults.conf file by adding this
> line:
> spark.yarn.appMasterEnv.PYSPARK_PYTHON/path/to/python
>
> While both this solution and the solution from my prior email work, I
> believe this is the preferred solution.
>
> Sorry for the flurry of emails.  Again, thanks for all the help!
>
> Andrew
>
> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> I finally got the pi.py example to run in yarn cluster mode.  This was the
> key insight:
> https://issues.apache.org/jira/browse/SPARK-9229
>
> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>
> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
> environment in cluster mode.
>
> Thank you for all your help!
>
> Best,
> Andrew
>
>
>
> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> I tried playing around with my environment variables, and here is an
> update.
>
> When I run in cluster mode, my environment variables do not persist
> throughout the entire job.
> For example, I tried creating a local copy of HADOOP_CONF_DIR in
> /home//local/etc/hadoop/conf, and then, in spark-env.sh I set the
> variable:
> export HADOOP_CONF_DIR=/home//local/etc/hadoop/conf
>
> Later, when we print the environment variables in the python code, I see
> this:
>
> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>
> However, when I run in client mode, I see this:
>
> ('HADOOP_CONF_DIR', '/home/aw

[ANNOUNCE] Apache Nutch 2.3.1 Release

2016-01-21 Thread lewis john mcgibbney
Hi Folks,

!!Apologies for cross posting!!

The Apache Nutch PMC are pleased to announce the immediate release of
Apache Nutch v2.3.1, we advise all current users and developers of the 2.X
series to upgrade to this release.

Nutch is a well matured, production ready Web crawler. Nutch 2.X branch is
becoming an emerging alternative taking direct inspiration from Nutch 1.X
series. 2.X differs in one key area; storage is abstracted away from any
specific underlying data store by using Apache Gora™
 for handling object to persistent data store
mappings.

The recommended Gora backends for this Nutch release are

   - Apache Avro 1.7.6
   - Apache Hadoop 1.2.1 and 2.5.2
   - Apache HBase 0.98.8-hadoop2 (although also tested with 1.X)
   - Apache Cassandra 2.0.2
   - Apache Solr 4.10.3
   - MongoDB 2.6.X
   - Apache Accumulo 1.5.1
   - Apache Spark 1.4.1

This bug fix release contains around 40 issues addressed. For a complete
overview of these issues, please see the release report.

As usual in the 2.X series, release artifacts are made available as source
only and are also available within Maven Central as a Maven dependency. The
release is available from our DOWNLOADS PAGE.

Thank you to everyone that contributed towards this release.


Re: updateStateByKey not persisting in Spark 1.5.1

2016-01-21 Thread Ted Yu
I searched for checkpoint related methods in various Listener classes but
haven't found any.

Analyzing DAG is tedious and fragile since DAG may change in future Spark
releases.
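
If you just want a rough guard in an integration test, one low-tech option is
to look at the lineage string of the state RDDs and assert that its depth
stays bounded across batches. A sketch (stateStream and the threshold are
placeholders, and toDebugString output is not a stable format):

stateStream.foreachRDD { rdd =>
  val lineageDepth = rdd.toDebugString.lines.size
  require(lineageDepth < 100,
    s"state RDD lineage depth is $lineageDepth; checkpointing may not be happening")
}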

Cheers

On Thu, Jan 21, 2016 at 8:25 AM, Brian London 
wrote:

> Thanks. It looks like extending my batch duration to 7 seconds is a
> work-around.  I'd like to build a check for the lack of checkpointing in
> our integration tests.  Is there a way to parse the DAG at runtime?
>
> On Wed, Jan 20, 2016 at 2:01 PM Ted Yu  wrote:
>
>> This is related:
>>
>> SPARK-6847
>>
>> FYI
>>
>> On Wed, Jan 20, 2016 at 7:55 AM, Brian London 
>> wrote:
>>
>>> I'm running a streaming job that has two calls to updateStateByKey.
>>> When run in standalone mode both calls to updateStateByKey behave as
>>> expected.  When run on a cluster, however, it appears that the first call
>>> is not being checkpointed as shown in this DAG image:
>>>
>>> http://i.imgur.com/zmQ8O2z.png
>>>
>>> The middle column continues to grow one level deeper every batch until I
>>> get a stack overflow error.  I'm guessing its a problem of the stateRDD not
>>> being persisted, but I can't imagine why they wouldn't be.  I thought
>>> updateStateByKey was supposed to just handle that for you internally.
>>>
>>> Any ideas?
>>>
>>> I'll post stack trace excperpts of the stack overflow if anyone is
>>> interested below:
>>>
>>> Job aborted due to stage failure: Task 7 in stage 195811.0 failed 4
>>> times, most recent failure: Lost task 7.3 in stage 195811.0 (TID 213529,
>>> ip-10-168-177-216.ec2.internal): java.lang.StackOverflowError at
>>> java.lang.Exception.(Exception.java:102) at
>>> java.lang.ReflectiveOperationException.(ReflectiveOperationException.java:89)
>>> at
>>> java.lang.reflect.InvocationTargetException.(InvocationTargetException.java:72)
>>> at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606) at
>>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897) at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>> ...
>>>
>>> And
>>>
>>> scala.collection.immutable.$colon$colon in readObject at line 362
>>> scala.collection.immutable.$colon$colon in readObject at line 366
>>> scala.collection.immutable.$colon$colon in readObject at line 362
>>> scala.collection.immutable.$colon$colon in readObject at line 362
>>> scala.collection.immutable.$colon$colon in readObject at line 366
>>> scala.collection.immutable.$colon$colon in readObject at line 362
>>> scala.collection.immutable.$colon$colon in readObject at line 362
>>> ...
>>>
>>>


Re: Spark Cassandra Java Connector: records missing despite consistency=ALL

2016-01-21 Thread Dennis Birkholz

Hi Anthony,

no, the logging is not done via Spark (but PHP). But that does not 
really matter, as the records are eventually there. So it is the 
READ_CONSISTENCY=ALL that is not working.


Btw. it seems that using withReadConf() and setting the consistency 
level there is working but I need to wait a few more days before I am 
sure of that.
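
For reference, the Scala equivalent of that looks roughly like this (a
sketch against the 1.5.x connector; keyspace and table names are
placeholders and the other ReadConf fields keep their defaults):

import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf
import com.datastax.driver.core.ConsistencyLevel

val rows = sc.cassandraTable("my_keyspace", "my_table")
  .withReadConf(ReadConf(consistencyLevel = ConsistencyLevel.ALL))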


Kind regards,
Dennis

On 19.01.2016 at 19:39, Femi Anthony wrote:

So is the logging to Cassandra being done via Spark ?

On Wed, Jan 13, 2016 at 7:17 AM, Dennis Birkholz  wrote:

Hi together,

we use Cassandra to log event data and process it every 15 minutes with
Spark. We are using the Cassandra Java Connector for Spark.

Randomly our Spark runs produce too few output records because no
data is returned from Cassandra for a several minutes window of
input data. When querying the data (with cqlsh), after multiple
tries, the data eventually becomes available.

To solve the problem, we tried to use consistency=ALL when reading
the data in Spark. We use the
CassandraJavaUtil.javafunctions().cassandraTable() method and have
set "spark.cassandra.input.consistency.level"="ALL" on the config
when creating the Spark context. The problem persists but according
to http://stackoverflow.com/a/25043599 using a consistency level of
ONE on the write side (which we use) and ALL on the READ side should
be sufficient for data consistency.

I would really appreciate if someone could give me a hint how to fix
this problem, thanks!

Greets,
Dennis

P.s.:
some information about our setup:
Cassandra 2.1.12 in a two Node configuration with replication factor=2
Spark 1.5.1
Cassandra Java Driver 2.2.0-rc3
Spark Cassandra Java Connector 2.10-1.5.0-M2

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org





--
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: retrieve cell value from a rowMatrix.

2016-01-21 Thread Srivathsan Srinivas
Hi Zhang,
  I am new to Scala and Spark. I am not a Java guy (more of a Python and R
guy). I just began playing with matrices in MLlib and it looks painful to do
simple things. If you can show me a small example, it would help. The apply
function is not available in RowMatrix.

For eg.,

import org.apache.spark.mllib.linalg.distributed.RowMatrix

/* retrive a cell value */
def getValue(m: RowMatrix): Double = {
   ???
}


Likewise, I have trouble adding two RowMatrices

/* add two RowMatrices */
def addRowMatrices(a: RowMatrix, b: RowMatrix): RowMatrix = {

}


From what I have read on Stack Overflow and other places, such simple
things are not exposed in MLlib, but they are heavily used in the
underlying Breeze libraries. Hence, one should convert the RowMatrix to
its Breeze equivalent, do the required operations and convert it back to
a RowMatrix. I am still learning how to do this kind of conversion back and
forth. If you have small examples, it would be very helpful.
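
Would something along these lines be a reasonable starting point? (An
untested sketch; it skips Breeze for the addition and simply zips the two
row RDDs, which assumes both matrices have identical partitioning and row
order.)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

/* fetch a single cell; expensive because it scans the row RDD */
def getValue(m: RowMatrix, i: Long, j: Int): Double =
  m.rows.zipWithIndex().filter(_._2 == i).first()._1(j)

/* element-wise sum of two RowMatrices with the same shape */
def addRowMatrices(a: RowMatrix, b: RowMatrix): RowMatrix = {
  val summed = a.rows.zip(b.rows).map { case (v1, v2) =>
    Vectors.dense(v1.toArray.zip(v2.toArray).map { case (x, y) => x + y })
  }
  new RowMatrix(summed)
}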


Thanks!
Srini.

On Wed, Jan 20, 2016 at 10:08 PM, zhangjp <592426...@qq.com> wrote:

>
> use apply(i,j) function.
> do you know how to save a matrix to a file using Java?
>
> -- Original Message --
> *From:* "Srivathsan Srinivas";;
> *Sent:* Thursday, January 21, 2016, 9:04 AM
> *To:* "user";
> *Subject:* retrieve cell value from a rowMatrix.
>
> Hi,
>Is there a way to retrieve the cell value of a rowMatrix? Like m(i,j)?
> The docs say that the indices are long. Maybe I am doing something
> wrong...but, there doesn't seem to be any such direct method.
>
> Any suggestions?
>
> --
> Thanks,
> Srini. 
>



-- 
Thanks,
Srini. 


Recovery for Spark Streaming Kafka Direct with OffsetOutOfRangeException

2016-01-21 Thread Dan Dutrow
Hey Cody, I would have responded to the mailing list but it looks like this
thread got aged off. I have the problem where one of my topics dumps more
data than my spark job can keep up with. We limit the input rate with
maxRatePerPartition. Eventually, when the data is aged off, I get the
OffsetOutOfRangeException from Kafka, as we would expect. As we work
towards more efficient processing of that topic, or get more resources, I'd
like to be able to log the error and continue the application without
failing. Is there a place where I can catch that error before it gets to
org.apache.spark.util.Utils$.logUncaughtExceptions ? Maybe somewhere in
DirectKafkaInputDStream::compute?

https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAKWX9VUoNd4ATGF+0TkNJ+9=b8r2nr9pt7sbgr-bv4nnttk...@mail.gmail.com%3E
-- 
Dan ✆


Re: updateStateByKey not persisting in Spark 1.5.1

2016-01-21 Thread Brian London
Thanks. It looks like extending my batch duration to 7 seconds is a
work-around.  I'd like to build a check for the lack of checkpointing in
our integration tests.  Is there a way to parse the DAG at runtime?

On Wed, Jan 20, 2016 at 2:01 PM Ted Yu  wrote:

> This is related:
>
> SPARK-6847
>
> FYI
>
> On Wed, Jan 20, 2016 at 7:55 AM, Brian London 
> wrote:
>
>> I'm running a streaming job that has two calls to updateStateByKey.  When
>> run in standalone mode both calls to updateStateByKey behave as expected.
>> When run on a cluster, however, it appears that the first call is not being
>> checkpointed as shown in this DAG image:
>>
>> http://i.imgur.com/zmQ8O2z.png
>>
>> The middle column continues to grow one level deeper every batch until I
>> get a stack overflow error.  I'm guessing its a problem of the stateRDD not
>> being persisted, but I can't imagine why they wouldn't be.  I thought
>> updateStateByKey was supposed to just handle that for you internally.
>>
>> Any ideas?
>>
>> I'll post stack trace excperpts of the stack overflow if anyone is
>> interested below:
>>
>> Job aborted due to stage failure: Task 7 in stage 195811.0 failed 4
>> times, most recent failure: Lost task 7.3 in stage 195811.0 (TID 213529,
>> ip-10-168-177-216.ec2.internal): java.lang.StackOverflowError at
>> java.lang.Exception.(Exception.java:102) at
>> java.lang.ReflectiveOperationException.(ReflectiveOperationException.java:89)
>> at
>> java.lang.reflect.InvocationTargetException.(InvocationTargetException.java:72)
>> at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606) at
>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897) at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> ...
>>
>> And
>>
>> scala.collection.immutable.$colon$colon in readObject at line 362
>> scala.collection.immutable.$colon$colon in readObject at line 366
>> scala.collection.immutable.$colon$colon in readObject at line 362
>> scala.collection.immutable.$colon$colon in readObject at line 362
>> scala.collection.immutable.$colon$colon in readObject at line 366
>> scala.collection.immutable.$colon$colon in readObject at line 362
>> scala.collection.immutable.$colon$colon in readObject at line 362
>> ...
>>
>>


Re: Passing binding variable in query used in Data Source API

2016-01-21 Thread Kevin Mellott
Another alternative that you can consider is to use Sqoop
 to move your data from PostgreSQL to HDFS, and
then just load it into your DataFrame without needing to use JDBC drivers.
I've had success using this approach, and depending on your setup you can
easily manage/schedule this type of workflow using a tool like Oozie.
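
If you go that route, the import itself is straightforward, something along
these lines (hypothetical target directory; connection details reused from
this thread):

sqoop import \
  --connect jdbc:postgresql://10.00.00.000:5432/db_test \
  --username username --password password \
  --table table1 \
  --where "dept_number = 10" \
  --target-dir /user/etl/table1

and then point sqlContext at the files it writes out.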

On Thu, Jan 21, 2016 at 8:34 AM, Todd Nist  wrote:

> Hi Satish,
>
> You should be able to do something like this:
>
>val props = new java.util.Properties()
>props.put("user", username)
>props.put("password",pwd)
>props.put("driver", "org.postgresql.Drive")
>val deptNo = 10
>val where = Some(s"dept_number = $deptNo")
>val df = sqlContext.read.jdbc("jdbc:postgresql://
> 10.00.00.000:5432/db_test?user=username&password=password
> ", "
> schema.table1", Array(where.getOrElse("")), props)
>
> or just add the filter to your query like this and I believe these should
> get pushed down.
>
>   val df = sqlContext.read
> .format("jdbc")
> .option("url", "jdbc:postgresql://
> 10.00.00.000:5432/db_test?user=username&password=password
> ")
> .option("user", username)
> .option("password", pwd)
> .option("driver", "org.postgresql.Driver")
> .option("dbtable", "schema.table1")
> .load().filter('dept_number === $deptNo)
>
> This is from the top of my head and the code has not been tested or
> compiled.
>
> HTH.
>
> -Todd
>
>
> On Thu, Jan 21, 2016 at 6:02 AM, satish chandra j <
> jsatishchan...@gmail.com> wrote:
>
>> Hi All,
>>
>> We have requirement to fetch data from source PostgreSQL database as per
>> a condition, hence need to pass a binding variable in query used in Data
>> Source API as below:
>>
>>
>> var DeptNbr = 10
>>
>> val dataSource_dF=cc.load("jdbc",Map("url"->"jdbc:postgresql://
>> 10.00.00.000:5432/db_test?user=username&password=password","driver"->"org.postgresql.Driver","dbtable"->"(select*
>> from schema.table1 where dept_number=DeptNbr) as table1"))
>>
>>
>> But it errors saying expected ';' but found '='
>>
>>
>> Note: As it is an iterative approach hence cannot use constants but need
>> to pass variable to query
>>
>>
>> If anybody had a similar implementation to pass binding variable while
>> fetching data from source database using Data Source than please provide
>> details on the same
>>
>>
>> Regards,
>>
>> Satish Chandra
>>
>
>


Re: [Spark Streaming][Problem with DataFrame UDFs]

2016-01-21 Thread Jean-Pierre OCALAN
Quick correction in the code snippet I sent in my previous email:
Line: val enrichedDF = inputDF.withColumn("semantic", udf(col("url")))
Should be replaced by: val enrichedDF = inputDF.withColumn("semantic",
enrichUDF(col("url")))



On Thu, Jan 21, 2016 at 11:07 AM, Jean-Pierre OCALAN 
wrote:

> Hi Cody,
>
> First of all thanks a lot for your quick reply, although I have removed
> this post couple of hours after posting it because I ended up finding it
> was due to the way I was using DataFrame UDFs.
>
> Essentially I didn't know that UDFs were purely lazy and in case of the
> example below the UDF gets executed 3 times on the entire data set.
>
> // let's imagine we have an input DataFrame inputDF with "url" column
>
> // semanticClassifier.classify(url) returns Map[Int, Int]
> val enrichUDF = udf { (url: String) => semanticClassifier.classify(url) }
>
> // enrichedDF will have all columns contained in inputDF + a semantic
> column which will contain result of execution of the classification on the
> url column value.
> val enrichedDF = inputDF.withColumn("semantic", udf(col("url")))
>
> val outputDF = enrichedDF.select(col("*"),
> col("semantic")(0).as("semantic1"), col("semantic")(1).as("semantic2"),
> col("semantic")(2).as("semantic3")).drop("semantic")
>
> // The udf will be executed 3 times on the entire dataset
> outputDF.count()
>
> By adding enrichedDF.persist() and unpersist() later on I was able to
> quickly solve the issue but don't really like it, so I will probably work
> directly with the RDD[Row] and schema maintenance in order to recreate
> DataFrame.
>
> Thanks again,
> JP.
>
>
>
> On Thu, Jan 21, 2016 at 10:45 AM, Cody Koeninger 
> wrote:
>
>> If you can share an isolated example I'll take a look.  Not something
>> I've run into before.
>>
>> On Wed, Jan 20, 2016 at 3:53 PM, jpocalan  wrote:
>>
>>> Hi,
>>>
>>> I have an application which creates a Kafka Direct Stream from 1 topic
>>> having 5 partitions.
>>> As a result each batch is composed of an RDD having 5 partitions.
>>> In order to apply transformation to my batch I have decided to convert
>>> the
>>> RDD to DataFrame (DF) so that I can easily add column to the initial DF
>>> by
>>> using custom UDFs.
>>>
>>> Although, when I am applying any udf to the DF I am noticing that the udf
>>> will get execute multiple times and this factor is driven by the number
>>> of
>>> partitions.
>>> For example, imagine I have a RDD with 10 records and 5 partitions
>>> ideally
>>> my UDF should get called 10 times, although it gets consistently called
>>> 50
>>> times, but the resulting DF is correct and when executing a count()
>>> properly
>>> return 10, as expected.
>>>
>>> I have changed my code to work directly with RDDs using mapPartitions and
>>> the transformation gets called proper amount of time.
>>>
>>> As additional information, I have set spark.speculation to false and no
>>> tasks failed.
>>>
>>> I am working on a smaller example that would isolate this potential
>>> issue,
>>> but in the meantime I would like to know if somebody encountered this
>>> issue.
>>>
>>> Thank you.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Problem-with-DataFrame-UDFs-tp26024.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


-- 
jean-pierre ocalan
jpoca...@gmail.com


Re: 10hrs of Scheduler Delay

2016-01-21 Thread Sanders, Isaac B
The Spark Version is 1.4.1

The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.

Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2

Thanks
Isaac

On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:

Can you provide a bit more information ?

command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?

Thanks

On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B  wrote:
Hey all,

I am a CS student in the United States working on my senior thesis.

My thesis uses Spark, and I am encountering some trouble.

I am using https://github.com/alitouka/spark_dbscan, and to determine 
parameters, I am using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.

I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.

I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.

It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.

I have tried:
- Increasing heap sizes and numbers of cores
- More/less executors with different amounts of resources.
- Kryo Serialization
- FAIR Scheduling

It doesn’t seem like it should require this much. Any ideas?

- Isaac




Re: [Spark Streaming][Problem with DataFrame UDFs]

2016-01-21 Thread Jean-Pierre OCALAN
Hi Cody,

First of all thanks a lot for your quick reply, although I have removed
this post couple of hours after posting it because I ended up finding it
was due to the way I was using DataFrame UDFs.

Essentially I didn't know that UDFs were purely lazy and in case of the
example below the UDF gets executed 3 times on the entire data set.

// let's imagine we have an input DataFrame inputDF with "url" column

// semanticClassifier.classify(url) returns Map[Int, Int]
val enrichUDF = udf { (url: String) => semanticClassifier.classify(url) }

// enrichedDF will have all columns contained in inputDF + a semantic
column which will contain result of execution of the classification on the
url column value.
val enrichedDF = inputDF.withColumn("semantic", udf(col("url")))

val outputDF = enrichedDF.select(col("*"),
col("semantic")(0).as("semantic1"), col("semantic")(1).as("semantic2"),
col("semantic")(2).as("semantic3")).drop("semantic")

// The udf will be executed 3 times on the entire dataset
outputDF.count()

By adding enrichedDF.persist() and unpersist() later on I was able to
quickly solve the issue but don't really like it, so I will probably work
directly with the RDD[Row] and schema maintenance in order to recreate
DataFrame.

Thanks again,
JP.



On Thu, Jan 21, 2016 at 10:45 AM, Cody Koeninger  wrote:

> If you can share an isolated example I'll take a look.  Not something I've
> run into before.
>
> On Wed, Jan 20, 2016 at 3:53 PM, jpocalan  wrote:
>
>> Hi,
>>
>> I have an application which creates a Kafka Direct Stream from 1 topic
>> having 5 partitions.
>> As a result each batch is composed of an RDD having 5 partitions.
>> In order to apply transformation to my batch I have decided to convert the
>> RDD to DataFrame (DF) so that I can easily add column to the initial DF by
>> using custom UDFs.
>>
>> Although, when I am applying any udf to the DF I am noticing that the udf
>> will get execute multiple times and this factor is driven by the number of
>> partitions.
>> For example, imagine I have a RDD with 10 records and 5 partitions ideally
>> my UDF should get called 10 times, although it gets consistently called 50
>> times, but the resulting DF is correct and when executing a count()
>> properly
>> return 10, as expected.
>>
>> I have changed my code to work directly with RDDs using mapPartitions and
>> the transformation gets called proper amount of time.
>>
>> As additional information, I have set spark.speculation to false and no
>> tasks failed.
>>
>> I am working on a smaller example that would isolate this potential issue,
>> but in the meantime I would like to know if somebody encountered this
>> issue.
>>
>> Thank you.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Problem-with-DataFrame-UDFs-tp26024.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
Can you provide a bit more information ?

command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?

Thanks

On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
wrote:

> Hey all,
>
> I am a CS student in the United States working on my senior thesis.
>
> My thesis uses Spark, and I am encountering some trouble.
>
> I am using https://github.com/alitouka/spark_dbscan, and to determine
> parameters, I am using the utility class they supply,
> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.
>
> I am on a 10 node cluster with one machine with 8 cores and 32G of memory
> and nine machines with 6 cores and 16G of memory.
>
> I have 442M of data, which seems like it would be a joke, but the job
> stalls at the last stage.
>
> It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a
> number of things for the last couple days, but nothing seems to be helping.
>
> I have tried:
> - Increasing heap sizes and numbers of cores
> - More/less executors with different amounts of resources.
> - Kryo Serialization
> - FAIR Scheduling
>
> It doesn’t seem like it should require this much. Any ideas?
>
> - Isaac


Re: Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
I think that it's that bug, because the error is the same. Thanks a lot.

2016-01-21 16:46 GMT+01:00 Guillermo Ortiz :

> I'm using 1.5.0 of Spark, confirmed, except for this
> jar: file:/opt/centralLogs/lib/spark-catalyst_2.10-1.5.1.jar.
>
> I'm going to keep looking. Thank you!
>
> 2016-01-21 16:29 GMT+01:00 Ted Yu :
>
>> Maybe this is related (fixed in 1.5.3):
>> SPARK-11195 Exception thrown on executor throws ClassNotFoundException on
>> driver
>>
>> FYI
>>
>> On Thu, Jan 21, 2016 at 7:10 AM, Guillermo Ortiz 
>> wrote:
>>
>>> I'm using CDH 5.5.1 with Spark 1.5.x (I think that it's 1.5.2).
>>>
>>> I know that the library is here:
>>> cloud-user@ose10kafkaelk:/opt/centralLogs/lib$ jar tf
>>> elasticsearch-hadoop-2.2.0-beta1.jar | grep
>>>  EsHadoopIllegalArgumentException
>>> org/elasticsearch/hadoop/EsHadoopIllegalArgumentException.class
>>>
>>> I have checked in SparkUI with the process running
>>> http://10.129.96.55:39320/jars/elasticsearch-hadoop-2.2.0-beta1.jar Added
>>> By User
>>> And spark.jars from SparkUI.
>>>
>>> .file:/opt/centralLogs/lib/elasticsearch-hadoop-2.2.0-beta1.jar,file:/opt/centralLogs/lib/geronimo-annotation_1.0_spec-1.1.1.jar,
>>>
>>> I think that in yarn-client although it has the error it doesn't stop
>>> the execution, but I don't know why.
>>>
>>>
>>>
>>> 2016-01-21 15:55 GMT+01:00 Ted Yu :
>>>
 Looks like jar containing EsHadoopIllegalArgumentException class
 wasn't in the classpath.
 Can you double check ?

 Which Spark version are you using ?

 Cheers

 On Thu, Jan 21, 2016 at 6:50 AM, Guillermo Ortiz 
 wrote:

> I'm running a Spark Streaming process and it stops after a while. It does
> some processing and inserts the result in ElasticSearch with its library.
> After a while the process fails.
>
> I have been checking the logs and I have seen this error
> 2016-01-21 14:57:54,388 [sparkDriver-akka.actor.default-dispatcher-17]
> INFO  org.apache.spark.storage.BlockManagerInfo - Added broadcast_2_piece0
> in memory on ose11kafkaelk.novalocal:46913 (size: 6.0 KB, free: 530.3 MB)
> 2016-01-21 14:57:54,646 [task-result-getter-1] INFO
>  org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 
> 2.0
> (TID 7) in 397 ms on ose12kafkaelk.novalocal (1/6)
> 2016-01-21 14:57:54,647 [task-result-getter-2] INFO
>  org.apache.spark.scheduler.TaskSetManager - Finished task 2.0 in stage 
> 2.0
> (TID 10) in 395 ms on ose12kafkaelk.novalocal (2/6)
> 2016-01-21 14:57:54,731 [task-result-getter-3] INFO
>  org.apache.spark.scheduler.TaskSetManager - Finished task 5.0 in stage 
> 2.0
> (TID 9) in 481 ms on ose11kafkaelk.novalocal (3/6)
> 2016-01-21 14:57:54,844 [task-result-getter-1] INFO
>  org.apache.spark.scheduler.TaskSetManager - Finished task 4.0 in stage 
> 2.0
> (TID 8) in 595 ms on ose10kafkaelk.novalocal (4/6)
> 2016-01-21 14:57:54,850 [task-result-getter-0] WARN
>  org.apache.spark.ThrowableSerializationWrapper - Task exception could not
> be deserialized
> java.lang.ClassNotFoundException:
> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
>
> I don't know why I'm getting this error because the class
> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException is in the 
> library
> of elasticSearch.
>
> After this error I get others error and finally Spark ends.
> 2016-01-21 14:57:55,012 [JobScheduler] INFO
>  org.apache.spark.streaming.scheduler.JobScheduler - Starting job 
> streaming
> job 145338464 ms.0 from job set of time 145338464 ms
> 2016-01-21 14:57:55,012 [JobScheduler] ERROR
> org.apache.spark.streaming.scheduler.JobScheduler - Error running job
> streaming job 1453384635000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in
> stage 2.0 (TID 13, ose11kafkaelk.novalocal): UnknownReason
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGSche

Re: Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
I'm using 1.5.0 of Spark, confirmed, except for this
jar: file:/opt/centralLogs/lib/spark-catalyst_2.10-1.5.1.jar.

I'm going to keep looking. Thank you!

2016-01-21 16:29 GMT+01:00 Ted Yu :

> Maybe this is related (fixed in 1.5.3):
> SPARK-11195 Exception thrown on executor throws ClassNotFoundException on
> driver
>
> FYI
>
> On Thu, Jan 21, 2016 at 7:10 AM, Guillermo Ortiz 
> wrote:
>
>> I'm using CDH 5.5.1 with Spark 1.5.x (I think that it's 1.5.2).
>>
>> I know that the library is here:
>> cloud-user@ose10kafkaelk:/opt/centralLogs/lib$ jar tf
>> elasticsearch-hadoop-2.2.0-beta1.jar | grep
>>  EsHadoopIllegalArgumentException
>> org/elasticsearch/hadoop/EsHadoopIllegalArgumentException.class
>>
>> I have checked in SparkUI with the process running
>> http://10.129.96.55:39320/jars/elasticsearch-hadoop-2.2.0-beta1.jar Added
>> By User
>> And spark.jars from SparkUI.
>>
>> .file:/opt/centralLogs/lib/elasticsearch-hadoop-2.2.0-beta1.jar,file:/opt/centralLogs/lib/geronimo-annotation_1.0_spec-1.1.1.jar,
>>
>> I think that in yarn-client although it has the error it doesn't stop the
>> execution, but I don't know why.
>>
>>
>>
>> 2016-01-21 15:55 GMT+01:00 Ted Yu :
>>
>>> Looks like jar containing EsHadoopIllegalArgumentException class wasn't
>>> in the classpath.
>>> Can you double check ?
>>>
>>> Which Spark version are you using ?
>>>
>>> Cheers
>>>
>>> On Thu, Jan 21, 2016 at 6:50 AM, Guillermo Ortiz 
>>> wrote:
>>>
 I'm running a Spark Streaming process and it stops after a while. It does
 some processing and inserts the result in ElasticSearch with its library.
 After a while the process fails.

 I have been checking the logs and I have seen this error
 2016-01-21 14:57:54,388 [sparkDriver-akka.actor.default-dispatcher-17]
 INFO  org.apache.spark.storage.BlockManagerInfo - Added broadcast_2_piece0
 in memory on ose11kafkaelk.novalocal:46913 (size: 6.0 KB, free: 530.3 MB)
 2016-01-21 14:57:54,646 [task-result-getter-1] INFO
  org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 2.0
 (TID 7) in 397 ms on ose12kafkaelk.novalocal (1/6)
 2016-01-21 14:57:54,647 [task-result-getter-2] INFO
  org.apache.spark.scheduler.TaskSetManager - Finished task 2.0 in stage 2.0
 (TID 10) in 395 ms on ose12kafkaelk.novalocal (2/6)
 2016-01-21 14:57:54,731 [task-result-getter-3] INFO
  org.apache.spark.scheduler.TaskSetManager - Finished task 5.0 in stage 2.0
 (TID 9) in 481 ms on ose11kafkaelk.novalocal (3/6)
 2016-01-21 14:57:54,844 [task-result-getter-1] INFO
  org.apache.spark.scheduler.TaskSetManager - Finished task 4.0 in stage 2.0
 (TID 8) in 595 ms on ose10kafkaelk.novalocal (4/6)
 2016-01-21 14:57:54,850 [task-result-getter-0] WARN
  org.apache.spark.ThrowableSerializationWrapper - Task exception could not
 be deserialized
 java.lang.ClassNotFoundException:
 org.elasticsearch.hadoop.EsHadoopIllegalArgumentException
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)

 I don't know why I'm getting this error because the class
 org.elasticsearch.hadoop.EsHadoopIllegalArgumentException is in the library
 of elasticSearch.

 After this error I get others error and finally Spark ends.
 2016-01-21 14:57:55,012 [JobScheduler] INFO
  org.apache.spark.streaming.scheduler.JobScheduler - Starting job streaming
 job 145338464 ms.0 from job set of time 145338464 ms
 2016-01-21 14:57:55,012 [JobScheduler] ERROR
 org.apache.spark.streaming.scheduler.JobScheduler - Error running job
 streaming job 1453384635000 ms.0
 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage
 2.0 (TID 13, ose11kafkaelk.novalocal): UnknownReason
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
 

Re: [Spark Streaming][Problem with DataFrame UDFs]

2016-01-21 Thread Cody Koeninger
If you can share an isolated example I'll take a look.  Not something I've
run into before.

On Wed, Jan 20, 2016 at 3:53 PM, jpocalan  wrote:

> Hi,
>
> I have an application which creates a Kafka Direct Stream from 1 topic
> having 5 partitions.
> As a result each batch is composed of an RDD having 5 partitions.
> In order to apply transformation to my batch I have decided to convert the
> RDD to DataFrame (DF) so that I can easily add column to the initial DF by
> using custom UDFs.
>
> Although, when I am applying any udf to the DF I am noticing that the udf
> will get execute multiple times and this factor is driven by the number of
> partitions.
> For example, imagine I have a RDD with 10 records and 5 partitions ideally
> my UDF should get called 10 times, although it gets consistently called 50
> times, but the resulting DF is correct and when executing a count()
> properly
> return 10, as expected.
>
> I have changed my code to work directly with RDDs using mapPartitions and
> the transformation gets called proper amount of time.
>
> As additional information, I have set spark.speculation to false and no
> tasks failed.
>
> I am working on a smaller example that would isolate this potential issue,
> but in the meantime I would like to know if somebody encountered this
> issue.
>
> Thank you.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Problem-with-DataFrame-UDFs-tp26024.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Number of executors in Spark - Kafka

2016-01-21 Thread Cody Koeninger
6 kafka partitions will result in 6 spark partitions, not 6 spark rdds.

The question of whether you will have a backlog isn't just a matter of
having 1 executor per partition.  If a single executor can process all of
the partitions fast enough to complete a batch in under the required time,
you won't have a backlog.
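
You can verify the partition count for yourself, e.g. (a sketch; "stream" is
whatever KafkaUtils.createDirectStream returned):

stream.foreachRDD { rdd =>
  println(s"partitions in this batch: ${rdd.partitions.size}")
}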

On Thu, Jan 21, 2016 at 5:35 AM, Guillermo Ortiz 
wrote:

>
> I'm using Spark Streaming and Kafka with the direct approach. I have created
> a topic with 6 partitions, so when I execute Spark there are six RDDs. I
> understand that ideally it should have six executors to process one RDD
> each. To do that, when I execute spark-submit (I use YARN) I set the
> number of executors to six.
> If I don't specify anything it just creates one executor. Looking for
> information I have read:
>
> "The --num-executors command-line flag or spark.executor.instances 
> configuration
> property control the number of executors requested. Starting in CDH
> 5.4/Spark 1.3, you will be able to avoid setting this property by turning
> on dynamic allocation with the
> spark.dynamicAllocation.enabled property. Dynamic allocation enables a
> Spark application to request executors when there is a backlog of pending
> tasks and free up executors when idle."
>
> I have this parameter enabled. I understand that if I don't set the
> parameter --num-executors it must create six executors, or am I wrong?
>


No plan for BroadcastHint when attempting broadcastjoin

2016-01-21 Thread Ted Yu
Modified subject to reflect new error encountered.

Interesting - SPARK-12275 is marked fixed against 1.6.0

On Thu, Jan 21, 2016 at 7:30 AM, Sebastian Piu 
wrote:

> I'm using Spark 1.6.0.
>
> I tried removing Kryo and reverting back to Java Serialisation, and get a
> different error which maybe points in the right direction...
>
> java.lang.AssertionError: assertion failed: No plan for BroadcastHint
> +- InMemoryRelation
> [tradeId#30,tradeVersion#31,agreement#49,counterParty#38], true, 1,
> StorageLevel(true, true, false, true, 1), Union,
> Some(ingest_all_union_trades)
>
> at scala.Predef$.assert(Predef.scala:179)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at
> org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:105)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:217)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:47)
> at
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:45)
> at
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:52)
> at
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:52)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:127)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:125)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:125)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:242)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
> at
> com.hsbc.rsl.spark.streaming.receiver.functions.PersistLevel3WithDat

10hrs of Scheduler Delay

2016-01-21 Thread Sanders, Isaac B
Hey all,

I am a CS student in the United States working on my senior thesis.

My thesis uses Spark, and I am encountering some trouble.

I am using https://github.com/alitouka/spark_dbscan, and to determine 
parameters, I am using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.

I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.

I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.

It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.

I have tried:
- Increasing heap sizes and numbers of cores
- More/less executors with different amounts of resources.
- Kyro Serialization
- FAIR Scheduling

It doesn’t seem like it should require this much. Any ideas?

- Isaac

Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Ted Yu
The exception below is at WARN level.
Can you check HDFS health?
Which Hadoop version are you using?

There should be some other fatal error if your job failed.

Cheers

On Thu, Jan 21, 2016 at 4:50 AM, Soni spark 
wrote:

> Hi,
>
> I am facing the below error msg now. Please help me.
>
> 2016-01-21 16:06:14,123 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /xxx.xx.xx.xx:50010 for block, add to deadNodes and continue.
> java.nio.channels.ClosedByInterruptException
> java.nio.channels.ClosedByInterruptException
> at
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:658)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
> at
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3101)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:670)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:337)
> at
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
> at
> org.apache.hadoop.hdfs.DFSInputStream.seekToBlockSource(DFSInputStream.java:1460)
> at
> org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:773)
> at
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:806)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:847)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:84)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:265)
> at
> org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
> at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
> at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Thanks
> Soniya
>
> On Thu, Jan 21, 2016 at 5:42 PM, Ted Yu  wrote:
>
>> Please also check AppMaster log.
>>
>> Thanks
>>
>> On Jan 21, 2016, at 3:51 AM, Akhil Das 
>> wrote:
>>
>> Can you look in the executor logs and see why the sparkcontext is being
>> shutdown? Similar discussion happened here previously.
>> http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-td23668.html
>>
>> Thanks
>> Best Regards
>>
>> On Thu, Jan 21, 2016 at 5:11 PM, Soni spark 
>> wrote:
>>
>>> Hi Friends,
>>>
>>> My spark job is successfully running in local mode but failing in cluster
>>> mode. Below is the error message I am getting. Can anyone help me?
>>>
>>>
>>>
>>> 16/01/21 16:38:07 INFO twitter4j.TwitterStreamImpl: Establishing connection.
>>> 16/01/21 16:38:07 INFO twitter.TwitterReceiver: Twitter receiver started
>>> 16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Called receiver 
>>> onStart
>>> 16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Waiting for 
>>> receiver to be stopped*16/01/21 16:38:10 ERROR yarn.ApplicationMaster: 
>>> RECEIVED SIGNAL 15: SIGTERM*
>>> 16/01/21 16:38:10 INFO streaming.StreamingContext: Invoking 
>>> stop(stopGracefully=false) from shutdown hook
>>> 16/01/21 16:38:10 INFO scheduler.ReceiverTracker: Sent stop signal to all 1 
>>> receivers
>>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Received stop signal
>>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopping receiver 
>>> with message: Stopped by driver:
>>> 16/01/21 16:38:10 INFO twitter.TwitterReceiver: Twitter receiver stopped
>>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Called receiver 
>>> onStop
>>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Deregistering 
>>> receiver 0*16/01/21 16:38:10 ERROR scheduler.ReceiverTracker: Deregistered 
>>> receiver for stream 0: Stopped by driver*

Re: java.lang.ArrayIndexOutOfBoundsException when attempting broadcastjoin

2016-01-21 Thread Sebastian Piu
I'm using Spark 1.6.0.

I tried removing Kryo and reverting back to Java Serialisation, and get a
different error which maybe points in the right direction...

java.lang.AssertionError: assertion failed: No plan for BroadcastHint
+- InMemoryRelation
[tradeId#30,tradeVersion#31,agreement#49,counterParty#38], true, 1,
StorageLevel(true, true, false, true, 1), Union,
Some(ingest_all_union_trades)

at scala.Predef$.assert(Predef.scala:179)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at
org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:105)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at
org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:217)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:47)
at
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:45)
at
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:52)
at
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:52)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:127)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:125)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:125)
at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:242)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at
com.hsbc.rsl.spark.streaming.receiver.functions.PersistLevel3WithDataframes.preAggregateL4(PersistLevel3WithDataframes.java:133)
at
com.hsbc.rsl.spark.streaming.receiver.functions.PersistLevel3WithDataframes.call(PersistLevel3WithDataframes.java:93)
at
com.hsbc.rsl.spark.streaming.receiver.functions.PersistLevel3WithDataframes.call(PersistLevel3WithDataframes.java:27)
at
org.apache.spark.streami

Re: Spark job stops after a while.

2016-01-21 Thread Ted Yu
Maybe this is related (fixed in 1.5.3):
SPARK-11195 Exception thrown on executor throws ClassNotFoundException on
driver

FYI

On Thu, Jan 21, 2016 at 7:10 AM, Guillermo Ortiz 
wrote:

> I'm using CDH 5.5.1 with Spark 1.5.x (I think that it's 1.5.2).
>
> I know that the library is here:
> cloud-user@ose10kafkaelk:/opt/centralLogs/lib$ jar tf
> elasticsearch-hadoop-2.2.0-beta1.jar | grep
>  EsHadoopIllegalArgumentException
> org/elasticsearch/hadoop/EsHadoopIllegalArgumentException.class
>
> I have checked in SparkUI with the process running
> http://10.129.96.55:39320/jars/elasticsearch-hadoop-2.2.0-beta1.jar Added
> By User
> And spark.jars from SparkUI.
>
> .file:/opt/centralLogs/lib/elasticsearch-hadoop-2.2.0-beta1.jar,file:/opt/centralLogs/lib/geronimo-annotation_1.0_spec-1.1.1.jar,
>
> I think that in yarn-client mode, although it has the error, it doesn't stop the
> execution, but I don't know why.
>
>
>
> 2016-01-21 15:55 GMT+01:00 Ted Yu :
>
>> Looks like jar containing EsHadoopIllegalArgumentException class wasn't
>> in the classpath.
>> Can you double check ?
>>
>> Which Spark version are you using ?
>>
>> Cheers
>>
>> On Thu, Jan 21, 2016 at 6:50 AM, Guillermo Ortiz 
>> wrote:
>>
>>> I'm running a Spark Streaming process and it stops after a while. It does
>>> some processing and inserts the result in ElasticSearch with its library. After a
>>> while the process fails.
>>>
>>> I have been checking the logs and I have seen this error
>>> 2016-01-21 14:57:54,388 [sparkDriver-akka.actor.default-dispatcher-17]
>>> INFO  org.apache.spark.storage.BlockManagerInfo - Added broadcast_2_piece0
>>> in memory on ose11kafkaelk.novalocal:46913 (size: 6.0 KB, free: 530.3 MB)
>>> 2016-01-21 14:57:54,646 [task-result-getter-1] INFO
>>>  org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 2.0
>>> (TID 7) in 397 ms on ose12kafkaelk.novalocal (1/6)
>>> 2016-01-21 14:57:54,647 [task-result-getter-2] INFO
>>>  org.apache.spark.scheduler.TaskSetManager - Finished task 2.0 in stage 2.0
>>> (TID 10) in 395 ms on ose12kafkaelk.novalocal (2/6)
>>> 2016-01-21 14:57:54,731 [task-result-getter-3] INFO
>>>  org.apache.spark.scheduler.TaskSetManager - Finished task 5.0 in stage 2.0
>>> (TID 9) in 481 ms on ose11kafkaelk.novalocal (3/6)
>>> 2016-01-21 14:57:54,844 [task-result-getter-1] INFO
>>>  org.apache.spark.scheduler.TaskSetManager - Finished task 4.0 in stage 2.0
>>> (TID 8) in 595 ms on ose10kafkaelk.novalocal (4/6)
>>> 2016-01-21 14:57:54,850 [task-result-getter-0] WARN
>>>  org.apache.spark.ThrowableSerializationWrapper - Task exception could not
>>> be deserialized
>>> java.lang.ClassNotFoundException:
>>> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>>
>>> I don't know why I'm getting this error because the class
>>> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException is in the library
>>> of elasticSearch.
>>>
>>> After this error I get other errors and finally Spark ends.
>>> 2016-01-21 14:57:55,012 [JobScheduler] INFO
>>>  org.apache.spark.streaming.scheduler.JobScheduler - Starting job streaming
>>> job 145338464 ms.0 from job set of time 145338464 ms
>>> 2016-01-21 14:57:55,012 [JobScheduler] ERROR
>>> org.apache.spark.streaming.scheduler.JobScheduler - Error running job
>>> streaming job 1453384635000 ms.0
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage
>>> 2.0 (TID 13, ose11kafkaelk.novalocal): UnknownReason
>>> Driver stacktrace:
>>> at org.apache.spark.scheduler.DAGScheduler.org
>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>>> at scala.Option.foreach(Option.scala:236)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458

Re: java.lang.ArrayIndexOutOfBoundsException when attempting broadcastjoin

2016-01-21 Thread Ted Yu
You were using Kryo serialization ?

If you switch to Java serialization, your job should run fine.
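
A minimal sketch of that change on the conf side, assuming Kryo had been enabled through spark.serializer (removing that setting, or pointing it back at the Java serializer, amounts to the same thing); the app name is illustrative.

  from pyspark import SparkConf, SparkContext

  # Java serialization is the default; setting it explicitly just makes the
  # switch away from Kryo obvious.
  conf = (SparkConf()
          .setAppName("broadcast-join-job")
          .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer"))
  sc = SparkContext(conf=conf)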

Which Spark release are you using ?

Thanks

On Thu, Jan 21, 2016 at 6:59 AM, sebastian.piu 
wrote:

> Hi all,
>
> I'm trying to work out a problem when using Spark Streaming, currently I
> have the following piece of code inside a foreachRDD call:
>
> Dataframe results = ... //some dataframe created from the incoming rdd -
> moderately big, I don't want this to be shuffled
> DataFrame t = sqlContext.table("a_temp_cached_table");  //a very small
> table
> - that might mutate over time
> DataFrame x = results.join(*broadcast(t)*, JOIN_COLUMNS_SEQ)
> .groupBy("column1").count()
> .write()
> .mode(SaveMode.Append)
> .save("/some-path/");
>
> The intention of the code above is to distribute the "t" dataframe if
> required, but avoid shuffling the "results".
>
> This works fine when run on the scala-shell / spark-submit, but when run from
> within my executors I get the exception below...
>
> Any thoughts? If I remove the *broadcast(t)* then it works fine, but then my
> big table is shuffled around.
>
> 2016-01-21 14:47:00 ERROR JobScheduler:95 - Error running job streaming job
> 145338756 ms.0
> java.lang.ArrayIndexOutOfBoundsException: 8388607
> at
>
> com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:345)
> at
>
> com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:47)
> at com.esotericsoftware.kryo.Kryo.reset(Kryo.java:804)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:570)
> at
>
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:194)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:203)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
> at
>
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at
>
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
> at
>
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:91)
> at
>
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
> at
>
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
> at
>
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
> at
>
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
> at
>
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2016-01-21 14:47:00 ERROR RslApp:60 - ERROR executing the application
> java.lang.ArrayIndexOutOfBoundsException: 8388607
> at
>
> com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:345)
> at
>
> com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:47)
> at com.esotericsoftware.kryo.Kryo.reset(Kryo.java:804)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:570)
> at
>
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:194)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:203)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
> at
>
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at
>
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
> at
>
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:91)
> at
>
> org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
> at
>
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExe

Re: Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
I'm using CDH 5.5.1 with Spark 1.5.x (I think that it's 1.5.2).

I know that the library is here:
cloud-user@ose10kafkaelk:/opt/centralLogs/lib$ jar tf
elasticsearch-hadoop-2.2.0-beta1.jar | grep
 EsHadoopIllegalArgumentException
org/elasticsearch/hadoop/EsHadoopIllegalArgumentException.class

I have checked in SparkUI with the process running
http://10.129.96.55:39320/jars/elasticsearch-hadoop-2.2.0-beta1.jar Added
By User
And spark.jars from SparkUI.
.file:/opt/centralLogs/lib/elasticsearch-hadoop-2.2.0-beta1.jar,file:/opt/centralLogs/lib/geronimo-annotation_1.0_spec-1.1.1.jar,

I think that in yarn-client mode, although it has the error, it doesn't stop the
execution, but I don't know why.



2016-01-21 15:55 GMT+01:00 Ted Yu :

> Looks like jar containing EsHadoopIllegalArgumentException class wasn't
> in the classpath.
> Can you double check ?
>
> Which Spark version are you using ?
>
> Cheers
>
> On Thu, Jan 21, 2016 at 6:50 AM, Guillermo Ortiz 
> wrote:
>
>> I'm running a Spark Streaming process and it stops after a while. It does
>> some processing and inserts the result in ElasticSearch with its library. After a
>> while the process fails.
>>
>> I have been checking the logs and I have seen this error
>> 2016-01-21 14:57:54,388 [sparkDriver-akka.actor.default-dispatcher-17]
>> INFO  org.apache.spark.storage.BlockManagerInfo - Added broadcast_2_piece0
>> in memory on ose11kafkaelk.novalocal:46913 (size: 6.0 KB, free: 530.3 MB)
>> 2016-01-21 14:57:54,646 [task-result-getter-1] INFO
>>  org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 2.0
>> (TID 7) in 397 ms on ose12kafkaelk.novalocal (1/6)
>> 2016-01-21 14:57:54,647 [task-result-getter-2] INFO
>>  org.apache.spark.scheduler.TaskSetManager - Finished task 2.0 in stage 2.0
>> (TID 10) in 395 ms on ose12kafkaelk.novalocal (2/6)
>> 2016-01-21 14:57:54,731 [task-result-getter-3] INFO
>>  org.apache.spark.scheduler.TaskSetManager - Finished task 5.0 in stage 2.0
>> (TID 9) in 481 ms on ose11kafkaelk.novalocal (3/6)
>> 2016-01-21 14:57:54,844 [task-result-getter-1] INFO
>>  org.apache.spark.scheduler.TaskSetManager - Finished task 4.0 in stage 2.0
>> (TID 8) in 595 ms on ose10kafkaelk.novalocal (4/6)
>> 2016-01-21 14:57:54,850 [task-result-getter-0] WARN
>>  org.apache.spark.ThrowableSerializationWrapper - Task exception could not
>> be deserialized
>> java.lang.ClassNotFoundException:
>> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> I don't know why I'm getting this error because the class
>> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException is in the library
>> of elasticSearch.
>>
>> After this error I get other errors and finally Spark ends.
>> 2016-01-21 14:57:55,012 [JobScheduler] INFO
>>  org.apache.spark.streaming.scheduler.JobScheduler - Starting job streaming
>> job 145338464 ms.0 from job set of time 145338464 ms
>> 2016-01-21 14:57:55,012 [JobScheduler] ERROR
>> org.apache.spark.streaming.scheduler.JobScheduler - Error running job
>> streaming job 1453384635000 ms.0
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
>> in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage
>> 2.0 (TID 13, ose11kafkaelk.novalocal): UnknownReason
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
>> at org.apache

RE: Spark SQL . How to enlarge output rows ?

2016-01-21 Thread Spencer, Alex (Santander)
I forgot to add this is (I think) from 1.5.0.

And yeah, that looks like Python. I'm not hot with Python, but it may need to be
capitalised as False or FALSE?
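
For reference, a minimal PySpark sketch, assuming Spark 1.5+ where the Python show() accepts a truncate argument:

  # In Python the boolean is capitalised, and the first argument is the row count.
  sqlContext.sql("select day_time from my_table limit 10").show(10, truncate=False)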


From: Eli Super [mailto:eli.su...@gmail.com]
Sent: 21 January 2016 14:48
To: Spencer, Alex (Santander)
Cc: user@spark.apache.org
Subject: Re: Spark SQL . How to enlarge output rows ?

Thanks Alex

I get NameError

NameError: name 'false' is not defined

Is it because of PySpark ?



On Thu, Jan 14, 2016 at 3:34 PM, Spencer, Alex (Santander) 
mailto:alex.spen...@santander.co.uk>> wrote:
Hi,

Try …..show(false)

public void show(int numRows,
boolean truncate)


Kind Regards,
Alex.

From: Eli Super [mailto:eli.su...@gmail.com]
Sent: 14 January 2016 13:09
To: user@spark.apache.org
Subject: Spark SQL . How to enlarge output rows ?


Hi

After executing sql

sqlContext.sql("select day_time from my_table limit 10").show()

my output looks like  :

++

|  day_time|

++

|2015/12/15 15:52:...|

|2015/12/15 15:53:...|

|2015/12/15 15:52:...|

|2015/12/15 15:52:...|

|2015/12/15 15:52:...|

|2015/12/15 15:52:...|

|2015/12/15 15:51:...|

|2015/12/15 15:52:...|

|2015/12/15 15:52:...|

|2015/12/15 15:53:...|

++



I'd like to get full rows

Thanks !



java.lang.ArrayIndexOutOfBoundsException when attempting broadcastjoin

2016-01-21 Thread sebastian.piu
Hi all,

I'm trying to work out a problem when using Spark Streaming, currently I
have the following piece of code inside a foreachRDD call:

Dataframe results = ... //some dataframe created from the incoming rdd -
moderately big, I don't want this to be shuffled
DataFrame t = sqlContext.table("a_temp_cached_table");  //a very small table
- that might mutate over time
DataFrame x = results.join(*broadcast(t)*, JOIN_COLUMNS_SEQ) 
.groupBy("column1").count()
.write()
.mode(SaveMode.Append)
.save("/some-path/");

The intention of the code above is to distribute the "t" dataframe if
required, but avoid shuffling the "results".

This works fine when run on the scala-shell / spark-submit, but when run from
within my executors I get the exception below...

Any thoughts? If I remove the *broadcast(t)* then it works fine, but then my
big table is shuffled around.

2016-01-21 14:47:00 ERROR JobScheduler:95 - Error running job streaming job
145338756 ms.0
java.lang.ArrayIndexOutOfBoundsException: 8388607
at
com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:345)
at
com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:47)
at com.esotericsoftware.kryo.Kryo.reset(Kryo.java:804)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:570)
at
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:194)
at
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:203)
at
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
at
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:91)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
at
org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-01-21 14:47:00 ERROR RslApp:60 - ERROR executing the application
java.lang.ArrayIndexOutOfBoundsException: 8388607
at
com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:345)
at
com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:47)
at com.esotericsoftware.kryo.Kryo.reset(Kryo.java:804)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:570)
at
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:194)
at
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:203)
at
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
at
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:91)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79)
at
org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Fu

Re: cast column string -> timestamp in Parquet file

2016-01-21 Thread Muthu Jayakumar
DataFrame and udf. This may be more performant than doing an RDD
transformation, as you'll transform only the column that needs to be
changed.

Hope this helps.
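
A minimal PySpark sketch of that idea, assuming the strings follow a yyyy/MM/dd HH:mm:ss pattern; column and path names are illustrative.

  from datetime import datetime
  from pyspark.sql.functions import udf
  from pyspark.sql.types import TimestampType

  df = sqlContext.read.parquet("/path/to/input.parquet")

  # Parse the string column into a proper timestamp; only this column changes.
  to_ts = udf(lambda s: datetime.strptime(s, "%Y/%m/%d %H:%M:%S") if s else None,
              TimestampType())

  df.withColumn("day_time", to_ts(df["day_time"])) \
    .write.parquet("/path/to/output.parquet")

On Spark 1.5+ the built-in unix_timestamp(col, format).cast("timestamp") should be able to play the same role without a Python udf.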


On Thu, Jan 21, 2016 at 6:17 AM, Eli Super  wrote:

> Hi
>
> I have a large parquet file.
>
> I need to cast the whole column to timestamp format, then save.
>
> What is the right way to do it?
>
> Thanks a lot
>
>


Re: Spark job stops after a while.

2016-01-21 Thread Ted Yu
Looks like jar containing EsHadoopIllegalArgumentException class wasn't in
the classpath.
Can you double check ?

Which Spark version are you using ?

Cheers

On Thu, Jan 21, 2016 at 6:50 AM, Guillermo Ortiz 
wrote:

> I'm running a Spark Streaming process and it stops after a while. It does
> some processing and inserts the result in ElasticSearch with its library. After a
> while the process fails.
>
> I have been checking the logs and I have seen this error
> 2016-01-21 14:57:54,388 [sparkDriver-akka.actor.default-dispatcher-17]
> INFO  org.apache.spark.storage.BlockManagerInfo - Added broadcast_2_piece0
> in memory on ose11kafkaelk.novalocal:46913 (size: 6.0 KB, free: 530.3 MB)
> 2016-01-21 14:57:54,646 [task-result-getter-1] INFO
>  org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 2.0
> (TID 7) in 397 ms on ose12kafkaelk.novalocal (1/6)
> 2016-01-21 14:57:54,647 [task-result-getter-2] INFO
>  org.apache.spark.scheduler.TaskSetManager - Finished task 2.0 in stage 2.0
> (TID 10) in 395 ms on ose12kafkaelk.novalocal (2/6)
> 2016-01-21 14:57:54,731 [task-result-getter-3] INFO
>  org.apache.spark.scheduler.TaskSetManager - Finished task 5.0 in stage 2.0
> (TID 9) in 481 ms on ose11kafkaelk.novalocal (3/6)
> 2016-01-21 14:57:54,844 [task-result-getter-1] INFO
>  org.apache.spark.scheduler.TaskSetManager - Finished task 4.0 in stage 2.0
> (TID 8) in 595 ms on ose10kafkaelk.novalocal (4/6)
> 2016-01-21 14:57:54,850 [task-result-getter-0] WARN
>  org.apache.spark.ThrowableSerializationWrapper - Task exception could not
> be deserialized
> java.lang.ClassNotFoundException:
> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
>
> I don't know why I'm getting this error because the class
> org.elasticsearch.hadoop.EsHadoopIllegalArgumentException is in the library
> of elasticSearch.
>
> After this error I get other errors and finally Spark ends.
> 2016-01-21 14:57:55,012 [JobScheduler] INFO
>  org.apache.spark.streaming.scheduler.JobScheduler - Starting job streaming
> job 145338464 ms.0 from job set of time 145338464 ms
> 2016-01-21 14:57:55,012 [JobScheduler] ERROR
> org.apache.spark.streaming.scheduler.JobScheduler - Error running job
> streaming job 1453384635000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 2.0 (TID 13, ose11kafkaelk.novalocal): UnknownReason
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914)
> at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:67)
> at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:54)
> at org.elasticsearch.spark.rdd.EsSpark$.saveJsonToEs(EsSpark.scala:90)
> at
> org.elasticsearch.spark.package$SparkJsonRDDFunctions.saveJsonToEs(package.scala:44)
> at
> produban.spark.CentralLog$$anonfun$createContext$1.apply(CentralLog.scala:56)
> at
> produban.spark.CentralLog$$anonfun$createContext$1.apply(CentralLog.scala:33)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun

Re: Spark SQL . How to enlarge output rows ?

2016-01-21 Thread Eli Super
Thanks Alex

I get NameError

NameError: name 'false' is not defined

Is it because of PySpark ?



On Thu, Jan 14, 2016 at 3:34 PM, Spencer, Alex (Santander) <
alex.spen...@santander.co.uk> wrote:

> Hi,
>
>
>
> Try …..show(*false*)
>
>
>
> public void show(int numRows,
>
> boolean truncate)
>
>
>
>
>
> Kind Regards,
>
> Alex.
>
>
>
> *From:* Eli Super [mailto:eli.su...@gmail.com]
> *Sent:* 14 January 2016 13:09
> *To:* user@spark.apache.org
> *Subject:* Spark SQL . How to enlarge output rows ?
>
>
>
>
>
> Hi
>
>
>
> After executing sql
>
>
>
> sqlContext.sql("select day_time from my_table limit 10").show()
>
>
>
> my output looks like  :
>
> ++
>
> |  day_time|
>
> ++
>
> |2015/12/15 15:52:...|
>
> |2015/12/15 15:53:...|
>
> |2015/12/15 15:52:...|
>
> |2015/12/15 15:52:...|
>
> |2015/12/15 15:52:...|
>
> |2015/12/15 15:52:...|
>
> |2015/12/15 15:51:...|
>
> |2015/12/15 15:52:...|
>
> |2015/12/15 15:52:...|
>
> |2015/12/15 15:53:...|
>
> ++
>
>
>
> I'd like to get full rows
>
> Thanks !
>


Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
I'm running a Spark Streaming process and it stops after a while. It does some
processing and inserts the result in ElasticSearch with its library. After a
while the process fails.

I have been checking the logs and I have seen this error
2016-01-21 14:57:54,388 [sparkDriver-akka.actor.default-dispatcher-17] INFO
 org.apache.spark.storage.BlockManagerInfo - Added broadcast_2_piece0 in
memory on ose11kafkaelk.novalocal:46913 (size: 6.0 KB, free: 530.3 MB)
2016-01-21 14:57:54,646 [task-result-getter-1] INFO
 org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 2.0
(TID 7) in 397 ms on ose12kafkaelk.novalocal (1/6)
2016-01-21 14:57:54,647 [task-result-getter-2] INFO
 org.apache.spark.scheduler.TaskSetManager - Finished task 2.0 in stage 2.0
(TID 10) in 395 ms on ose12kafkaelk.novalocal (2/6)
2016-01-21 14:57:54,731 [task-result-getter-3] INFO
 org.apache.spark.scheduler.TaskSetManager - Finished task 5.0 in stage 2.0
(TID 9) in 481 ms on ose11kafkaelk.novalocal (3/6)
2016-01-21 14:57:54,844 [task-result-getter-1] INFO
 org.apache.spark.scheduler.TaskSetManager - Finished task 4.0 in stage 2.0
(TID 8) in 595 ms on ose10kafkaelk.novalocal (4/6)
2016-01-21 14:57:54,850 [task-result-getter-0] WARN
 org.apache.spark.ThrowableSerializationWrapper - Task exception could not
be deserialized
java.lang.ClassNotFoundException:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)

I don't know why I'm getting this error because the class
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException is in the library
of elasticSearch.

After this error I get other errors and finally Spark ends.
2016-01-21 14:57:55,012 [JobScheduler] INFO
 org.apache.spark.streaming.scheduler.JobScheduler - Starting job streaming
job 145338464 ms.0 from job set of time 145338464 ms
2016-01-21 14:57:55,012 [JobScheduler] ERROR
org.apache.spark.streaming.scheduler.JobScheduler - Error running job
streaming job 1453384635000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage
2.0 (TID 13, ose11kafkaelk.novalocal): UnknownReason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914)
at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:67)
at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:54)
at org.elasticsearch.spark.rdd.EsSpark$.saveJsonToEs(EsSpark.scala:90)
at
org.elasticsearch.spark.package$SparkJsonRDDFunctions.saveJsonToEs(package.scala:44)
at
produban.spark.CentralLog$$anonfun$createContext$1.apply(CentralLog.scala:56)
at
produban.spark.CentralLog$$anonfun$createContext$1.apply(CentralLog.scala:33)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scal

Re: Spark 1.6 ignoreNulls in first/last aggregate functions

2016-01-21 Thread emlyn
Turns out I can't use a user defined aggregate function, as they are not
supported in Window operations. There surely must be some way to do a
last_value with ignoreNulls enabled in Spark 1.6? Any ideas for workarounds?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-ignoreNulls-in-first-last-aggregate-functions-tp26031p26033.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Passing binding variable in query used in Data Source API

2016-01-21 Thread Todd Nist
Hi Satish,

You should be able to do something like this:

   val props = new java.util.Properties()
   props.put("user", username)
   props.put("password", pwd)
   props.put("driver", "org.postgresql.Driver")
   val deptNo = 10
   val where = Some(s"dept_number = $deptNo")
   // read.jdbc(url, table, predicates, connectionProperties)
   val df = sqlContext.read.jdbc(
     "jdbc:postgresql://10.00.00.000:5432/db_test?user=username&password=password",
     "schema.table1", Array(where.getOrElse("")), props)

or just add the filter to your query like this, and I believe it should
get pushed down.

  val df = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://10.00.00.000:5432/db_test?user=username&password=password")
    .option("user", username)
    .option("password", pwd)
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "schema.table1")
    .load()
    .filter('dept_number === deptNo)

This is from the top of my head and the code has not been tested or
compiled.

HTH.

-Todd


On Thu, Jan 21, 2016 at 6:02 AM, satish chandra j 
wrote:

> Hi All,
>
> We have a requirement to fetch data from a source PostgreSQL database as per a
> condition, hence we need to pass a binding variable in the query used in the Data
> Source API as below:
>
>
> var DeptNbr = 10
>
> val dataSource_dF = cc.load("jdbc", Map(
>   "url" -> "jdbc:postgresql://10.00.00.000:5432/db_test?user=username&password=password",
>   "driver" -> "org.postgresql.Driver",
>   "dbtable" -> "(select * from schema.table1 where dept_number=DeptNbr) as table1"))
>
>
> But it errors saying expected ';' but found '='
>
>
> Note: As it is an iterative approach we cannot use constants, but need
> to pass a variable to the query
>
>
> If anybody has had a similar implementation passing a binding variable while
> fetching data from a source database using the Data Source API, then please provide
> details on the same
>
>
> Regards,
>
> Satish Chandra
>


cast column string -> timestamp in Parquet file

2016-01-21 Thread Eli Super
Hi

I have a large parquet file.

I need to cast the whole column to timestamp format, then save.

What is the right way to do it?

Thanks a lot


Re: Client versus cluster mode

2016-01-21 Thread Manoj Awasthi
The only difference is that in yarn-cluster mode your driver runs within a
yarn container (called AM or application master).

You would want to run your production jobs in yarn-cluster mode, while for a
development environment yarn-client mode may do. Again, I think this is
just a recommendation and not a mandate.
On 21-Jan-2016 7:23 pm, "Afshartous, Nick"  wrote:

>
> Hi,
>
>
> In an AWS EMR/Spark 1.5 cluster we're launching a streaming job from the
> driver node.  Would it make any sense in this case to use cluster mode ?
> More specifically would there be any benefit that YARN would provide when
> using cluster but not client mode ?
>
>
> Thanks,
>
> --
>
> Nick
>


Client versus cluster mode

2016-01-21 Thread Afshartous, Nick

Hi,


In an AWS EMR/Spark 1.5 cluster we're launching a streaming job from the driver 
node.  Would it make any sense in this case to use cluster mode ?  More 
specifically would there be any benefit that YARN would provide when using 
cluster but not client mode ?


Thanks,

--

Nick


Spark Yarn executor memory overhead content

2016-01-21 Thread Olivier Devoisin
Hello,

In some of our spark applications, when writing outputs to hdfs we
encountered an error about the spark yarn executor memory overhead :

WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory
limits. 3.0 GB of 3 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.

When setting a higher overhead everything works correctly, but we'd like
to know what's in this overhead in order to fix/change our code.

The documentation states that it contains VM overheads, interned strings
and other native overheads. However it's really vague.

We considered two things to be the problem :
- Interned strings, however since Java 7 it's not in the overhead anymore
- Buffers used when writting to HDFS since we use Hadoop Multiple Outputs.

So if someone could enlighten us about what's in this overhead, it could be
very helpful.
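
For reference, the overhead can be raised per application through the property
named in the warning; a minimal sketch, with an arbitrary placeholder value of
1024 MB:

import org.apache.spark.{SparkConf, SparkContext}

// Give each executor an extra 1024 MB of off-heap headroom on YARN
// (placeholder value; tune it to the observed overshoot).
val conf = new SparkConf()
  .setAppName("OverheadExample")
  .set("spark.yarn.executor.memoryOverhead", "1024")
val sc = new SparkContext(conf)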

Regards,

-- 


*Olivier Devoisin*
Data Infrastructure Engineer
olivier.devoi...@contentsquare.com
http://www.contentsquare.com
50 Avenue Montaigne - 75008 Paris


How to setup a long running spark streaming job with continuous window refresh

2016-01-21 Thread Santoshakhilesh
Hi,
I have the following scenario in my project:
1. I will continue to get a stream of data from a source.
2. I need to calculate the mean and variance for each key every minute.
3. After the minute is over I should start over, computing fresh values for the new minute.
Example:
10:00:00 computation and output
10:00:00 key = 1, mean = 10, variance = 2
10:00:00 key = N, mean = 10, variance = 2
10:00:01 computation and output
10:00:00 key = 1, mean = 11, variance = 2
10:00:00 key = N, mean = 12, variance = 2
The 10:00:01 data has no dependency on the 10:00:00 data.
How do I set up such a job in a single Java Spark Streaming application?
Regards,
Santosh Akhilesh
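
A minimal sketch of one way to structure this, shown in Scala for brevity (the
Java pair-DStream API has equivalent operations). It is untested; the 1-minute
batch interval itself acts as the non-overlapping window, and the socket
source, host and port are hypothetical placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object PerMinuteStats {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PerMinuteStats")
    // A 1-minute batch interval makes every batch an independent, non-overlapping
    // window, so the 10:01 output carries no state from 10:00.
    val ssc = new StreamingContext(conf, Minutes(1))

    // Hypothetical source: lines of "key,value" arriving on a socket.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(a => (a(0), a(1).toDouble))

    // Per key and per batch: count, sum and sum of squares, reduced within the batch only.
    val stats = pairs
      .mapValues(v => (1L, v, v * v))
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
      .mapValues { case (n, s, s2) =>
        val mean = s / n
        val variance = s2 / n - mean * mean
        (mean, variance)
      }

    stats.print()
    ssc.start()
    ssc.awaitTermination()
  }
}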



question about query SparkSQL

2016-01-21 Thread Eli Super
Hi

I try to save parts of large table as csv files

I use following commands :

sqlContext.sql("select * from my_table where trans_time between '2015/12/18
12:00' and '2015/12/18
12:06'").write.format("com.databricks.spark.csv").option("header",
"false").save('00_06')

and

sqlContext.sql("select * from my_table where trans_time between '2015/12/18
12:07' and '2015/12/18
12:12'").write.format("com.databricks.spark.csv").option("header",
"false").save('07_12')

The question is how to avoid a one-minute gap between the two files.

Thanks !
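
One common way to avoid both gaps and overlaps is to use half-open ranges
(>= start, < end) so that each slice's end is exactly the next slice's start;
a minimal sketch in Scala syntax (untested, assuming the same table and output
options as above):

// First slice covers [12:00, 12:06), the second [12:06, 12:12): no gap, no overlap.
sqlContext.sql("select * from my_table where trans_time >= '2015/12/18 12:00' and trans_time < '2015/12/18 12:06'")
  .write.format("com.databricks.spark.csv").option("header", "false").save("00_06")

sqlContext.sql("select * from my_table where trans_time >= '2015/12/18 12:06' and trans_time < '2015/12/18 12:12'")
  .write.format("com.databricks.spark.csv").option("header", "false").save("06_12")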


Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Soni spark
Hi,

I am now facing the error message below. Please help me.

2016-01-21 16:06:14,123 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
connect to /xxx.xx.xx.xx:50010 for block, add to deadNodes and continue.
java.nio.channels.ClosedByInterruptException
java.nio.channels.ClosedByInterruptException
at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:658)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at
org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3101)
at
org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)
at
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:670)
at
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:337)
at
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
at
org.apache.hadoop.hdfs.DFSInputStream.seekToBlockSource(DFSInputStream.java:1460)
at
org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:773)
at
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:806)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:847)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:84)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:265)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Thanks
Soniya

On Thu, Jan 21, 2016 at 5:42 PM, Ted Yu  wrote:

> Please also check AppMaster log.
>
> Thanks
>
> On Jan 21, 2016, at 3:51 AM, Akhil Das  wrote:
>
> Can you look in the executor logs and see why the sparkcontext is being
> shutdown? Similar discussion happened here previously.
> http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-td23668.html
>
> Thanks
> Best Regards
>
> On Thu, Jan 21, 2016 at 5:11 PM, Soni spark 
> wrote:
>
>> Hi Friends,
>>
>> I spark job is successfully running on local mode but failing on cluster 
>> mode. Below is the error message i am getting. anyone can help me.
>>
>>
>>
>> 16/01/21 16:38:07 INFO twitter4j.TwitterStreamImpl: Establishing connection.
>> 16/01/21 16:38:07 INFO twitter.TwitterReceiver: Twitter receiver started
>> 16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Called receiver 
>> onStart
>> 16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Waiting for receiver 
>> to be stopped*16/01/21 16:38:10 ERROR yarn.ApplicationMaster: RECEIVED 
>> SIGNAL 15: SIGTERM*
>> 16/01/21 16:38:10 INFO streaming.StreamingContext: Invoking 
>> stop(stopGracefully=false) from shutdown hook
>> 16/01/21 16:38:10 INFO scheduler.ReceiverTracker: Sent stop signal to all 1 
>> receivers
>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Received stop signal
>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopping receiver 
>> with message: Stopped by driver:
>> 16/01/21 16:38:10 INFO twitter.TwitterReceiver: Twitter receiver stopped
>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Called receiver 
>> onStop
>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Deregistering 
>> receiver 0*16/01/21 16:38:10 ERROR scheduler.ReceiverTracker: Deregistered 
>> receiver for stream 0: Stopped by driver*
>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopped receiver 0
>> 16/01/21 16:38:10 INFO receiver.BlockGenerator: Stopping BlockGenerator
>> 16/01/21 16:38:10 INFO yarn.ApplicationMaster: Waiting for spark context 
>> initialization ...
>>
>> Thanks
>>
>> Soniya
>>
>>
>


Spark 1.6 ignoreNulls in first/last aggregate functions

2016-01-21 Thread emlyn
As I understand it, Spark 1.6 changes the behaviour of the first and last
aggregate functions to take nulls into account (where they were ignored in
1.5). From SQL you can use "IGNORE NULLS" to get the old behaviour back.
How do I ignore nulls from the Java API? I can't see any way to pass that
option, so I suspect I may need to write a user defined aggregate function,
but I'd prefer not to if possible.

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-ignoreNulls-in-first-last-aggregate-functions-tp26031.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Concurrent Spark jobs

2016-01-21 Thread emlyn
Thanks for the responses (not sure why they aren't showing up on the list).

Michael wrote:
> The JDBC wrapper for Redshift should allow you to follow these
> instructions. Let me know if you run into any more issues. 
> http://apache-spark-user-list.1001560.n3.nabble.com/best-practices-for-pushing-an-RDD-into-a-database-td2681.html

I'm not sure that this solves my problem - if I understand it correctly,
this is to split a database write over multiple concurrent connections (one
from each partition), whereas what I want is to allow other tasks to
continue running on the cluster while the the write to Redshift is taking
place.
Also I don't think it's good practice to load data into Redshift with INSERT
statements over JDBC - it is recommended to use the bulk load commands that
can analyse the data and automatically set appropriate compression etc on
the table.


Rajesh wrote:
> Just a thought. Can we use Spark Job Server and trigger jobs through rest
> apis. In this case, all jobs will share same context and run the jobs
> parallel.
> If any one has other thoughts please share

I'm not sure this would work in my case as they are not completely separate
jobs, but just different outputs to Redshift, that share intermediate
results. Running them as completely separate jobs would mean recalculating
the intermediate results for each output. I suppose it might be possible to
persist the intermediate results somewhere, and then delete them once all
the jobs have run, but that is starting to add a lot of complication which
I'm not sure is justified.


Maybe some pseudocode would help clarify things, so here is a very
simplified view of our Spark application:

// load and transform data, then cache the result
df1 = transform1(sqlCtx.read().options(...).parquet('path/to/data'))
df1.cache()

// perform some further transforms of the cached data
df2 = transform2(df1)
df3 = transform3(df1)

// write the final data out to Redshift
df2.write().options(...).(format "com.databricks.spark.redshift").save()
df3.write().options(...).(format "com.databricks.spark.redshift").save()


When the application runs, the steps are executed in the following order:
- scan parquet folder
- transform1 executes
- df1 stored in cache
- transform2 executes
- df2 written to Redshift (while cluster sits idle)
- transform3 executes
- df3 written to Redshift

I would like transform3 to begin executing as soon as the cluster has
capacity, without having to wait for df2 to be written to Redshift, so I
tried rewriting the last two lines as (again pseudocode):

f1 = future{df2.write().options(...).(format
"com.databricks.spark.redshift").save()}.execute()
f2 = future{df3.write().options(...).(format
"com.databricks.spark.redshift").save()}.execute()
f1.get()
f2.get()

In the hope that the first write would no longer block the following steps,
but instead it fails with a TimeoutException (see stack trace in previous
message). Is there a way to start the different writes concurrently, or is
that not possible in Spark?
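
For what it's worth, concurrent actions from one SparkContext generally do work
when each action is submitted from its own thread. Below is a minimal sketch
using plain Scala Futures, assuming df2, df3 and a hypothetical writeOptions
map of Redshift settings are in scope; it is untested and does not address the
spark-redshift TimeoutException itself:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.DataFrame

// Hypothetical helper: write one DataFrame to Redshift via spark-redshift.
def writeToRedshift(df: DataFrame, table: String, writeOptions: Map[String, String]): Unit =
  df.write
    .format("com.databricks.spark.redshift")
    .options(writeOptions)
    .option("dbtable", table)
    .save()

// Each Future runs on its own thread, so the two save() actions are submitted
// to the Spark scheduler concurrently rather than back to back.
val writes = Seq(
  Future(writeToRedshift(df2, "output_table_2", writeOptions)),
  Future(writeToRedshift(df3, "output_table_3", writeOptions)))

// Block the driver until both writes finish (or one of them fails).
writes.foreach(w => Await.result(w, Duration.Inf))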



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Concurrent-Spark-jobs-tp26011p26030.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: is recommendProductsForUsers available in ALS?

2016-01-21 Thread Nick Pentreath
These methods are available in Spark 1.6

On Tue, Jan 19, 2016 at 12:18 AM, Roberto Pagliari <
roberto.pagli...@asos.com> wrote:

> With Spark 1.5, the following code:
>
> from pyspark import SparkContext, SparkConf
> from pyspark.mllib.recommendation import ALS, Rating
> r1 = (1, 1, 1.0)
> r2 = (1, 2, 2.0)
> r3 = (2, 1, 2.0)
> ratings = sc.parallelize([r1, r2, r3])
> model = ALS.trainImplicit(ratings, 1, seed=10)
>
> res = model.recommendProductsForUsers(2)
>
> raises the error
>
>
> ---
> AttributeErrorTraceback (most recent call
> last)
>  in ()
>   7 model = ALS.trainImplicit(ratings, 1, seed=10)
>   8
> > 9 res = model.recommendProductsForUsers(2)
>
> AttributeError: 'MatrixFactorizationModel' object has no attribute
> ‘recommendProductsForUsers'
>
> If the method is not available, is there a workaround with a large number
> of users and products?
>


Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Ted Yu
Please also check AppMaster log. 

Thanks

> On Jan 21, 2016, at 3:51 AM, Akhil Das  wrote:
> 
> Can you look in the executor logs and see why the sparkcontext is being 
> shutdown? Similar discussion happened here previously. 
> http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-td23668.html
> 
> Thanks
> Best Regards
> 
>> On Thu, Jan 21, 2016 at 5:11 PM, Soni spark  wrote:
>> Hi Friends,
>> 
>> I spark job is successfully running on local mode but failing on cluster 
>> mode. Below is the error message i am getting. anyone can help me.
>> 
>> 
>> 16/01/21 16:38:07 INFO twitter4j.TwitterStreamImpl: Establishing connection.
>> 16/01/21 16:38:07 INFO twitter.TwitterReceiver: Twitter receiver started
>> 16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Called receiver 
>> onStart
>> 16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Waiting for receiver 
>> to be stopped
>> 16/01/21 16:38:10 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
>> 16/01/21 16:38:10 INFO streaming.StreamingContext: Invoking 
>> stop(stopGracefully=false) from shutdown hook
>> 16/01/21 16:38:10 INFO scheduler.ReceiverTracker: Sent stop signal to all 1 
>> receivers
>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Received stop signal
>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopping receiver 
>> with message: Stopped by driver: 
>> 16/01/21 16:38:10 INFO twitter.TwitterReceiver: Twitter receiver stopped
>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Called receiver 
>> onStop
>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Deregistering 
>> receiver 0
>> 16/01/21 16:38:10 ERROR scheduler.ReceiverTracker: Deregistered receiver for 
>> stream 0: Stopped by driver
>> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopped receiver 0
>> 16/01/21 16:38:10 INFO receiver.BlockGenerator: Stopping BlockGenerator
>> 16/01/21 16:38:10 INFO yarn.ApplicationMaster: Waiting for spark context 
>> initialization ...
>> 
>> Thanks
>> Soniya 
> 


Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Akhil Das
Can you look in the executor logs and see why the sparkcontext is being
shutdown? Similar discussion happened here previously.
http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-td23668.html

Thanks
Best Regards

On Thu, Jan 21, 2016 at 5:11 PM, Soni spark 
wrote:

> Hi Friends,
>
> I spark job is successfully running on local mode but failing on cluster 
> mode. Below is the error message i am getting. anyone can help me.
>
>
>
> 16/01/21 16:38:07 INFO twitter4j.TwitterStreamImpl: Establishing connection.
> 16/01/21 16:38:07 INFO twitter.TwitterReceiver: Twitter receiver started
> 16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Called receiver 
> onStart
> 16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Waiting for receiver 
> to be stopped*16/01/21 16:38:10 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 
> 15: SIGTERM*
> 16/01/21 16:38:10 INFO streaming.StreamingContext: Invoking 
> stop(stopGracefully=false) from shutdown hook
> 16/01/21 16:38:10 INFO scheduler.ReceiverTracker: Sent stop signal to all 1 
> receivers
> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Received stop signal
> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopping receiver 
> with message: Stopped by driver:
> 16/01/21 16:38:10 INFO twitter.TwitterReceiver: Twitter receiver stopped
> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Called receiver onStop
> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Deregistering 
> receiver 0*16/01/21 16:38:10 ERROR scheduler.ReceiverTracker: Deregistered 
> receiver for stream 0: Stopped by driver*
> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopped receiver 0
> 16/01/21 16:38:10 INFO receiver.BlockGenerator: Stopping BlockGenerator
> 16/01/21 16:38:10 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization ...
>
> Thanks
>
> Soniya
>
>


spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Soni spark
Hi Friends,

My spark job runs successfully in local mode but fails in cluster mode. Below
is the error message I am getting; can anyone help me?



16/01/21 16:38:07 INFO twitter4j.TwitterStreamImpl: Establishing connection.
16/01/21 16:38:07 INFO twitter.TwitterReceiver: Twitter receiver started
16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Called receiver onStart
16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Waiting for
receiver to be stopped*16/01/21 16:38:10 ERROR yarn.ApplicationMaster:
RECEIVED SIGNAL 15: SIGTERM*
16/01/21 16:38:10 INFO streaming.StreamingContext: Invoking
stop(stopGracefully=false) from shutdown hook
16/01/21 16:38:10 INFO scheduler.ReceiverTracker: Sent stop signal to
all 1 receivers
16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Received stop signal
16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopping
receiver with message: Stopped by driver:
16/01/21 16:38:10 INFO twitter.TwitterReceiver: Twitter receiver stopped
16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Called receiver onStop
16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Deregistering
receiver 0*16/01/21 16:38:10 ERROR scheduler.ReceiverTracker:
Deregistered receiver for stream 0: Stopped by driver*
16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopped receiver 0
16/01/21 16:38:10 INFO receiver.BlockGenerator: Stopping BlockGenerator
16/01/21 16:38:10 INFO yarn.ApplicationMaster: Waiting for spark
context initialization ...

Thanks

Soniya


Number of executors in Spark - Kafka

2016-01-21 Thread Guillermo Ortiz
I'm using Spark Streaming and Kafka with the Direct Approach. I have created a
topic with 6 partitions, so when Spark executes there are six RDD partitions. I
understand that ideally there should be six executors so that each one processes
one partition. To do that, when I execute spark-submit (I use YARN) I set the
number of executors to six.
If I don't specify anything it just creates one executor. Looking for
information, I have read:

"The --num-executors command-line flag or spark.executor.instances
configuration
property control the number of executors requested. Starting in CDH
5.4/Spark 1.3, you will be able to avoid setting this property by turning
on dynamic allocation

with
thespark.dynamicAllocation.enabled property. Dynamic allocation enables a
Spark application to request executors when there is a backlog of pending
tasks and free up executors when idle."

I have this parameter enabled, so I understand that even if I don't set
--num-executors it should end up with six executors, or am I wrong?
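
A minimal sketch of the two options, using only standard Spark-on-YARN
properties (the values are placeholders):

import org.apache.spark.SparkConf

// Option 1: fix the executor count explicitly (equivalent to
// spark-submit --num-executors 6), e.g. one executor per Kafka partition.
val fixedConf = new SparkConf()
  .setAppName("KafkaDirectStream")
  .set("spark.executor.instances", "6")

// Option 2: let YARN scale the executor count with the pending-task backlog.
// Dynamic allocation on YARN also needs the external shuffle service.
val dynamicConf = new SparkConf()
  .setAppName("KafkaDirectStream")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")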


Spark Streaming Write Ahead Log (WAL) not replaying data after restart

2016-01-21 Thread Patrick McGloin
Hi all,

To have a simple way of testing the Spark Streaming Write Ahead Log I
created a very simple Custom Input Receiver, which will generate strings
and store those:

class InMemoryStringReceiver extends
Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  val batchID = System.currentTimeMillis()

  def onStart() {
new Thread("InMemoryStringReceiver") {
  override def run(): Unit = {
var i = 0
while(true) {
  //http://spark.apache.org/docs/latest/streaming-custom-receivers.html
  //To implement a reliable receiver, you have to use
store(multiple-records) to store data.
  store(ArrayBuffer(s"$batchID-$i"))
  println(s"Stored => [$batchID-$i)]")
  Thread.sleep(1000L)
  i = i + 1
}
  }
}.start()
  }

  def onStop() {}
}

I then created a simple Application which will use the Custom Receiver to
stream the data and process it:

object DStreamResilienceTest extends App {

  val conf = new
SparkConf().setMaster("local[*]").setAppName("DStreamResilienceTest").set("spark.streaming.receiver.writeAheadLog.enable",
"true")
  val ssc = new StreamingContext(conf, Seconds(1))
  
ssc.checkpoint("hdfs://myhdfsserver/user/spark/checkpoint/DStreamResilienceTest")
  val customReceiverStream: ReceiverInputDStream[String] =
ssc.receiverStream(new InMemoryStringReceiver())
  customReceiverStream.foreachRDD { (rdd: RDD[String]) =>
println(s"processed => [${rdd.collect().toList}]")
Thread.sleep(2000L)
  }
  ssc.start()
  ssc.awaitTermination()

}

As you can see the processing of each received RDD has sleep of 2 seconds
while the Strings are stored every second. This creates a backlog and the
new strings pile up, and should be stored in the WAL. Indeed, I can see the
files in the checkpoint dirs getting updated. Running the app I get output
like this:

[info] Stored => [1453374654941-0)]
[info] processed => [List(1453374654941-0)]
[info] Stored => [1453374654941-1)]
[info] Stored => [1453374654941-2)]
[info] processed => [List(1453374654941-1)]
[info] Stored => [1453374654941-3)]
[info] Stored => [1453374654941-4)]
[info] processed => [List(1453374654941-2)]
[info] Stored => [1453374654941-5)]
[info] Stored => [1453374654941-6)]
[info] processed => [List(1453374654941-3)]
[info] Stored => [1453374654941-7)]
[info] Stored => [1453374654941-8)]
[info] processed => [List(1453374654941-4)]
[info] Stored => [1453374654941-9)]
[info] Stored => [1453374654941-10)]

As you would expect, the storing is out pacing the processing. So I kill
the application and restart it. This time I commented out the sleep in the
foreachRDD so that the processing can clear any backlog:

[info] Stored => [1453374753946-0)]
[info] processed => [List(1453374753946-0)]
[info] Stored => [1453374753946-1)]
[info] processed => [List(1453374753946-1)]
[info] Stored => [1453374753946-2)]
[info] processed => [List(1453374753946-2)]
[info] Stored => [1453374753946-3)]
[info] processed => [List(1453374753946-3)]
[info] Stored => [1453374753946-4)]
[info] processed => [List(1453374753946-4)]

As you can see the new events are processed but none from the previous
batch. The old WAL logs are cleared and I see log messages like this but
the old data does not get processed.

INFO WriteAheadLogManager : Recovered 1 write ahead log files from
hdfs://myhdfsserver/user/spark/checkpoint/DStreamResilienceTest/receivedData/0

What am I doing wrong? I am using Spark 1.5.2.

Best regards,

Patrick
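
For reference, WAL and checkpoint recovery generally require the restarted
driver to recreate its context from the checkpoint directory via
StreamingContext.getOrCreate, rather than constructing a fresh StreamingContext
on every run. A minimal, untested sketch of that pattern, reusing the same
checkpoint path and receiver as above:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamRecoveryTest extends App {

  val checkpointDir = "hdfs://myhdfsserver/user/spark/checkpoint/DStreamResilienceTest"

  // Only called when no checkpoint exists yet; on restart the context, its
  // DStream graph and any unprocessed WAL blocks are recovered instead.
  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("DStreamResilienceTest")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)
    val stream = ssc.receiverStream(new InMemoryStringReceiver())
    stream.foreachRDD { (rdd: RDD[String]) =>
      println(s"processed => [${rdd.collect().toList}]")
    }
    ssc
  }

  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()
}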


Passing binding variable in query used in Data Source API

2016-01-21 Thread satish chandra j
Hi All,

We have requirement to fetch data from source PostgreSQL database as per a
condition, hence need to pass a binding variable in query used in Data
Source API as below:


var DeptNbr = 10

val dataSource_dF = cc.load("jdbc", Map(
  "url" -> "jdbc:postgresql://10.00.00.000:5432/db_test?user=username&password=password",
  "driver" -> "org.postgresql.Driver",
  "dbtable" -> "(select * from schema.table1 where dept_number=DeptNbr) as table1"))


But it errors saying expected ';' but found '='


Note: as it is an iterative approach, we cannot use constants but need to
pass a variable to the query.


If anybody has a similar implementation that passes a binding variable while
fetching data from a source database using the Data Source API, please provide
details on the same.


Regards,

Satish Chandra
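
A minimal sketch of one way to splice the variable into the query with Scala
string interpolation, keeping the rest of the call as above (untested):

val DeptNbr = 10

// The s-interpolator substitutes the current value of DeptNbr into the dbtable
// string, so the database sees "dept_number=10" rather than the literal name.
val dataSource_dF = cc.load("jdbc", Map(
  "url" -> "jdbc:postgresql://10.00.00.000:5432/db_test?user=username&password=password",
  "driver" -> "org.postgresql.Driver",
  "dbtable" -> s"(select * from schema.table1 where dept_number=$DeptNbr) as table1"))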


RE: Container exited with a non-zero exit code 1-SparkJOb on YARN

2016-01-21 Thread Siddharth Ubale
Hi Wellington,

Thanks for the reply.

I have kept the default values for the two properties you mentioned.
The zip file is expected by the spark job in the spark staging folder in HDFS;
none of the documentation mentions this file.
Also, I have noticed one more thing: whenever YARN allocates containers on
the machine from which I am running the code, the spark job runs; otherwise
it always fails.

Thanks,
Siddharth Ubale



-Original Message-
From: Wellington Chevreuil [mailto:wellington.chevre...@gmail.com] 
Sent: Thursday, January 21, 2016 3:44 PM
To: Siddharth Ubale 
Subject: Re: Container exited with a non-zero exit code 1-SparkJOb on YARN

Hi,

For the memory issues, you might need to review current values for maximum 
allowed container memory on YARN configuration. Check values current defined 
for "yarn.nodemanager.resource.memory-mb" and 
"yarn.scheduler.maximum-allocation-mb" properties.

Regarding the file issue, is the file available on hdfs? Is there anything else 
writing/changing the file while the job runs?


> On 20 Jan 2016, at 12:29, Siddharth Ubale  wrote:
> 
> Hi,
>  
> I am running a Spark Job on the yarn cluster.
> The spark job is a spark streaming application which is reading JSON from a 
> kafka topic , inserting the JSON values to hbase tables via Phoenix , ands 
> then sending out certain messages to a websocket if the JSON satisfies a 
> certain criteria.
>  
> My cluster is a 3 node cluster with 24GB ram and 24 cores in total.
>  
> Now :
> 1. when I am submitting the job with 10GB memory, the application 
> fails saying memory is insufficient to run the job 2. The job is submitted 
> with 6G ram. However, it does not run successfully always.Common issues faced 
> :
> a. Container exited with a non-zero exit code 1 , and after 
> multiple such warning the job is finished.
> d. The failed job notifies that it was unable to find 
> a file in HDFS which is something like _hadoop_conf_xx.zip
>  
> Can someone pls let me know why am I seeing the above 2 issues.
>  
> Thanks,
> Siddharth Ubale,


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: best practice : how to manage your Spark cluster ?

2016-01-21 Thread Arkadiusz Bicz
Hi Charles,

We are using Ambari for Hadoop/Spark service management, versioning
and monitoring in the cluster.

For real-time monitoring of Spark jobs and of the cluster hosts (disks, memory,
CPU, network) we use graphite + grafana + collectd + Spark metrics

http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

BR,

Arkadiusz Bicz

On Thu, Jan 21, 2016 at 5:33 AM, charles li  wrote:
> I've put a thread before:  pre-install 3-party Python package on spark
> cluster
>
> currently I use Fabric to manage my cluster , but it's not enough for me,
> and I believe there is a much better way to manage and monitor the cluster.
>
> I believe there really exists some open source manage tools which provides a
> web UI allowing me to [ what I need exactly ]:
>
> monitor the cluster machine's state in real-time, say memory, network, disk
> list all the services, packages on each machine
> install / uninstall / upgrade / downgrade package through a web UI
> start / stop / restart services on that machine
>
>
>
> great thanks
>
> --
> --
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to use scala.math.Ordering in java

2016-01-21 Thread Dave

Thanks Ted.


On 20/01/16 18:24, Ted Yu wrote:

Please take a look at the following files for some examples:

sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java
sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java

Cheers

On Wed, Jan 20, 2016 at 1:03 AM, ddav wrote:


Hi,

I am writing my Spark application in java and I need to use a
RangePartitioner.

JavaPairRDD<String, ProgramDataRef> progRef1 =
    sc.textFile(programReferenceDataFile, 12)
      .filter((String s) -> !s.startsWith("#"))
      .mapToPair((String s) -> {
          ProgramDataRef ref = new ProgramDataRef(s);
          return new Tuple2<String, ProgramDataRef>(ref.genre, ref);
      });

RangePartitioner<String, ProgramDataRef> rangePart =
    new RangePartitioner<>(12, progRef1.rdd(), true, ?, progRef1.kClassTag());

I can't determine how to create the correct object for parameter 4
which is
"scala.math.Ordering evidence$1" from the documentation. From the
scala.math.Ordering code I see there are many implicit objects and one
handles Strings. How can I access them from Java.

Thanks,
Dave.



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-scala-math-Ordering-in-java-tp26019.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org







Re: Parquet write optimization by row group size config

2016-01-21 Thread Pavel Plotnikov
I have got about 25 separate gzipped log files per hour. File sizes vary a
lot, from 10MB to 50MB of gzipped JSON data. So, I convert this data to
parquet each hour. The code is very simple, in Python:

text_file = sc.textFile(src_file)
df = sqlCtx.jsonRDD(text_file.map(lambda x:
x.split('\t')[2]).map(json.loads).flatMap(flatting_events).map(specific_keys_types_wrapper).map(json.dumps))
df.write.parquet(out_file, mode='overwrite')
-

The JSON in the log files is not clean, and I need to do some preparation via
RDD. The output parquet files are very small, about 35MB for the largest source
files. The source log files are converted one by one. It is cool that all the
converting transformations are executed quickly on a lot of machine cores,
but when I run htop on my machines I find that mostly only one core is used,
which is very strange.
First thought: create a separate spark context for each input file (or group of
files) and allocate it only 2 cores, so that all the servers' power would be
used. But this solution looks ugly and eliminates all the beauty of Spark;
maybe this case is just not for Spark.
I found that in the first seconds the job uses all available cores but then
works on only one, and it is not an IO problem (the file sizes are too small
for RAID over SSD). So, second thought: the problem is in the parquet files.
After reading some docs, I understand that parquet has got a lot of levels of
parallelism, and I should look for a solution there.

On Thu, Jan 21, 2016 at 10:35 AM Jörn Franke  wrote:

> What is your data size, the algorithm and the expected time?
> Depending on this the group can recommend you optimizations or tell you
> that the expectations are wrong
>
> On 20 Jan 2016, at 18:24, Pavel Plotnikov 
> wrote:
>
> Thanks, Akhil! It helps, but this jobs still not fast enough, maybe i
> missed something
>
> Regards,
> Pavel
>
> On Wed, Jan 20, 2016 at 9:51 AM Akhil Das 
> wrote:
>
>> Did you try re-partitioning the data before doing the write?
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
>> pavel.plotni...@team.wrike.com> wrote:
>>
>>> Hello,
>>> I'm using spark on some machines in standalone mode, data storage is
>>> mounted on this machines via nfs. A have input data stream and when i'm
>>> trying to store all data for hour in parquet, a job executes mostly on one
>>> core and this hourly data are stored in 40- 50 minutes. It is very slow!
>>> And it is not IO problem. After research how parquet file works, i'm found
>>> that it can be parallelized on row group abstraction level.
>>> I think row group for my files is to large, and how can i change it?
>>> When i create to big DataFrame i devides in parts very well and writes
>>> quikly!
>>>
>>> Thanks,
>>> Pavel
>>>
>>
>>