Adding more slaves to a running cluster

2015-11-24 Thread Dillian Murphey
What's the current status on adding slaves to a running cluster?  I want to
leverage spark-ec2 and autoscaling groups.  I want to launch slaves as spot
instances when I need to do some heavy lifting, but I don't want to bring
down my cluster in order to add nodes.

Can this be done by just running start-slave.sh?

What about using Mesos?

I just want to create an AMI for a slave and on some trigger launch it and
have it automatically add itself to the cluster.

thanks


Re: RDD partition after calling mapToPair

2015-11-24 Thread trung kien
Thanks Cody for the very useful information.

It's much clearer to me now. I had a lot of wrong assumptions.
On Nov 23, 2015 10:19 PM, "Cody Koeninger"  wrote:

> Partitioner is an optional field when defining an rdd.  KafkaRDD doesn't
> define one, so you can't really assume anything about the way it's
> partitioned, because spark doesn't know anything about the way it's
> partitioned.  If you want to rely on some property of how things were
> partitioned as they were being produced into kafka, you need to do
> foreachPartition or mapPartition yourself.  Otherwise, spark will do a
> shuffle for any operation that would ordinarily require a shuffle, even if
> keys are already in the "right" place.
>
> Regarding the assignment of cores to partitions, that's not really
> accurate.  Each kafka partition will correspond to a spark partition.  If
> you do an operation that shuffles, that relationship no longer holds true.
> Even if you're doing a straight map operation without a shuffle, you will
> probably get 1 executor core working on 1 partition, but there's no
> guarantee the scheduler will do that, and no guarantee it'll be the same
> core / partition relationship for the next batch.
>
>
> On Mon, Nov 23, 2015 at 9:01 AM, Thúy Hằng Lê 
> wrote:
>
>> Thanks Cody,
>>
>> I still have concerns about this.
>> What do you mean by saying Spark direct stream doesn't have a default
>> partitioner? Could you please explain a bit more?
>>
>> When I assign 20 cores to 20 Kafka partitions, I am expecting each core
>> to work on a partition. Is that correct?
>>
>> I still couldn't figure out how the RDD will be partitioned after the mapToPair
>> function. It would be great if you could briefly explain (or send me some
>> documentation, I couldn't find it) how the shuffle works on the mapToPair function.
>>
>> Thank you very much.
>> On Nov 23, 2015 12:26 AM, "Cody Koeninger"  wrote:
>>
>>> Spark direct stream doesn't have a default partitioner.
>>>
>>> If you know that you want to do an operation on keys that are already
>>> partitioned by kafka, just use mapPartitions or foreachPartition to avoid a
>>> shuffle.
>>>
>>> On Sat, Nov 21, 2015 at 11:46 AM, trung kien  wrote:
>>>
 Hi all,

 I am having problem of understanding how RDD will be partitioned after
 calling mapToPair function.
 Could anyone give me more information about parititoning in this
 function?

 I have a simple application doing following job:

 JavaPairInputDStream messages =
 KafkaUtils.createDirectStream(...)

 JavaPairDStream stats = messages.mapToPair(JSON_DECODE)

 .reduceByKey(SUM);

 saveToDB(stats)

 I set up 2 workers (each dedicating 20 cores) for this job.
 My Kafka topic has 40 partitions (I want each core to handle a partition),
 and the messages sent to the queue are partitioned by the same key as in the
 mapToPair function.
 I'm using the default partitioner for both Kafka and Spark.

 Ideally, I shouldn't see data shuffled between cores in the mapToPair
 stage, right?
 However, in my Spark UI, I see that the "Locality Level" for this stage
 is "ANY", which means data needs to be transferred.
 Any comments on this?

 --
 Thanks
 Kien

>>>
>>>
>
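To make Cody's suggestion concrete, here is a minimal Scala sketch (the thread's code is Java, but the idea is the same); it assumes the direct stream carries (String, String) pairs, that values are numeric strings, and that saveToDB is a hypothetical sink:

import org.apache.spark.streaming.dstream.DStream

// messages is assumed to be a DStream[(String, String)] coming from
// KafkaUtils.createDirectStream; saveToDB is a hypothetical sink.
def aggregatePerPartition(messages: DStream[(String, String)]): Unit = {
  messages.foreachRDD { rdd =>
    rdd.foreachPartition { iter =>
      // Records produced to the same Kafka partition arrive together here,
      // so per-key sums can be computed locally without any shuffle.
      val localSums = scala.collection.mutable.Map.empty[String, Double].withDefaultValue(0.0)
      iter.foreach { case (key, value) =>
        localSums(key) += value.toDouble // assumes values are numeric strings
      }
      // saveToDB(localSums)             // hypothetical per-partition write
    }
  }
}

This only avoids a shuffle if each key really lives in a single Kafka partition, which is exactly the assumption being discussed above.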


Re: Getting the batch time of the active batches in spark streaming

2015-11-24 Thread Todd Nist
Hi Abhi,

You should be able to register a
org.apache.spark.streaming.scheduler.StreamingListener.

There is an example here that may help:
https://gist.github.com/akhld/b10dc491aad1a2007183 and the spark api docs
here,
http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SparkListener.html
.

HTH,
-Todd

On Tue, Nov 24, 2015 at 4:50 PM, Abhishek Anand 
wrote:

> Hi ,
>
> I need to get the batch time of the active batches which appears on the UI
> of spark streaming tab,
>
> How can this be achieved in Java ?
>
> BR,
> Abhi
>


Re: Getting the batch time of the active batches in spark streaming

2015-11-24 Thread Todd Nist
Hi Abhi,

Sorry, that was the wrong link; it should have been the StreamingListener:
http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/scheduler/StreamingListener.html

The BatchInfo can be obtained from the event, for example:

public void onBatchSubmitted(StreamingListenerBatchSubmitted batchSubmitted) {
    System.out.println("Start time: " + batchSubmitted.batchInfo().processingStartTime());
}

Sorry for the confusion.

-Todd
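For completeness, a minimal sketch of a listener that tracks the batch times of currently active batches, shown in Scala (the Java listener API mirrors it; ssc is an assumed existing StreamingContext):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted, StreamingListenerBatchStarted}

// Tracks the batch times of batches that have started but not yet completed.
class ActiveBatchListener extends StreamingListener {
  private val active = scala.collection.mutable.Set.empty[Long]

  override def onBatchStarted(started: StreamingListenerBatchStarted): Unit = synchronized {
    active += started.batchInfo.batchTime.milliseconds
    println(s"Active batch times: ${active.mkString(", ")}")
  }

  override def onBatchCompleted(completed: StreamingListenerBatchCompleted): Unit = synchronized {
    active -= completed.batchInfo.batchTime.milliseconds
  }
}

// ssc is an existing StreamingContext:
// ssc.addStreamingListener(new ActiveBatchListener)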

On Tue, Nov 24, 2015 at 7:51 PM, Todd Nist  wrote:

> Hi Abhi,
>
> You should be able to register a
> org.apache.spark.streaming.scheduler.StreamListener.
>
> There is an example here that may help:
> https://gist.github.com/akhld/b10dc491aad1a2007183 and the spark api docs
> here,
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SparkListener.html
> .
>
> HTH,
> -Todd
>
> On Tue, Nov 24, 2015 at 4:50 PM, Abhishek Anand 
> wrote:
>
>> Hi ,
>>
>> I need to get the batch time of the active batches which appears on the
>> UI of spark streaming tab,
>>
>> How can this be achieved in Java ?
>>
>> BR,
>> Abhi
>>
>
>


java.io.IOException when using KryoSerializer

2015-11-24 Thread Piero Cinquegrana
Hello,

I am using spark 1.4.1 with Zeppelin.  When using the kryo serializer,

spark.serializer = org.apache.spark.serializer.KryoSerializer

instead of the default Java serializer I am getting the following error. Is 
this a known issue?

Thanks,
Piero


java.io.IOException: Failed to connect to 
ip-10-0-0-243.ec2.internal/10.0.0.243:52161
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at 
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:149)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:262)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:115)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:76)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
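Not an answer to the exception itself, but for reference, a minimal sketch of how Kryo is usually enabled and application classes registered when you control SparkContext creation yourself (the record class below is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Long, value: String) // hypothetical application class

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "false") // "true" surfaces unregistered classes early
  .registerKryoClasses(Array(classOf[MyRecord]))

val sc = new SparkContext(conf)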



Does receiver based approach lose any data in case of a leader/broker loss in Spark Streaming?

2015-11-24 Thread SRK
Hi,

Does the receiver-based approach lose any data in case of a leader/broker loss
in Spark Streaming? We currently use Kafka Direct for Spark Streaming and it
seems to be failing when there is a leader loss, and we can't really
guarantee that there won't be any leader loss due to rebalancing.

If we go with the receiver-based approach, would it be able to overcome that
situation?


Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-receiver-based-approach-lose-any-data-in-case-of-a-leader-broker-loss-in-Spark-Streaming-tp25470.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Does receiver based approach lose any data in case of a leader/broker loss in Spark Streaming?

2015-11-24 Thread Cody Koeninger
The direct stream shouldn't silently lose data in the case of a leader
loss. Loss of a leader is handled like any other failure, retrying up to
spark.task.maxFailures times.

But really, if you're losing leaders and taking that long to rebalance, you
should figure out what's wrong with your Kafka cluster and fix it,
regardless of which consumer you're using.



On Tue, Nov 24, 2015 at 10:55 PM, SRK  wrote:

> Hi,
>
> Does receiver based approach lose any data in case of a leader/broker loss
> in Spark Streaming? We currently use Kafka Direct for Spark Streaming and
> it
> seems to be failing out when there is a  leader loss and we can't really
> guarantee that there won't be any leader loss due rebalancing.
>
> If we go with receiver based approach, would it be able to overcome that
> situation?
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-receiver-based-approach-lose-any-data-in-case-of-a-leader-broker-loss-in-Spark-Streaming-tp25470.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


queries on Spork (Pig on Spark)

2015-11-24 Thread Divya Gehlot
Hi,


As a beginner, I have the below queries on Spork (Pig on Spark).
I have cloned the repo: git clone https://github.com/apache/pig -b spark
1. On which versions of Pig and Spark is Spork being built?
2. I followed the steps mentioned in https://issues.apache.org/jira/browse/PIG-4059
and tried to run a simple Pig script that just loads a file and dumps/stores it.
I am getting these errors:

grunt> A = load '/tmp/words_tb.txt' using PigStorage('\t') as
(empNo:chararray,empName:chararray,salary:chararray);
grunt> Store A into
'/tmp/spork';

2015-11-25 05:35:52,502 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: UNKNOWN
2015-11-25 05:35:52,875 [main] WARN  org.apache.pig.data.SchemaTupleBackend
- SchemaTupleBackend has already been initialized
2015-11-25 05:35:52,883 [main] INFO
org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - Not MR
mode. RollupHIIOptimizer is disabled
2015-11-25 05:35:52,894 [main] INFO
org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator,
GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter,
MergeFilter, MergeForEach, PartitionFilterOptimizer,
PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter,
SplitFilter, StreamTypeCastInserter]}
2015-11-25 05:35:52,966 [main] INFO  org.apache.pig.data.SchemaTupleBackend
- Key [pig.schematuple] was not set... will not generate code.
2015-11-25 05:35:52,983 [main] INFO
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - add
Files Spark Job
2015-11-25 05:35:53,137 [main] INFO
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - Added
jar pig-0.15.0-SNAPSHOT-core-h2.jar
2015-11-25 05:35:53,138 [main] INFO
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - Added
jar pig-0.15.0-SNAPSHOT-core-h2.jar
2015-11-25 05:35:53,138 [main] INFO
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher -
Converting operator POLoad (Name: A:
Load(/tmp/words_tb.txt:PigStorage(' ')) - scope-29 Operator Key: scope-29)
2015-11-25 05:35:53,205 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 2998: Unhandled internal error. Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
Details at logfile: /home/pig/pig_1448425672112.log


Can you please help me pinpoint what's wrong?

I appreciate your help.

Thanks,

Regards,

Divya


Re: queries on Spork (Pig on Spark)

2015-11-24 Thread Divya Gehlot
Log files content :
Pig Stack Trace
---
ERROR 2998: Unhandled internal error. Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
 at org.apache.spark.SparkContext.withScope(SparkContext.scala:681)
 at org.apache.spark.SparkContext.newAPIHadoopRDD(SparkContext.scala:1094)
 at
org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:91)
 at
org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:61)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:666)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:585)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:534)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:209)
 at
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
 at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
 at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
 at org.apache.pig.PigServer.storeEx(PigServer.java:1034)
 at org.apache.pig.PigServer.store(PigServer.java:997)
 at org.apache.pig.PigServer.openIterator(PigServer.java:910)
 at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
 at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
 at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
 at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
 at org.apache.pig.Main.run(Main.java:558)
 at org.apache.pig.Main.main(Main.java:170)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


I didn't understand the problem behind the error.

Thanks,
Regards,
Divya

On 25 November 2015 at 14:00, Jeff Zhang  wrote:

> >>> Details at logfile: /home/pig/pig_1448425672112.log
>
> You need to check the log file for details
>
>
>
>
> On Wed, Nov 25, 2015 at 1:57 PM, Divya Gehlot 
> wrote:
>
>> Hi,
>>
>>
>> As a beginner ,I have below queries on Spork(Pig on Spark).
>> I have cloned  git clone https://github.com/apache/pig -b spark .
>> 1.On which version of Pig and Spark , Spork  is being built ?
>> 2. I followed the steps mentioned in https://issues.apache.org/jira/browse/PIG-4059
>> and try to run simple pig script just like Load the
>> file and dump/store it.
>> Getting errors :
>>
>>>
>> grunt> A = load '/tmp/words_tb.txt' using PigStorage('\t') as
>> (empNo:chararray,empName:chararray,salary:chararray);
>> grunt> Store A into
>> '/tmp/spork';
>>
>> 2015-11-25 05:35:52,502 [main] INFO
>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
>> script: UNKNOWN
>> 2015-11-25 05:35:52,875 [main] WARN
>> org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already
>> been initialized
>> 2015-11-25 05:35:52,883 [main] INFO
>> org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - Not MR
>> mode. RollupHIIOptimizer is disabled
>> 2015-11-25 05:35:52,894 [main] INFO
>> org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
>> {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator,
>> GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter,
>> MergeFilter, MergeForEach, PartitionFilterOptimizer,
>> PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter,
>> SplitFilter, StreamTypeCastInserter]}
>> 2015-11-25 05:35:52,966 [main] INFO
>> org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not
>> set... will not generate code.
>> 2015-11-25 05:35:52,983 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - add
>> Files Spark Job
>> 2015-11-25 05:35:53,137 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - Added
>> jar pig-0.15.0-SNAPSHOT-core-h2.jar
>> 2015-11-25 05:35:53,138 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - Added
>> jar pig-0.15.0-SNAPSHOT-core-h2.jar
>> 2015-11-25 05:35:53,138 [main] INFO
>> 

Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread AlexG
I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 cluster
with 16.73 Tb storage, using
distcp. The dataset is a collection of tar files of about 1.7 Tb each.
Nothing else was stored in the HDFS, but after completing the download, the
namenode page says that 11.59 Tb are in use. When I use hdfs du -h -s, I see
that the dataset only takes up 3.8 Tb as expected. I navigated through the
entire HDFS hierarchy from /, and don't see where the missing space is. Any
ideas what is going on and how to rectify it?

I'm using the spark-ec2 script to launch, with the command

spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
--placement-group=pcavariants --copy-aws-credentials
--hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
conversioncluster

and am not modifying any configuration files for Hadoop.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Koert Kuipers
what is your hdfs replication set to?

On Wed, Nov 25, 2015 at 1:31 AM, AlexG  wrote:

> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2
> cluster
> with 16.73 Tb storage, using
> distcp. The dataset is a collection of tar files of about 1.7 Tb each.
> Nothing else was stored in the HDFS, but after completing the download, the
> namenode page says that 11.59 Tb are in use. When I use hdfs du -h -s, I
> see
> that the dataset only takes up 3.8 Tb as expected. I navigated through the
> entire HDFS hierarchy from /, and don't see where the missing space is. Any
> ideas what is going on and how to rectify it?
>
> I'm using the spark-ec2 script to launch, with the command
>
> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
> --placement-group=pcavariants --copy-aws-credentials
> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
> conversioncluster
>
> and am not modifying any configuration files for Hadoop.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Ye Xianjin
Hi AlexG:

Files (blocks, more specifically) have 3 copies on HDFS by default, so 3.8 * 3 =
11.4 TB, which roughly matches the 11.59 TB you are seeing.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:

> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 cluster
> with 16.73 Tb storage, using
> distcp. The dataset is a collection of tar files of about 1.7 Tb each.
> Nothing else was stored in the HDFS, but after completing the download, the
> namenode page says that 11.59 Tb are in use. When I use hdfs du -h -s, I see
> that the dataset only takes up 3.8 Tb as expected. I navigated through the
> entire HDFS hierarchy from /, and don't see where the missing space is. Any
> ideas what is going on and how to rectify it?
> 
> I'm using the spark-ec2 script to launch, with the command
> 
> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
> --placement-group=pcavariants --copy-aws-credentials
> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
> conversioncluster
> 
> and am not modifying any configuration files for Hadoop.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> (mailto:user-unsubscr...@spark.apache.org)
> For additional commands, e-mail: user-h...@spark.apache.org 
> (mailto:user-h...@spark.apache.org)
> 
> 




Conversely, Hive is performing better than Spark-Sql

2015-11-24 Thread UMESH CHAUDHARY
Hi,
I am using Hive 1.1.0 and Spark 1.5.1 and creating hive context in
spark-shell.

Now I am seeing the performance reversed: Spark SQL is slower than Hive.
By default, Hive returns the result of a plain select * query in 27 seconds on a
1 GB dataset containing 3,623,203 records, while Spark SQL takes about 2 minutes
for the same query followed by a collect operation.

Cluster Config:
Hive : 6 Nodes : 16 GB memory, 4 cores each
Spark : 4 Nodes : 16 GB memory, 4 cores each

My dataset has 45 partitions and Spark SQL creates 82 jobs.

I have tried all the memory and garbage-collection optimizations suggested on
the official website but failed to get better performance, and it's worth
mentioning that I sometimes get an OOM error when I allocate less than 10 GB of
executor memory.

Can somebody tell me what's actually going on?


RE: Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-24 Thread Shuai Zheng
Hi All,

 

Just an update on this case.

I tried many different combinations of settings (and I just upgraded to the latest
EMR 4.2.0 with Spark 1.5.2).

I just found out that the problem comes from:

spark-submit --deploy-mode client --executor-cores=24 --driver-memory=5G
--executor-memory=45G

If I set --executor-cores=20 (or anything less than 20), there is no issue.

This is quite an interesting case, because the instance (c3.8xlarge) has 32
virtual cores and runs without any issue with one task node.

So I guess the issue comes from either:

1. A connection limit from the EC2 instance on EMR to S3 (this reason doesn't
make much sense to me; I will contact EMR support to clarify), or

2. Some library packed in the jar causing this limit (also not very likely).

Reporting here in case anyone faces a similar issue.

 

Regards,

 

Shuai

 

From: Jonathan Kelly [mailto:jonathaka...@gmail.com] 
Sent: Thursday, November 19, 2015 6:54 PM
To: Shuai Zheng
Cc: user
Subject: Re: Spark Tasks on second node never return in Yarn when I have more 
than 1 task node

 

I don't know if this actually has anything to do with why your job is hanging, 
but since you are using EMR you should probably not set those fs.s3 properties 
but rather let it use EMRFS, EMR's optimized Hadoop FileSystem implementation 
for interacting with S3. One benefit is that it will automatically pick up your 
AWS credentials from your EC2 instance role rather than you having to configure 
them manually (since doing so is insecure because you have to get the secret 
access key onto your instance).
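A minimal sketch of what that looks like on EMR (bucket and path are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// On EMR, s3:// URIs go through EMRFS and credentials come from the instance
// role, so no fs.s3* keys need to be set in the Hadoop configuration.
val sc = new SparkContext(new SparkConf().setAppName("Test"))
val lines = sc.textFile("s3://my-bucket/site-encode/")
println(lines.count())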

 

If simply making that change does not fix the issue, a jstack of the hung 
process would help you figure out what it is doing. You should also look at the 
YARN container logs (which automatically get uploaded to your S3 logs bucket if 
you have this enabled).

 

~ Jonathan

 

On Thu, Nov 19, 2015 at 1:32 PM, Shuai Zheng  wrote:

Hi All,

 

I face a very weird case. I have already simplify the scenario to the most so 
everyone can replay the scenario. 

 

My env:

 

AWS EMR 4.1.0, Spark1.5

 

My code runs without any problem in local mode, and it has
no problem when it runs on an EMR cluster with one master and one task node.

 

But when I try to run on multiple nodes (more than 1 task node, which means a
3-node cluster), the tasks never return from one of them.

 

The log as below:

 

15/11/19 21:19:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
ip-10-165-121-188.ec2.internal, PROCESS_LOCAL, 2241 bytes)

15/11/19 21:19:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
ip-10-155-160-147.ec2.internal, PROCESS_LOCAL, 2241 bytes)

15/11/19 21:19:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 
ip-10-165-121-188.ec2.internal, PROCESS_LOCAL, 2241 bytes)

15/11/19 21:19:07 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 
ip-10-155-160-147.ec2.internal, PROCESS_LOCAL, 2241 bytes)

 

So you can see the tasks are submitted alternately to two instances, one being
ip-10-165-121-188 and the other ip-10-155-160-147.

And later, only the tasks that run on ip-10-165-121-188.ec2 finish; the tasks on
ip-10-155-160-147.ec2 always just wait there and never return.

The data and code have been tested in local mode and single-node Spark cluster
mode, so it should not be an issue with the logic or the data.

 

And I have attached my test case here (I believe it is simple enough, and no
business logic is involved):

 

   public void createSiteGridExposure2() {
      JavaSparkContext ctx = this.createSparkContextTest("Test");
      // Generic type parameters were stripped by the mail archive; <String, String> is assumed here.
      ctx.textFile(siteEncodeLocation).flatMapToPair(new PairFlatMapFunction<String, String, String>() {
         @Override
         public Iterable<Tuple2<String, String>> call(String line) throws Exception {
            List<Tuple2<String, String>> res = new ArrayList<Tuple2<String, String>>();
            return res;
         }
      }).collectAsMap();
      ctx.stop();
   }

 

protected JavaSparkContext createSparkContextTest(String appName) {
   SparkConf sparkConf = new SparkConf().setAppName(appName);

   JavaSparkContext ctx = new JavaSparkContext(sparkConf);
   Configuration hadoopConf = ctx.hadoopConfiguration();
   if (awsAccessKeyId != null) {

      hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
      hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKeyId);
      hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey);

      hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
      hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId);
      hadoopConf.set("fs.s3n.awsSecretAccessKey",

Spark sql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append

2015-11-24 Thread Siva Gudavalli
Ref:https://issues.apache.org/jira/browse/SPARK-11953

In Spark 1.3.1 we had two methods, createJDBCTable and insertIntoJDBC.

They are replaced with write.jdbc() in Spark 1.4.1.


createJDBCTable allows performing CREATE TABLE (i.e., DDL) on the table,
followed by INSERT (DML).

insertIntoJDBC avoids performing DDL on the table and only does the INSERT (DML).


In Spark 1.4.1 both of the above methods are replaced by write.jdbc.


When I want to insert data into a table that already exists, I am
passing SaveMode.Append.


When I set SaveMode to Append, I would like to bypass

1) the tableExists check

2) Spark issuing CREATE TABLE in the scenario where the table is not already
present in the database


Please let me know if you think differently.

Regards
Shiv
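For context, a minimal sketch of the caller-side usage in 1.4.1 that reaches the code path quoted below (URL, table name and credentials are hypothetical):

import java.util.Properties
import org.apache.spark.sql.SaveMode

// df is an existing DataFrame; the URL, table name and credentials are hypothetical.
val props = new Properties()
props.setProperty("user", "app_user")
props.setProperty("password", "secret")

df.write
  .mode(SaveMode.Append) // intent: append to an existing table without any DDL
  .jdbc("jdbc:postgresql://dbhost:5432/appdb", "my_table", props)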



def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
  val conn = JdbcUtils.createConnection(url, connectionProperties)

  try {
    var tableExists = JdbcUtils.tableExists(conn, table)

    if (mode == SaveMode.Ignore && tableExists) {
      return
    }

    if (mode == SaveMode.ErrorIfExists && tableExists) {
      sys.error(s"Table $table already exists.")
    }

    if (mode == SaveMode.Overwrite && tableExists) {
      JdbcUtils.dropTable(conn, table)
      tableExists = false
    }

    // Create the table if the table didn't exist.
    if (!tableExists) {
      val schema = JDBCWriteDetails.schemaString(df, url)
      val sql = s"CREATE TABLE $table ($schema)"
      conn.prepareStatement(sql).executeUpdate()
    }
  } finally {
    conn.close()
  }

  JDBCWriteDetails.saveTable(df, url, table, connectionProperties)
}


Re: Spark Streaming idempotent writes to HDFS

2015-11-24 Thread Michael
So basically write them into a temporary directory named with the
batch time and then move the files to their destination on success? I
wish there were a way to skip moving files around and to be able to set
the output filenames.

Thanks Burak :)

-Michael


On Mon, Nov 23, 2015, at 09:19 PM, Burak Yavuz wrote:
> Not sure if it would be the most efficient, but maybe you can think of
> the filesystem as a key value store, and write each batch to a sub-
> directory, where the directory name is the batch time. If the
> directory already exists, then you shouldn't write it. Then you may
> have a following batch job that will coalesce files, in order to
> "close the day".
>
> Burak
>
> On Mon, Nov 23, 2015 at 8:58 PM, Michael  wrote:
>> Hi all,
>>
>> I'm working on a project with spark streaming; the goal is to process log
>> files from S3 and save them on hadoop to later analyze them with
>> sparkSQL.
>> Everything works well except when I kill the spark application and
>> restart it: it picks up from the latest processed batch and reprocesses
>> it, which results in duplicate data on hdfs.
>>
>> How can I make the writing step on hdfs idempotent? I couldn't find any
>> way to control, for example, the filenames of the parquet files being
>> written, the idea being to include the batch time so that the same batch
>> always gets written to the same path.
>> I've also tried with mode("overwrite") but it looks like each batch gets
>> written to the same file every time.
>> Any help would be greatly appreciated.
>>
>> Thanks,
>> Michael
>>
>> --
>>
>> def save_rdd(batch_time, rdd):
>>     sqlContext = SQLContext(rdd.context)
>>     df = sqlContext.createDataFrame(rdd, log_schema)
>>     df.write.mode("append").partitionBy("log_date").parquet(hdfs_dest_directory)
>>
>> def create_ssc(checkpoint_dir, spark_master):
>>     sc = SparkContext(spark_master, app_name)
>>     ssc = StreamingContext(sc, batch_interval)
>>     ssc.checkpoint(checkpoint_dir)
>>
>>     parsed = dstream.map(lambda line: log_parser(line))
>>     parsed.foreachRDD(lambda batch_time, rdd: save_rdd(batch_time, rdd))
>>
>>     return ssc
>>
>> ssc = StreamingContext.getOrCreate(checkpoint_dir, lambda:
>>     create_ssc(checkpoint_dir, spark_master))
>> ssc.start()
>> ssc.awaitTermination()
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>  
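To make Burak's suggestion concrete, a minimal Scala sketch of writing each batch to a directory keyed by the batch time and skipping directories that already exist (the stream name and paths are hypothetical):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// parsed is assumed to be an existing DStream[String]; the base path is hypothetical.
def writeIdempotently(parsed: DStream[String], baseDir: String = "hdfs:///data/logs"): Unit = {
  parsed.foreachRDD { (rdd, time: Time) =>
    val batchDir = s"$baseDir/batch=${time.milliseconds}"
    val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
    if (!fs.exists(new Path(batchDir))) { // already written by an earlier run, so skip
      rdd.saveAsTextFile(batchDir)
    }
  }
}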
 


Spark 1.4.2- java.io.FileNotFoundException: Job aborted due to stage failure

2015-11-24 Thread Sahil Sareen
I tried increasing spark.shuffle.io.maxRetries to 10, but it didn't help.

This is the exception that I am getting:

[MySparkApplication] WARN : Failed to execute SQL statement select *
from TableS s join TableC c on s.property = c.property from X YZ
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 4 in stage 5710.0 failed 4 times, most recent failure: Lost task
4.3 in stage 5710.0 (TID 341269,
ip-10-0-1-80.us-west-2.compute.internal):
java.io.FileNotFoundException:
/mnt/md0/var/lib/spark/spark-549f7d96-82da-4b8d-b9fe-7f6fe8238478/blockmgr-f44be41a-9036-4b93-8608-4a8b2fabbc06/0b/shuffle_3257_4_0.data
(Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:128)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:203)
at 
org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:104)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:757)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1276)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1266)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1266)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1460)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1421)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)




Re: Conversely, Hive is performing better than Spark-Sql

2015-11-24 Thread Sabarish Sasidharan
First of all, select * is not a useful SQL statement to evaluate. Very rarely
would a user require all ~3.6 million records for visual analysis.

Second, collect() forces movement of all data from the executors to the driver.
Instead, write it out to some other table or to HDFS.

Also, Spark is more beneficial when you have subsequent
queries/transformations on the same dataset: you cache the table, and then
subsequent operations will be faster.

Regards
Sab
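A minimal sketch of that pattern (the table and output path are hypothetical; hiveCtx is an assumed existing HiveContext):

// hiveCtx is an existing HiveContext; the table and output path are hypothetical.
val df = hiveCtx.sql("SELECT * FROM my_table WHERE event_type = 'click'")

df.cache()   // keep the data in memory for follow-up queries
df.count()   // materializes the cache

// Write results out instead of collect()-ing everything to the driver.
df.write.parquet("hdfs:///tmp/my_table_clicks")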

On Wed, Nov 25, 2015 at 12:30 PM, UMESH CHAUDHARY 
wrote:

> Hi,
> I am using Hive 1.1.0 and Spark 1.5.1 and creating hive context in
> spark-shell.
>
> Now, I am experiencing reversed performance by Spark-Sql over Hive.
> By default Hive gives result back in 27 seconds for plain select * query
> on 1 GB dataset containing 3623203 records, while spark-sql gives back in 2
> mins on collect operation.
>
> Cluster Config:
> Hive : 6 Node : 16 GB Memory, 4 cores each
> Spark : 4 Nodes : 16 GB Memory, 4 cores each
>
> My dataset has 45 partitions and spark-sql creates 82 jobs.
>
> I have tried all memory and garbage collection optimizations suggested on
> official website but failed to get better performance and its worth to
> mention that sometimes I get OOM error when I allocate executor memory less
> than 10G.
>
> Can somebody tell whats actually going on ?
>
>
>


-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*
+++


Is it relevant to use BinaryClassificationMetrics.aucROC / aucPR with LogisticRegressionModel ?

2015-11-24 Thread jmvllt
Hi guys,

This may be a stupid question, but I'm facing an issue here.

I found the class BinaryClassificationMetrics and I wanted to compute the
aucROC or aucPR of my model. 
The thing is that the predict method of a LogisticRegressionModel only
returns the predicted class, and not the probability of belonging to the
positive class. So I will get:

val metrics = new BinaryClassificationMetrics(predictionAndLabels)
val aucROC = metrics.areaUnderROC

with predictionAndLabels as a RDD[(predictedClass,label)]. 

Here, because the predicted class will always be 0 or 1, there is no way to
vary the threshold to get the aucROC, right? Or am I totally wrong?

So, is it relevant to use BinaryClassificationMetrics.areaUnderROC with
MLlib's classification models, which in many cases only return the predicted
class and not the probability?

Nevertheless, an easy solution for LogisticRegression would be to create my
own method that takes the model's weight vector as a parameter and computes a
predictionAndLabels RDD with the real class-membership probabilities. But is
this the only solution?

Thanks in advance.
Regards,
Jean.  




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-relevant-to-use-BinaryClassificationMetrics-aucROC-aucPR-with-LogisticRegressionModel-tp25465.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark 1.6 Build

2015-11-24 Thread Madabhattula Rajesh Kumar
Hi Prem,

Thank you for the details. I'm not able to build. I'm facing some issues.

Is there any repository link where I can download the 1.6 (preview) versions of
the spark-core_2.11 and spark-sql_2.11 jar files?

Regards,
Rajesh

On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure  wrote:

> you can refer..:
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
>
>
> On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm not able to build Spark 1.6 from source. Could you please share the
>> steps to build Spark 1.6
>>
>> Regards,
>> Rajesh
>>
>
>


Re: Spark 1.6 Build

2015-11-24 Thread Ted Yu
See:
http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview

On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi Prem,
>
> Thank you for the details. I'm not able to build. I'm facing some issues.
>
> Any repository link, where I can download (preview version of)  1.6
> version of spark-core_2.11 and spark-sql_2.11 jar files.
>
> Regards,
> Rajesh
>
> On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure  wrote:
>
>> you can refer..:
>> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
>>
>>
>> On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm not able to build Spark 1.6 from source. Could you please share the
>>> steps to build Spark 1.6
>>>
>>> Regards,
>>> Rajesh
>>>
>>
>>
>


Re: Is it relevant to use BinaryClassificationMetrics.aucROC / aucPR with LogisticRegressionModel ?

2015-11-24 Thread Sean Owen
Your reasoning is correct; you need probabilities (or at least some
score) out of the model and not just a 0/1 label in order for a ROC /
PR curve to have meaning.

But you just need to call clearThreshold() on the model to make it
return a probability.
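A minimal sketch of that flow, assuming an existing LogisticRegressionModel named model and a test RDD of LabeledPoints:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// model: LogisticRegressionModel and test: RDD[LabeledPoint] are assumed to exist.
model.clearThreshold() // predict() now returns the raw probability instead of 0/1

val scoreAndLabels: RDD[(Double, Double)] = test.map { point =>
  (model.predict(point.features), point.label)
}

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Area under ROC = ${metrics.areaUnderROC()}")
println(s"Area under PR  = ${metrics.areaUnderPR()}")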

On Tue, Nov 24, 2015 at 5:19 PM, jmvllt  wrote:
> Hi guys,
>
> This may be a stupid question. But I m facing an issue here.
>
> I found the class BinaryClassificationMetrics and I wanted to compute the
> aucROC or aucPR of my model.
> The thing is that the predict method of a LogisticRegressionModel only
> returns the predicted class, and not the probability of belonging to the
> positive class. So I will get:
>
> val metrics = new BinaryClassificationMetrics(predictionAndLabels)
> val aucROC = metrics.areaUnderROC
>
> with predictionAndLabels as a RDD[(predictedClass,label)].
>
> Here, because the predicted class will always be 0 or 1, there is no way to
> vary the threshold to get the aucROC, right  Or am I totally wrong ?
>
> So, is it relevant to use BinaryClassificationMetrics.areUnderROC with
> MLlib's classification models which in many cases only return the predicted
> class and not the probability ?
>
> Nevertheless, an easy solution for LogisticRegression would be to create my
> own method who takes the weights' vector of the model as a parameter and
> computes a predictionAndLabels with the real belonging probabilities. But is
> this the only solution 
>
> Thanks in advance.
> Regards,
> Jean.
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-relevant-to-use-BinaryClassificationMetrics-aucROC-aucPR-with-LogisticRegressionModel-tp25465.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>




Re: Spark 1.6 Build

2015-11-24 Thread Madabhattula Rajesh Kumar
Hi Ted,

I'm not able find "spark-core_2.11 and spark-sql_2.11 jar files" in above
link.

Regards,
Rajesh

On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu  wrote:

> See:
>
> http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview
>
> On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi Prem,
>>
>> Thank you for the details. I'm not able to build. I'm facing some issues.
>>
>> Any repository link, where I can download (preview version of)  1.6
>> version of spark-core_2.11 and spark-sql_2.11 jar files.
>>
>> Regards,
>> Rajesh
>>
>> On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure  wrote:
>>
>>> you can refer..:
>>> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
>>>
>>>
>>> On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
>>> mrajaf...@gmail.com> wrote:
>>>
 Hi,

 I'm not able to build Spark 1.6 from source. Could you please share the
 steps to build Spark 1.6

 Regards,
 Rajesh

>>>
>>>
>>
>


Re: Spark 1.6 Build

2015-11-24 Thread Stephen Boesch
Thanks for mentioning the build requirement.

But actually it is -Dscala-2.11 (i.e., -D for a Java property instead of -P for
a profile).

details:

We can see this in the pom.xml

   
  <profile>
    <id>scala-2.11</id>
    <activation>
      <property><name>scala-2.11</name></property>
    </activation>
    <properties>
      <scala.version>2.11.7</scala.version>
      <scala.binary.version>2.11</scala.binary.version>
    </properties>
  </profile>

So the scala-2.11 profile is activated by detecting the scala-2.11 system
property being set



2015-11-24 10:01 GMT-08:00 Ted Yu :

> See also:
>
> https://repository.apache.org/content/repositories/orgapachespark-1162/org/apache/spark/spark-core_2.11/v1.6.0-preview2/
>
> w.r.t. building locally, please specify -Pscala-2.11
>
> Cheers
>
> On Tue, Nov 24, 2015 at 9:58 AM, Stephen Boesch  wrote:
>
>> HI Madabhattula
>>  Scala 2.11 requires building from source.  Prebuilt binaries are
>> available only for scala 2.10
>>
>> From the src folder:
>>
>>dev/change-scala-version.sh 2.11
>>
>> Then build as you would normally either from mvn or sbt
>>
>> The above info *is* included in the spark docs but a little hard to find.
>>
>>
>>
>> 2015-11-24 9:50 GMT-08:00 Madabhattula Rajesh Kumar 
>> :
>>
>>> Hi Ted,
>>>
>>> I'm not able find "spark-core_2.11 and spark-sql_2.11 jar files" in
>>> above link.
>>>
>>> Regards,
>>> Rajesh
>>>
>>> On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu  wrote:
>>>
 See:

 http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview

 On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
 mrajaf...@gmail.com> wrote:

> Hi Prem,
>
> Thank you for the details. I'm not able to build. I'm facing some
> issues.
>
> Any repository link, where I can download (preview version of)  1.6
> version of spark-core_2.11 and spark-sql_2.11 jar files.
>
> Regards,
> Rajesh
>
> On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure 
> wrote:
>
>> you can refer..:
>> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
>>
>>
>> On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm not able to build Spark 1.6 from source. Could you please share
>>> the steps to build Spark 1.6
>>>
>>> Regards,
>>> Rajesh
>>>
>>
>>
>

>>>
>>
>


Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-24 Thread Andy Davidson
Hi Sabarish


Thanks for the suggestion. I did not know about wholeTextFiles().

By the way, your suggestion about repartitioning was spot on! My run
time for count() went from an elapsed time of 0:56:42.902407 to
0:00:03.215143 on a data set of about 34 MB (4720 records).

Andy
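For readers following the thread, a minimal sketch of the sample-filter-repartition-save pattern under discussion (sc is an assumed SparkContext; paths and the partition count are hypothetical):

// sc is an existing SparkContext; paths and the partition count are hypothetical.
val tweets = sc.textFile("hdfs:///data/tweets/*.json")

val sampled = tweets
  .sample(withReplacement = false, fraction = 0.01)
  .filter(_.nonEmpty)
  .repartition(30) // avoids one (often empty) output file per tiny input split

sampled.saveAsTextFile("hdfs:///data/tweets_sample")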

From:  Sabarish Sasidharan 
Date:  Monday, November 23, 2015 at 7:57 PM
To:  Andrew Davidson 
Cc:  Xiao Li , "user @spark" 
Subject:  Re: newbie : why are thousands of empty files being created on
HDFS?

> 
> Hi Andy
> 
> You can try sc.wholeTextFiles() instead of sc.textFile()
> 
> Regards
> Sab
> 
> On 24-Nov-2015 4:01 am, "Andy Davidson"  wrote:
>> Hi Xiao and Sabarish
>> 
>> Using the Stage tab on the UI, it turns out you can see how many
>> partitions there are. If I did nothing I would have 228155 partitions.
>> (This confirms what Sabarish said.) I tried coalesce(3); RDD.count()
>> fails. I thought, given I have 3 workers and 1/3 of the data would easily
>> fit into memory, this would be a good choice.
>>
>> If I use coalesce(30), count works. However, it still seems slow. It took
>> 2.42 min to read 4720 records. My total data set size is 34M.
>>
>> Any suggestions on how to choose the number of partitions?
>>
>>  ('spark.executor.memory', '2G') ('spark.driver.memory', '2G')
>>
>>
>> The data was originally collected using spark streaming. I noticed that the
>> number of default partitions == the number of files created on hdfs. I bet
>> each file is one spark streaming mini-batch. I suspect if I concatenate
>> these into a small number of files things will run much faster. I suspect
>> I would not need to call coalesce() and that coalesce() is taking a lot of
>> time. Any suggestions on how to choose the number of files?
>> 
>> Kind regards
>> 
>> Andy
>> 
>> 
>> From:  Xiao Li 
>> Date:  Monday, November 23, 2015 at 12:21 PM
>> To:  Andrew Davidson 
>> Cc:  Sabarish Sasidharan , "user @spark"
>> 
>> Subject:  Re: newbie : why are thousands of empty files being created on
>> HDFS?
>> 
>> 
>>> >In your case, maybe you can try to call the function coalesce?
>>> >Good luck,
>>> >
>>> >Xiao Li
>>> >
>>> >2015-11-23 12:15 GMT-08:00 Andy Davidson :
>>> >
>>> >Hi Sabarish
>>> >
>>> >I am but a simple padawan :-) I do not understand your answer. Why would
>>> >Spark be creating so many empty partitions? My real problem is my
>>> >application is very slow. I happened to notice thousands of empty files
>>> >being created. I thought this is a hint to why my app is slow.
>>> >
>>> >My program calls sample( 0.01).filter(not null).saveAsTextFile(). This
>>> >takes about 35 min, to scan 500,000 JSON strings and write 5000 to disk.
>>> >The total data writing in 38M.
>>> >
>>> >The data is read from HDFS. My understanding is Spark can not know in
>>> >advance how HDFS partitioned the data. Spark knows I have a master and 3
>>> >slaves machines. It knows how many works/executors are assigned to my
>>> >Job. I would expect spark would be smart enough not create more
>>> >partitions than I have worker machines?
>>> >
>>> >Also given I am not using any key/value operations like Join() or doing
>>> >multiple scans I would assume my app would not benefit from partitioning.
>>> >
>>> >
>>> >Kind regards
>>> >
>>> >Andy
>>> >
>>> >
>>> >From:  Sabarish Sasidharan 
>>> >Date:  Saturday, November 21, 2015 at 7:20 PM
>>> >To:  Andrew Davidson 
>>> >Cc:  "user @spark" 
>>> >Subject:  Re: newbie : why are thousands of empty files being created on
>>> >HDFS?
>>> >
>>> >
>>> >
>>> >Those are empty partitions. I don't see the number of partitions
>>> >specified in code. That then implies the default parallelism config is
>>> >being used and is set to a very high number, the sum of empty + non empty
>>> >files.
>>> >Regards
>>> >Sab
>>> >On 21-Nov-2015 11:59 pm, "Andy Davidson" 
>>> >wrote:
>>> >
>>> >I start working on a very simple ETL pipeline for a POC. It reads a in a
>>> >data set of tweets stored as JSON strings on in HDFS and randomly selects
>>> >1% of the observations and writes them to HDFS. It seems to run very
>>> >slowly. E.G. To write 4720 observations takes 1:06:46.577795. I
>>> >Also noticed that RDD saveAsTextFile is creating thousands of empty
>>> >files.
>>> >
>>> >I assume creating all these empty files must be slowing down the system.
>>> >Any idea why this is happening? Do I have write a script to periodical
>>> >remove empty files?
>>> >
>>> >
>>> >Kind regards
>>> >
>>> >Andy
>>> >
>>> >tweetStrings = sc.textFile(inputDataURL)
>>> >
>>> >
>>> >def removeEmptyLines(line) :
>>> >if line:
>>> >return True
>>> >else :
>>> 

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-24 Thread Andy Davidson
Hi Don

I went to a presentation given by Professor Ion Stoica. He mentioned that
Python was a little slower in general because of the type system. I do not
remember all of his comments. I think the context had to do with spark SQL
and data frames.

I wonder if the python issue is similar to the boxing/unboxing issue in
Java?

Andy


From:  Don Drake 
Date:  Monday, November 23, 2015 at 7:10 PM
To:  Andrew Davidson 
Cc:  Xiao Li , Sabarish Sasidharan
, "user @spark" 
Subject:  Re: newbie : why are thousands of empty files being created on
HDFS?

> I'm seeing similar slowness in saveAsTextFile(), but only in Python.
> 
> I'm sorting data in a dataframe, then transform it and get a RDD, and then
> coalesce(1).saveAsTextFile().
> 
> I converted the Python to Scala and the run-times were similar, except for the
> saveAsTextFile() stage.  The scala version was much faster.
> 
> When looking at the executor logs during that stage, I see the following when
> running the Scala code:
> 15/11/23 20:51:26 INFO storage.ShuffleBlockFetcherIterator: Getting 600
> non-empty blocks out of 600 blocks
> 
> 15/11/23 20:51:26 INFO storage.ShuffleBlockFetcherIterator: Started 184 remote
> fetches in 64 ms
> 
> 15/11/23 20:51:30 INFO sort.UnsafeExternalSorter: Thread 57 spilling sort data
> of 146.0 MB to disk (0  time so far)
> 
> 15/11/23 20:51:35 INFO sort.UnsafeExternalSorter: Thread 57 spilling sort data
> of 146.0 MB to disk (1  time so far)
> 
> 15/11/23 20:51:40 INFO sort.UnsafeExternalSorter: Thread 57 spilling sort data
> of 146.0 MB to disk (2  times so far)
> 
> 15/11/23 20:51:45 INFO sort.UnsafeExternalSorter: Thread 57 spilling sort data
> of 146.0 MB to disk (3  times so far)
> 
> 15/11/23 20:51:50 INFO sort.UnsafeExternalSorter: Thread 57 spilling sort data
> of 146.0 MB to disk (4  times so far)
> 
> 15/11/23 20:51:54 INFO sort.UnsafeExternalSorter: Thread 57 spilling sort data
> of 146.0 MB to disk (5  times so far)
> 
> 15/11/23 20:51:59 INFO sort.UnsafeExternalSorter: Thread 57 spilling sort data
> of 146.0 MB to disk (6  times so far)
> 
> 15/11/23 20:52:04 INFO sort.UnsafeExternalSorter: Thread 57 spilling sort data
> of 146.0 MB to disk (7  times so far)
> 
> 15/11/23 20:52:09 INFO sort.UnsafeExternalSorter: Thread 57 spilling sort data
> of 146.0 MB to disk (8  times so far)
> 
>  
> 
> When running the Python version during the saveAsTextFile() stage, I see:
> 
> 15/11/23 21:04:03 INFO python.PythonRunner: Times: total = 16190, boot = 5,
> init = 144, finish = 16041
> 
> 15/11/23 21:04:03 INFO storage.ShuffleBlockFetcherIterator: Getting 300
> non-empty blocks out of 300 blocks
> 
> 15/11/23 21:04:03 INFO storage.ShuffleBlockFetcherIterator: Started 231 remote
> fetches in 82 ms
> 
> 15/11/23 21:04:15 INFO python.PythonRunner: Times: total = 12180, boot = -415,
> init = 447, finish = 12148
> 
> 15/11/23 21:04:15 INFO storage.ShuffleBlockFetcherIterator: Getting 300
> non-empty blocks out of 300 blocks
> 
> 15/11/23 21:04:15 INFO storage.ShuffleBlockFetcherIterator: Started 231 remote
> fetches in 129 ms
> 
> 15/11/23 21:04:27 INFO python.PythonRunner: Times: total = 11450, boot = -372,
> init = 398, finish = 11424
> 
> 15/11/23 21:04:27 INFO storage.ShuffleBlockFetcherIterator: Getting 300
> non-empty blocks out of 300 blocks
> 
> 15/11/23 21:04:27 INFO storage.ShuffleBlockFetcherIterator: Started 231 remote
> fetches in 70 ms
> 
> 15/11/23 21:04:42 INFO python.PythonRunner: Times: total = 14480, boot = -378,
> init = 403, finish = 14455
> 
> 15/11/23 21:04:42 INFO storage.ShuffleBlockFetcherIterator: Getting 300
> non-empty blocks out of 300 blocks
> 
> 15/11/23 21:04:42 INFO storage.ShuffleBlockFetcherIterator: Started 231 remote
> fetches in 62 ms
> 
> 15/11/23 21:04:54 INFO python.PythonRunner: Times: total = 11868, boot = -366,
> init = 381, finish = 11853
> 
> 15/11/23 21:04:54 INFO storage.ShuffleBlockFetcherIterator: Getting 300
> non-empty blocks out of 300 blocks
> 
> 15/11/23 21:04:54 INFO storage.ShuffleBlockFetcherIterator: Started 231 remote
> fetches in 59 ms
> 
> 15/11/23 21:05:10 INFO python.PythonRunner: Times: total = 15375, boot = -392,
> init = 403, finish = 15364
> 
> 15/11/23 21:05:10 INFO storage.ShuffleBlockFetcherIterator: Getting 300
> non-empty blocks out of 300 blocks
> 
> 15/11/23 21:05:10 INFO storage.ShuffleBlockFetcherIterator: Started 231 remote
> fetches in 48 ms
> 
> 
> 
> The python version is approximately 10 times slower than the Scala version.
> Any ideas why?
> 
> 
> 
> -Don
> 
> 
> On Mon, Nov 23, 2015 at 4:31 PM, Andy Davidson 
> wrote:
>> Hi Xiao and Sabarish
>> 
>> Using the Stage tab on the UI. It turns out you can see how many
>> partitions there are. If I did nothing I would have 228155 partition.
>> (This confirms what Sabarish said). I tried coalesce(3). RDD.count()
>> fails. I though given I 

Re: Spark Kafka Direct Error

2015-11-24 Thread swetha kasireddy
Is it possible that the Kafka offset API is somehow returning the wrong
offsets? Each time, the job fails for different partitions with an
error similar to the one I get below.

Job aborted due to stage failure: Task 20 in stage 117.0 failed 4 times,
most recent failure: Lost task 20.3 in stage 117.0 (TID 2114, 10.227.64.52):
java.lang.AssertionError: assertion failed: Ran out of messages before
reaching ending offset 221572238 for topic hubble_stream partition 88 start
221563725. This should not happen, and indicates that messages may have been
lost

On Tue, Nov 24, 2015 at 6:31 AM, Cody Koeninger  wrote:

> No, the direct stream only communicates with Kafka brokers, not Zookeeper
> directly.  It asks the leader for each topicpartition what the highest
> available offsets are, using the Kafka offset api.
>
> On Mon, Nov 23, 2015 at 11:36 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Does Kafka direct query the offsets from the zookeeper directly? From
>> where does it get the offsets? There is data in those offsets, but somehow
>> Kafka Direct does not seem to pick it up. Other Consumers that use Zoo
>> Keeper Quorum of that Stream seems to be fine. Only Kafka Direct seems to
>> have issues. How does Kafka Direct know which offsets to query after
>> getting the initial batches from  "auto.offset.reset" -> "largest"?
>>
>> On Mon, Nov 23, 2015 at 11:01 AM, Cody Koeninger 
>> wrote:
>>
>>> No, that means that at the time the batch was scheduled, the kafka
>>> leader reported the ending offset was 221572238, but during processing,
>>> kafka stopped returning messages before reaching that ending offset.
>>>
>>> That probably means something got screwed up with Kafka - e.g. you lost
>>> a leader and lost messages in the process.
>>>
>>> On Mon, Nov 23, 2015 at 12:57 PM, swetha 
>>> wrote:
>>>
 Hi,

 I see the following error in my Spark Kafka Direct. Would this mean that
 Kafka Direct is not able to catch up with the messages and is failing?

 Job aborted due to stage failure: Task 20 in stage 117.0 failed 4 times,
 most recent failure: Lost task 20.3 in stage 117.0 (TID 2114,
 10.227.64.52):
 java.lang.AssertionError: assertion failed: Ran out of messages before
 reaching ending offset 221572238 for topic hubble_stream partition 88
 start
 221563725. This should not happen, and indicates that messages may have
 been
 lost

 Thanks,
 Swetha



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Kafka-Direct-Error-tp25454.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: Spark Kafka Direct Error

2015-11-24 Thread Cody Koeninger
Anything's possible, but that sounds pretty unlikely to me.
Are the partitions it's failing for all on the same leader?
Have there been any leader rebalances?
Do you have enough log retention?
If you log the offset for each message as it's processed, when do you see
the problem?
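
For what it's worth, a rough sketch of logging a per-message offset with the
direct stream (a sketch only: it assumes a (key, value) stream named
kafkaStream, contiguous offsets within each Kafka partition, i.e. no log
compaction, and that the RDD has not been shuffled or repartitioned):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]

kafkaStream.transform { rdd =>
  // Capture the ranges before any shuffle; Spark partition i maps to offsetRanges(i).
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  rdd.mapPartitionsWithIndex { (i, iter) =>
    val range = offsetRanges(i)
    iter.zipWithIndex.map { case ((key, value), n) =>
      val offset = range.fromOffset + n
      // Swap println for a real logger; the point is to see where the range breaks.
      println(s"processing topic=${range.topic} partition=${range.partition} offset=$offset")
      (key, value)
    }
  }.count()  // forces evaluation; real processing would go here instead
}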

On Tue, Nov 24, 2015 at 10:28 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:

> Is it possible that the kafka offset api is somehow returning the wrong
> offsets. Because each time the job fails for different partitions with an
> error similar to the error that I get below.
>
> Job aborted due to stage failure: Task 20 in stage 117.0 failed 4 times,
> most recent failure: Lost task 20.3 in stage 117.0 (TID 2114,
> 10.227.64.52):
> java.lang.AssertionError: assertion failed: Ran out of messages before
> reaching ending offset 221572238 for topic hubble_stream partition 88 start
> 221563725. This should not happen, and indicates that messages may have
> been
> lost
>
> On Tue, Nov 24, 2015 at 6:31 AM, Cody Koeninger 
> wrote:
>
>> No, the direct stream only communicates with Kafka brokers, not Zookeeper
>> directly.  It asks the leader for each topicpartition what the highest
>> available offsets are, using the Kafka offset api.
>>
>> On Mon, Nov 23, 2015 at 11:36 PM, swetha kasireddy <
>> swethakasire...@gmail.com> wrote:
>>
>>> Does Kafka direct query the offsets from the zookeeper directly? From
>>> where does it get the offsets? There is data in those offsets, but somehow
>>> Kafka Direct does not seem to pick it up. Other Consumers that use Zoo
>>> Keeper Quorum of that Stream seems to be fine. Only Kafka Direct seems to
>>> have issues. How does Kafka Direct know which offsets to query after
>>> getting the initial batches from  "auto.offset.reset" -> "largest"?
>>>
>>> On Mon, Nov 23, 2015 at 11:01 AM, Cody Koeninger 
>>> wrote:
>>>
 No, that means that at the time the batch was scheduled, the kafka
 leader reported the ending offset was 221572238, but during
 processing, kafka stopped returning messages before reaching that ending
 offset.

 That probably means something got screwed up with Kafka - e.g. you lost
 a leader and lost messages in the process.

 On Mon, Nov 23, 2015 at 12:57 PM, swetha 
 wrote:

> Hi,
>
> I see the following error in my Spark Kafka Direct. Would this mean
> that
> Kafka Direct is not able to catch up with the messages and is failing?
>
> Job aborted due to stage failure: Task 20 in stage 117.0 failed 4
> times,
> most recent failure: Lost task 20.3 in stage 117.0 (TID 2114,
> 10.227.64.52):
> java.lang.AssertionError: assertion failed: Ran out of messages before
> reaching ending offset 221572238 for topic hubble_stream partition 88
> start
> 221563725. This should not happen, and indicates that messages may
> have been
> lost
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Kafka-Direct-Error-tp25454.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


Re: spark-ec2 script to launch cluster running Spark 1.5.2 built with HIVE?

2015-11-24 Thread Jeff Schecter
There is no code path in the script /root/spark-ec2/spark/init.sh that can
actually fetch the Spark 1.5.2 build packaged with Hadoop 2.6. I think the
Hadoop 2.4 build includes Hive as well... but setting the Hadoop major
version to 2 won't actually get you there.

Sigh. The documentation is the source code, I guess.

Thanks for your help!

# Pre-packaged spark version:
else
  case "$SPARK_VERSION" in

  [ ... ]

  1.2.1)
    if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
      wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.1-bin-hadoop1.tgz
    elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
      wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.1-bin-cdh4.tgz
    else
      wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.1-bin-hadoop2.4.tgz
    fi
    ;;
  *)
    if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
      wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
    elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
      wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
    else
      wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
    fi
    if [ $? != 0 ]; then
      echo "ERROR: Unknown Spark version"
      return -1
    fi
  esac


On Mon, Nov 23, 2015 at 6:33 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Don't the Hadoop builds include Hive already? Like
> spark-1.5.2-bin-hadoop2.6.tgz?
>
> On Mon, Nov 23, 2015 at 7:49 PM Jeff Schecter  wrote:
>
>> Hi all,
>>
>> As far as I can tell, the bundled spark-ec2 script provides no way to
>> launch a cluster running Spark 1.5.2 pre-built with HIVE.
>>
>> That is to say, all of the pre-build versions of Spark 1.5.2 in the s3
>> bin spark-related-packages are missing HIVE.
>>
>> aws s3 ls s3://spark-related-packages/ | grep 1.5.2
>>
>>
>> Am I missing something here? I'd rather avoid resorting to whipping up
>> hacky patching scripts that might break with the next Spark point release
>> if at all possible.
>>
>


Re: Spark 1.6 Build

2015-11-24 Thread Stephen Boesch
Hi Madabhattula,

Scala 2.11 requires building from source. Prebuilt binaries are available
only for Scala 2.10.

From the src folder:

   dev/change-scala-version.sh 2.11

Then build as you would normally, either with mvn or sbt.

The above info *is* included in the Spark docs, but it is a little hard to find.



2015-11-24 9:50 GMT-08:00 Madabhattula Rajesh Kumar :

> Hi Ted,
>
> I'm not able find "spark-core_2.11 and spark-sql_2.11 jar files" in above
> link.
>
> Regards,
> Rajesh
>
> On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu  wrote:
>
>> See:
>>
>> http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview
>>
>> On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi Prem,
>>>
>>> Thank you for the details. I'm not able to build. I'm facing some
>>> issues.
>>>
>>> Any repository link, where I can download (preview version of)  1.6
>>> version of spark-core_2.11 and spark-sql_2.11 jar files.
>>>
>>> Regards,
>>> Rajesh
>>>
>>> On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure 
>>> wrote:
>>>
 you can refer..:
 https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn


 On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
 mrajaf...@gmail.com> wrote:

> Hi,
>
> I'm not able to build Spark 1.6 from source. Could you please share
> the steps to build Spark 1.16
>
> Regards,
> Rajesh
>


>>>
>>
>


Re: Spark 1.6 Build

2015-11-24 Thread Ted Yu
See also:
https://repository.apache.org/content/repositories/orgapachespark-1162/org/apache/spark/spark-core_2.11/v1.6.0-preview2/

w.r.t. building locally, please specify -Pscala-2.11

Cheers

On Tue, Nov 24, 2015 at 9:58 AM, Stephen Boesch  wrote:

> HI Madabhattula
>  Scala 2.11 requires building from source.  Prebuilt binaries are
> available only for scala 2.10
>
> From the src folder:
>
>dev/change-scala-version.sh 2.11
>
> Then build as you would normally either from mvn or sbt
>
> The above info *is* included in the spark docs but a little hard to find.
>
>
>
> 2015-11-24 9:50 GMT-08:00 Madabhattula Rajesh Kumar :
>
>> Hi Ted,
>>
>> I'm not able find "spark-core_2.11 and spark-sql_2.11 jar files" in above
>> link.
>>
>> Regards,
>> Rajesh
>>
>> On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu  wrote:
>>
>>> See:
>>>
>>> http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview
>>>
>>> On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
>>> mrajaf...@gmail.com> wrote:
>>>
 Hi Prem,

 Thank you for the details. I'm not able to build. I'm facing some
 issues.

 Any repository link, where I can download (preview version of)  1.6
 version of spark-core_2.11 and spark-sql_2.11 jar files.

 Regards,
 Rajesh

 On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure 
 wrote:

> you can refer..:
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
>
>
> On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm not able to build Spark 1.6 from source. Could you please share
>> the steps to build Spark 1.16
>>
>> Regards,
>> Rajesh
>>
>
>

>>>
>>
>


Re: Spark Kafka Direct Error

2015-11-24 Thread swetha kasireddy
I see the assertion error when I compare the offset ranges as shown below.
How do I log the offset for each message?


import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Declared on the driver and populated inside transform, before any shuffle.
var offsetRanges = Array.empty[OffsetRange]

kafkaStream.transform { rdd =>
  // Get the offset ranges in the RDD
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    LOGGER.info(s"Queried offsets: ${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }

  val collected = rdd.mapPartitionsWithIndex { (i, iter) =>
    // For each partition, compare the size of its offset range
    // with the number of items actually present in the partition
    val off = offsetRanges(i)
    val all = iter.toSeq
    val partSize = all.size
    val rangeSize = off.untilOffset - off.fromOffset
    Iterator((partSize, rangeSize))
  }.collect()

  // Verify that the number of elements in each partition
  // matches the corresponding offset range
  collected.foreach { case (partSize, rangeSize) =>
    assert(partSize == rangeSize, "offset ranges are wrong")
  }
}


On Tue, Nov 24, 2015 at 8:33 AM, Cody Koeninger  wrote:

> Anything's possible, but that sounds pretty unlikely to me.
> Are the partitions it's failing for all on the same leader?
> Have there been any leader rebalances?
> Do you have enough log retention?
> If you log the offset for each message as it's processed, when do you see
> the problem?
>
> On Tue, Nov 24, 2015 at 10:28 AM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Is it possible that the kafka offset api is somehow returning the wrong
>> offsets. Because each time the job fails for different partitions with an
>> error similar to the error that I get below.
>>
>> Job aborted due to stage failure: Task 20 in stage 117.0 failed 4 times,
>> most recent failure: Lost task 20.3 in stage 117.0 (TID 2114,
>> 10.227.64.52):
>> java.lang.AssertionError: assertion failed: Ran out of messages before
>> reaching ending offset 221572238 for topic hubble_stream partition 88
>> start
>> 221563725. This should not happen, and indicates that messages may have
>> been
>> lost
>>
>> On Tue, Nov 24, 2015 at 6:31 AM, Cody Koeninger 
>> wrote:
>>
>>> No, the direct stream only communicates with Kafka brokers, not
>>> Zookeeper directly.  It asks the leader for each topicpartition what the
>>> highest available offsets are, using the Kafka offset api.
>>>
>>> On Mon, Nov 23, 2015 at 11:36 PM, swetha kasireddy <
>>> swethakasire...@gmail.com> wrote:
>>>
 Does Kafka direct query the offsets from the zookeeper directly? From
 where does it get the offsets? There is data in those offsets, but somehow
 Kafka Direct does not seem to pick it up. Other Consumers that use Zoo
 Keeper Quorum of that Stream seems to be fine. Only Kafka Direct seems to
 have issues. How does Kafka Direct know which offsets to query after
 getting the initial batches from  "auto.offset.reset" -> "largest"?

 On Mon, Nov 23, 2015 at 11:01 AM, Cody Koeninger 
 wrote:

> No, that means that at the time the batch was scheduled, the kafka
> leader reported the ending offset was 221572238, but during
> processing, kafka stopped returning messages before reaching that ending
> offset.
>
> That probably means something got screwed up with Kafka - e.g. you
> lost a leader and lost messages in the process.
>
> On Mon, Nov 23, 2015 at 12:57 PM, swetha 
> wrote:
>
>> Hi,
>>
>> I see the following error in my Spark Kafka Direct. Would this mean
>> that
>> Kafka Direct is not able to catch up with the messages and is failing?
>>
>> Job aborted due to stage failure: Task 20 in stage 117.0 failed 4
>> times,
>> most recent failure: Lost task 20.3 in stage 117.0 (TID 2114,
>> 10.227.64.52):
>> java.lang.AssertionError: assertion failed: Ran out of messages before
>> reaching ending offset 221572238 for topic hubble_stream partition 88
>> start
>> 221563725. This should not happen, and indicates that messages may
>> have been
>> lost
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Kafka-Direct-Error-tp25454.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>


Re: Spark SQL Save CSV with JSON Column

2015-11-24 Thread Davies Liu
I think you could have a Python UDF to turn the properties struct into a JSON string:

import simplejson
from pyspark.sql.functions import udf

def to_json(row):
    # Convert the nested Row into a plain dict, then serialize it as JSON
    return simplejson.dumps(row.asDict(recursive=True))

to_json_udf = udf(to_json)  # default return type is StringType

df.select("col_1", "col_2", to_json_udf(df.properties)) \
  .write.format("com.databricks.spark.csv").save()  # pass the output path to save()


On Tue, Nov 24, 2015 at 7:36 AM,   wrote:
> I am generating a set of tables in pyspark SQL from a JSON source dataset. I 
> am writing those tables to disk as CSVs using 
> df.write.format(com.databricks.spark.csv).save(…). I have a schema like:
>
> root
>  |-- col_1: string (nullable = true)
>  |-- col_2: string (nullable = true)
>  |-- col_3: timestamp (nullable = true)
> ...
>  |-- properties: struct (nullable = true)
>  ||-- prop_1: string (nullable = true)
>  ||-- prop_2: string (nullable = true)
>  ||-- prop3: string (nullable = true)
> …
>
> Currently I am dropping the properties section when I write to CSV, but I 
> would like to write it as a JSON column. How can I go about this? My final 
> result would be a CSV with col_1, col_2, col_3 as usual but the ‘properties’ 
> column would contain formatted JSON objects.
>
> Thanks

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation

2015-11-24 Thread 谢廷稳
OK, yarn.scheduler.maximum-allocation-mb is 16384.

I have run it again; the command used to run it is:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster --driver-memory 4g --executor-memory 8g \
  lib/spark-examples*.jar 200



>
>
> 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal handlers for 
> [TERM, HUP, INT]
>
> 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
> appattempt_1447834709734_0120_01
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
> hdfs-test
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(hdfs-test); 
> users with modify permissions: Set(hdfs-test)
>
> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user application 
> in a separate Thread
>
> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization
>
> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization ...
> 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
> hdfs-test
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(hdfs-test); 
> users with modify permissions: Set(hdfs-test)
> 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 15/11/24 16:15:59 INFO Remoting: Starting remoting
>
> 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriver@X.X.X.X
> ]
>
> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'sparkDriver' 
> on port 61904.
> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster
>
> 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory at 
> /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7
>
> 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with capacity 
> 1966.1 MB
>
> 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory is 
> /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
> 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 15/11/24 16:15:59 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:14692
>
> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'HTTP file 
> server' on port 14692.
> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
>
> 15/11/24 16:15:59 INFO ui.JettyUtils: Adding filter: 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 15/11/24 16:15:59 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:15948
> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'SparkUI' on 
> port 15948.
>
> 15/11/24 16:15:59 INFO ui.SparkUI: Started SparkUI at X.X.X.X
>
> 15/11/24 16:15:59 INFO cluster.YarnClusterScheduler: Created 
> YarnClusterScheduler
>
> 15/11/24 16:15:59 WARN metrics.MetricsSystem: Using default name DAGScheduler 
> for source because
> spark.app.id is not set.
>
> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41830.
> 15/11/24 16:15:59 INFO netty.NettyBlockTransferService: Server created on 
> 41830
>
> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
>
> 15/11/24 16:15:59 INFO storage.BlockManagerMasterEndpoint: Registering block 
> manager X.X.X.X:41830 with 1966.1 MB RAM, BlockManagerId(driver, 10.12.30.2, 
> 41830)
>
> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Registered BlockManager
> 15/11/24 16:16:00 INFO scheduler.EventLoggingListener: Logging events to 
> hdfs:///tmp/latest-spark-events/application_1447834709734_0120_1
>
> 15/11/24 16:16:00 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> ApplicationMaster registered as 
> AkkaRpcEndpointRef(Actor[akka://sparkDriver/user/YarnAM#293602859])
>
> 15/11/24 16:16:00 INFO client.RMProxy: Connecting to ResourceManager at 
> X.X.X.X
>
> 15/11/24 16:16:00 INFO yarn.YarnRMClient: Registering the ApplicationMaster
>
> 15/11/24 16:16:00 INFO yarn.ApplicationMaster: Started progress reporter 
> thread with (heartbeat : 3000, initial allocation : 200) intervals
>
> 15/11/24 16:16:29 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend 
> is 

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation

2015-11-24 Thread Sabarish Sasidharan
If YARN has only 50 cores, then it can support at most 49 executors plus 1
driver application master.

Regards
Sab
On 24-Nov-2015 1:58 pm, "谢廷稳"  wrote:

> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>
> I have ran it again, the command to run it is:
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> yarn-cluster -
> -driver-memory 4g  --executor-memory 8g lib/spark-examples*.jar 200
>
>
>
>>
>>
>> 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal handlers 
>> for [TERM, HUP, INT]
>>
>> 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
>> appattempt_1447834709734_0120_01
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: 
>> hdfs-test
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
>> hdfs-test
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: 
>> authentication disabled; ui acls disabled; users with view permissions: 
>> Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>
>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user application 
>> in a separate Thread
>>
>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
>> initialization
>>
>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
>> initialization ...
>> 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: 
>> hdfs-test
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
>> hdfs-test
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: 
>> authentication disabled; ui acls disabled; users with view permissions: 
>> Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>> 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
>> 15/11/24 16:15:59 INFO Remoting: Starting remoting
>>
>> 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on addresses 
>> :[akka.tcp://sparkDriver@X.X.X.X
>> ]
>>
>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 
>> 'sparkDriver' on port 61904.
>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster
>>
>> 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory at 
>> /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7
>>
>> 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with 
>> capacity 1966.1 MB
>>
>> 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory is 
>> /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
>> 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started
>> SocketConnector@0.0.0.0:14692
>>
>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'HTTP file 
>> server' on port 14692.
>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
>>
>> 15/11/24 16:15:59 INFO ui.JettyUtils: Adding filter: 
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started
>> SelectChannelConnector@0.0.0.0:15948
>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'SparkUI' on 
>> port 15948.
>>
>> 15/11/24 16:15:59 INFO ui.SparkUI: Started SparkUI at X.X.X.X
>>
>> 15/11/24 16:15:59 INFO cluster.YarnClusterScheduler: Created 
>> YarnClusterScheduler
>>
>> 15/11/24 16:15:59 WARN metrics.MetricsSystem: Using default name 
>> DAGScheduler for source because
>> spark.app.id is not set.
>>
>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 
>> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41830.
>> 15/11/24 16:15:59 INFO netty.NettyBlockTransferService: Server created on 
>> 41830
>>
>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Trying to register 
>> BlockManager
>>
>> 15/11/24 16:15:59 INFO storage.BlockManagerMasterEndpoint: Registering block 
>> manager X.X.X.X:41830 with 1966.1 MB RAM, BlockManagerId(driver, 10.12.30.2, 
>> 41830)
>>
>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Registered BlockManager
>> 15/11/24 16:16:00 INFO scheduler.EventLoggingListener: Logging events to 
>> hdfs:///tmp/latest-spark-events/application_1447834709734_0120_1
>>
>> 15/11/24 16:16:00 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
>> ApplicationMaster registered as 
>> AkkaRpcEndpointRef(Actor[akka://sparkDriver/user/YarnAM#293602859])
>>
>> 15/11/24 16:16:00 INFO client.RMProxy: Connecting to ResourceManager at 
>> X.X.X.X
>>
>>
>> 15/11/24 

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation

2015-11-24 Thread Saisai Shao
Did you set the configuration "spark.dynamicAllocation.initialExecutors"?

You can set spark.dynamicAllocation.initialExecutors to 50 and try again.

I guess you might be hitting this issue since you're running 1.5.0:
https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot
explain why 49 executors worked.
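
If it helps, a minimal sketch of setting these properties programmatically
(the same keys can equally go into spark-defaults.conf or be passed with
--conf to spark-submit; the app name and the min/max values below are only
placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DynamicAllocationTest")            // illustrative name
  .set("spark.shuffle.service.enabled", "true")   // external shuffle service is required on YARN
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "50")
  .set("spark.dynamicAllocation.maxExecutors", "100")
  .set("spark.dynamicAllocation.initialExecutors", "50")
val sc = new SparkContext(conf)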

On Tue, Nov 24, 2015 at 4:42 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> If yarn has only 50 cores then it can support max 49 executors plus 1
> driver application master.
>
> Regards
> Sab
> On 24-Nov-2015 1:58 pm, "谢廷稳"  wrote:
>
>> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>>
>> I have ran it again, the command to run it is:
>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>> yarn-cluster -
>> -driver-memory 4g  --executor-memory 8g lib/spark-examples*.jar 200
>>
>>
>>
>>>
>>>
>>> 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal handlers 
>>> for [TERM, HUP, INT]
>>>
>>> 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
>>> appattempt_1447834709734_0120_01
>>>
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: 
>>> hdfs-test
>>>
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
>>> hdfs-test
>>>
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: 
>>> authentication disabled; ui acls disabled; users with view permissions: 
>>> Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>>
>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user 
>>> application in a separate Thread
>>>
>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
>>> initialization
>>>
>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
>>> initialization ...
>>> 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0
>>>
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: 
>>> hdfs-test
>>>
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
>>> hdfs-test
>>>
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: 
>>> authentication disabled; ui acls disabled; users with view permissions: 
>>> Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>> 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>> 15/11/24 16:15:59 INFO Remoting: Starting remoting
>>>
>>> 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on addresses 
>>> :[akka.tcp://sparkDriver@X.X.X.X
>>> ]
>>>
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 
>>> 'sparkDriver' on port 61904.
>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster
>>>
>>> 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory at 
>>> /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7
>>>
>>> 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with 
>>> capacity 1966.1 MB
>>>
>>> 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory is 
>>> /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
>>> 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started
>>> SocketConnector@0.0.0.0:14692
>>>
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'HTTP file 
>>> server' on port 14692.
>>>
>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
>>>
>>> 15/11/24 16:15:59 INFO ui.JettyUtils: Adding filter: 
>>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started
>>> SelectChannelConnector@0.0.0.0:15948
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'SparkUI' 
>>> on port 15948.
>>>
>>> 15/11/24 16:15:59 INFO ui.SparkUI: Started SparkUI at X.X.X.X
>>>
>>> 15/11/24 16:15:59 INFO cluster.YarnClusterScheduler: Created 
>>> YarnClusterScheduler
>>>
>>> 15/11/24 16:15:59 WARN metrics.MetricsSystem: Using default name 
>>> DAGScheduler for source because
>>> spark.app.id is not set.
>>>
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 
>>> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41830.
>>> 15/11/24 16:15:59 INFO netty.NettyBlockTransferService: Server created on 
>>> 41830
>>>
>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Trying to register 
>>> BlockManager
>>>
>>> 15/11/24 16:15:59 INFO storage.BlockManagerMasterEndpoint: Registering 
>>> block manager X.X.X.X:41830 with 1966.1 MB RAM, BlockManagerId(driver, 

Getting ParquetDecodingException when I am running my spark application from spark-submit

2015-11-24 Thread Kapil Raaj
The relevant error lines are:

Caused by: parquet.io.ParquetDecodingException: Can't read value in
column [roll_key] BINARY at value 19600 out of 4814, 19600 out of
19600 in currentPage. repetition level: 0, definition level: 1
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 131 in stage 0.0 failed 4 times, most recent failure:
Lost task 131.3 in stage 0.0 (TID 198, dap.changed.com):
parquet.io.ParquetDecodingException: Can not read value at 19600 in
block 0 in file
hdfs://dap.changed.com:8020/data/part-r-00177-51654832-053d-4074-b906-b97ac173807a.gz.parquet

But when I am reading it using the Spark client, I am not getting any error.

sqlContext.read.load("/data/").select("roll_key")

Kindly let me know how to debug it.

-- 
-Kapil Rajak 


Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation

2015-11-24 Thread 谢廷稳
@Sab Thank you for your reply, but the cluster has 6 nodes with 300 cores in
total, and the Spark application did not request the resources from YARN.

@SaiSai I ran it successfully with "spark.dynamicAllocation.initialExecutors"
set to 50, but
http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
says that "spark.dynamicAllocation.initialExecutors" defaults to
"spark.dynamicAllocation.minExecutors". So I think something is wrong,
isn't it?

Thanks.



2015-11-24 16:47 GMT+08:00 Saisai Shao :

> Did you set this configuration "spark.dynamicAllocation.initialExecutors"
> ?
>
> You can set spark.dynamicAllocation.initialExecutors 50 to take try again.
>
> I guess you might be hitting this issue since you're running 1.5.0,
> https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot
> explain why 49 executors can be worked.
>
> On Tue, Nov 24, 2015 at 4:42 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> If yarn has only 50 cores then it can support max 49 executors plus 1
>> driver application master.
>>
>> Regards
>> Sab
>> On 24-Nov-2015 1:58 pm, "谢廷稳"  wrote:
>>
>>> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>>>
>>> I have ran it again, the command to run it is:
>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>>> yarn-cluster -
>>> -driver-memory 4g  --executor-memory 8g lib/spark-examples*.jar 200
>>>
>>>
>>>


 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal handlers 
 for [TERM, HUP, INT]

 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
 appattempt_1447834709734_0120_01

 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: 
 hdfs-test

 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
 hdfs-test

 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: 
 authentication disabled; ui acls disabled; users with view permissions: 
 Set(hdfs-test); users with modify permissions: Set(hdfs-test)

 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user 
 application in a separate Thread

 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
 initialization

 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
 initialization ...
 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0

 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: 
 hdfs-test

 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
 hdfs-test

 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: 
 authentication disabled; ui acls disabled; users with view permissions: 
 Set(hdfs-test); users with modify permissions: Set(hdfs-test)
 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/11/24 16:15:59 INFO Remoting: Starting remoting

 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@X.X.X.X
 ]

 15/11/24 16:15:59 INFO util.Utils: Successfully started service 
 'sparkDriver' on port 61904.
 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster

 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory 
 at 
 /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7

 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with 
 capacity 1966.1 MB

 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory is 
 /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/11/24 16:15:59 INFO server.AbstractConnector: Started
 SocketConnector@0.0.0.0:14692

 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'HTTP file 
 server' on port 14692.

 15/11/24 16:15:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator

 15/11/24 16:15:59 INFO ui.JettyUtils: Adding filter: 
 org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/11/24 16:15:59 INFO server.AbstractConnector: Started
 SelectChannelConnector@0.0.0.0:15948
 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'SparkUI' 
 on port 15948.

 15/11/24 16:15:59 INFO ui.SparkUI: Started SparkUI at X.X.X.X

 15/11/24 16:15:59 INFO cluster.YarnClusterScheduler: 

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation

2015-11-24 Thread Saisai Shao
The documentation is right. A bug introduced in
https://issues.apache.org/jira/browse/SPARK-9092 makes this configuration
fail to work.

It is fixed in https://issues.apache.org/jira/browse/SPARK-10790; you could
upgrade to a newer version of Spark.

On Tue, Nov 24, 2015 at 5:12 PM, 谢廷稳  wrote:

> @Sab Thank you for your reply, but the cluster has 6 nodes which contain
> 300 cores and Spark application did not request resource from YARN.
>
> @SaiSai I have ran it successful with "
> spark.dynamicAllocation.initialExecutors"  equals 50, but in
> http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
> it says that
>
> "spark.dynamicAllocation.initialExecutors" equals "
> spark.dynamicAllocation.minExecutors". So, I think something was wrong,
> did it?
>
> Thanks.
>
>
>
> 2015-11-24 16:47 GMT+08:00 Saisai Shao :
>
>> Did you set this configuration "spark.dynamicAllocation.initialExecutors"
>> ?
>>
>> You can set spark.dynamicAllocation.initialExecutors 50 to take try
>> again.
>>
>> I guess you might be hitting this issue since you're running 1.5.0,
>> https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot
>> explain why 49 executors can be worked.
>>
>> On Tue, Nov 24, 2015 at 4:42 PM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>>> If yarn has only 50 cores then it can support max 49 executors plus 1
>>> driver application master.
>>>
>>> Regards
>>> Sab
>>> On 24-Nov-2015 1:58 pm, "谢廷稳"  wrote:
>>>
 OK, yarn.scheduler.maximum-allocation-mb is 16384.

 I have ran it again, the command to run it is:
 ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
 yarn-cluster -
 -driver-memory 4g  --executor-memory 8g lib/spark-examples*.jar 200



>
>
> 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal handlers 
> for [TERM, HUP, INT]
>
> 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
> appattempt_1447834709734_0120_01
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: 
> hdfs-test
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
> hdfs-test
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: 
> authentication disabled; ui acls disabled; users with view permissions: 
> Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>
> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user 
> application in a separate Thread
>
> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization
>
> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization ...
> 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: 
> hdfs-test
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
> hdfs-test
>
> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: 
> authentication disabled; ui acls disabled; users with view permissions: 
> Set(hdfs-test); users with modify permissions: Set(hdfs-test)
> 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 15/11/24 16:15:59 INFO Remoting: Starting remoting
>
> 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriver@X.X.X.X
> ]
>
> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 
> 'sparkDriver' on port 61904.
> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster
>
> 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory 
> at 
> /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7
>
> 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with 
> capacity 1966.1 MB
>
> 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory 
> is 
> /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
> 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 15/11/24 16:15:59 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:14692
>
> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'HTTP 
> file server' on port 14692.
>
> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
>
> 15/11/24 16:15:59 INFO ui.JettyUtils: Adding filter: 
> 

Re: Please add us to the Powered by Spark page

2015-11-24 Thread Sean Owen
Not sure who generally handles that, but I just made the edit.

On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal  wrote:
> Sorry to be a nag, I realize folks with edit rights on the Powered by Spark
> page are very busy people, but its been 10 days since my original request,
> was wondering if maybe it just fell through the cracks. If I should submit
> via some other channel that will make sure it is looked at (or better yet, a
> self service option), please let me know and I will do so.
>
> Here is the information again.
>
> Organization Name: Elsevier Labs
> URL: http://labs.elsevier.com
> Spark components: Spark Core, Spark SQL, MLLib, GraphX.
> Use Case: Building Machine Reading Pipeline, Knowledge Graphs, Content as a
> Service, Content and Event Analytics, Content/Event based Predictive Models
> and Big Data Processing. We use Scala and Python over Databricks Notebooks
> for most of our work.
>
> Thanks very much,
> Sujit
>
> On Fri, Nov 13, 2015 at 9:21 AM, Sujit Pal  wrote:
>>
>> Hello,
>>
>> We have been using Spark at Elsevier Labs for a while now. Would love to
>> be added to the “Powered By Spark” page.
>>
>> Organization Name: Elsevier Labs
>> URL: http://labs.elsevier.com
>> Spark components: Spark Core, Spark SQL, MLLib, GraphX.
>> Use Case: Building Machine Reading Pipeline, Knowledge Graphs, Content as
>> a Service, Content and Event Analytics, Content/Event based Predictive
>> Models and Big Data Processing. We use Scala and Python over Databricks
>> Notebooks for most of our work.
>>
>> Thanks very much,
>>
>> Sujit Pal
>> Technical Research Director
>> Elsevier Labs
>> sujit@elsevier.com
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[streaming] KafkaUtils.createDirectStream - how to start streaming from checkpoints?

2015-11-24 Thread ponkin
Hi,

When I create a stream with KafkaUtils.createDirectStream I can explicitly define 
the position "largest" or "smallest" - where to read the topic from.
What if I have previous checkpoints (in HDFS, for example) with offsets, and I 
want to start reading from the last checkpoint?
In the source code of KafkaUtils.createDirectStream I see the following:

 val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
 
 (for {
   topicPartitions <- kc.getPartitions(topics).right
   leaderOffsets <- (if (reset == Some("smallest")) {
 kc.getEarliestLeaderOffsets(topicPartitions)
   } else {
 kc.getLatestLeaderOffsets(topicPartitions)
   }).right
...

So it turns out that I have no option to start reading from checkpoints (and 
their offsets)? Am I right?
How can I force Spark to start reading from saved offsets (in checkpoints)? Is 
it possible at all, or do I need to store the offsets in an external datastore?
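
For reference, the overload that takes explicit starting offsets looks
roughly like this (a sketch only: it assumes an existing StreamingContext
ssc and the same kafkaParams as above; the topic name and offset are
placeholders, and String decoders plus a (key, message) handler are just one
possible choice):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Offsets loaded from wherever they were saved (ZooKeeper, a database, ...),
// keyed by topic and partition.
val fromOffsets: Map[TopicAndPartition, Long] =
  Map(TopicAndPartition("my_topic", 0) -> 12345L)

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))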

Alexey Ponkin




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-KafkaUtils-createDirectStream-how-to-start-streming-from-checkpoints-tp25461.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation

2015-11-24 Thread 谢廷稳
Thank you very much; after changing to a newer version, it did work well!

2015-11-24 17:15 GMT+08:00 Saisai Shao :

> The document is right. Because of a bug introduce in
> https://issues.apache.org/jira/browse/SPARK-9092 which makes this
> configuration fail to work.
>
> It is fixed in https://issues.apache.org/jira/browse/SPARK-10790, you
> could change to newer version of Spark.
>
> On Tue, Nov 24, 2015 at 5:12 PM, 谢廷稳  wrote:
>
>> @Sab Thank you for your reply, but the cluster has 6 nodes which contain
>> 300 cores and Spark application did not request resource from YARN.
>>
>> @SaiSai I have ran it successful with "
>> spark.dynamicAllocation.initialExecutors"  equals 50, but in
>> http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
>> it says that
>>
>> "spark.dynamicAllocation.initialExecutors" equals "
>> spark.dynamicAllocation.minExecutors". So, I think something was wrong,
>> did it?
>>
>> Thanks.
>>
>>
>>
>> 2015-11-24 16:47 GMT+08:00 Saisai Shao :
>>
>>> Did you set this configuration "spark.dynamicAllocation.initialExecutors"
>>> ?
>>>
>>> You can set spark.dynamicAllocation.initialExecutors 50 to take try
>>> again.
>>>
>>> I guess you might be hitting this issue since you're running 1.5.0,
>>> https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot
>>> explain why 49 executors can be worked.
>>>
>>> On Tue, Nov 24, 2015 at 4:42 PM, Sabarish Sasidharan <
>>> sabarish.sasidha...@manthan.com> wrote:
>>>
 If yarn has only 50 cores then it can support max 49 executors plus 1
 driver application master.

 Regards
 Sab
 On 24-Nov-2015 1:58 pm, "谢廷稳"  wrote:

> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>
> I have ran it again, the command to run it is:
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> yarn-cluster -
> -driver-memory 4g  --executor-memory 8g lib/spark-examples*.jar 200
>
>
>
>>
>>
>> 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal 
>> handlers for [TERM, HUP, INT]
>>
>> 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
>> appattempt_1447834709734_0120_01
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: 
>> hdfs-test
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
>> hdfs-test
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: 
>> authentication disabled; ui acls disabled; users with view permissions: 
>> Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>
>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user 
>> application in a separate Thread
>>
>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
>> initialization
>>
>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context 
>> initialization ...
>> 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: 
>> hdfs-test
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: 
>> hdfs-test
>>
>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: 
>> authentication disabled; ui acls disabled; users with view permissions: 
>> Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>> 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
>> 15/11/24 16:15:59 INFO Remoting: Starting remoting
>>
>> 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on 
>> addresses :[akka.tcp://sparkDriver@X.X.X.X
>> ]
>>
>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 
>> 'sparkDriver' on port 61904.
>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster
>>
>> 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory 
>> at 
>> /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7
>>
>> 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with 
>> capacity 1966.1 MB
>>
>> 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory 
>> is 
>> /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
>> 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started
>> SocketConnector@0.0.0.0:14692
>>
>> 

Re: Spark Expand Cluster

2015-11-24 Thread Dinesh Ranganathan
Thanks Christopher, I will try that.

Dan

On 20 November 2015 at 21:41, Bozeman, Christopher 
wrote:

> Dan,
>
>
>
> Even though you may be adding more nodes to the cluster, the Spark
> application has to be requesting additional executors in order to thus use
> the added resources.  Or the Spark application can be using Dynamic
> Resource Allocation (
> http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation)
> [which may use the resources based on application need and availability].
> For example, in EMR release 4.x (
> http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html#spark-dynamic-allocation)
> you can request Spark Dynamic Resource Allocation as the default
> configuration at cluster creation.
>
>
>
> Best regards,
>
> Christopher
>
>
>
>
>
> *From:* Dinesh Ranganathan [mailto:dineshranganat...@gmail.com]
> *Sent:* Monday, November 16, 2015 4:57 AM
> *To:* Sabarish Sasidharan
> *Cc:* user
> *Subject:* Re: Spark Expand Cluster
>
>
>
> Hi Sab,
>
>
>
> I did not specify number of executors when I submitted the spark
> application. I was in the impression spark looks at the cluster and figures
> out the number of executors it can use based on the cluster size
> automatically, is this what you call dynamic allocation?. I am spark
> newbie, so apologies if I am missing the obvious. While the application was
> running I added more core nodes by resizing my EMR instance and I can see
> the new nodes on the resource manager but my running application did not
> pick up those machines I've just added.   Let me know If i am missing a
> step here.
>
>
>
> Thanks,
>
> Dan
>
>
>
> On 16 November 2015 at 12:38, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
> Spark will use the number of executors you specify in spark-submit. Are
> you saying that Spark is not able to use more executors after you modify it
> in spark-submit? Are you using dynamic allocation?
>
>
>
> Regards
>
> Sab
>
>
>
> On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan <
> dineshranganat...@gmail.com> wrote:
>
> I have my Spark application deployed on AWS EMR on yarn cluster mode.
> When I
> increase the capacity of my cluster by adding more Core instances on AWS, I
> don't see Spark picking up the new instances dynamically. Is there anything
> I can do to tell Spark to pick up the newly added boxes??
>
> Dan
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Expand-Cluster-tp25393.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>
> --
>
>
>
> Architect - Big Data
>
> Ph: +91 99805 99458
>
>
>
> Manthan Systems | *Company of the year - Analytics (2014 Frost and
> Sullivan India ICT)*
>
> +++
>
>
>
>
>
> --
>
> Dinesh Ranganathan
>



-- 
Dinesh Ranganathan


indexedrdd and radix tree: how to search indexedRDD using all prefixes?

2015-11-24 Thread Mina
Hello, I have a question about the radix tree (PART) implementation in Spark
IndexedRDD.
I explored the source code and found that the radix tree used in IndexedRDD
only returns exact matches. That seems like a restricted use; for example, I
want to find child nodes by prefix from an IndexedRDD.
Thank you,
Mina




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/indexedrdd-and-radix-tree-how-to-search-indexedRDD-using-all-prefixes-tp25459.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: indexedrdd and radix tree: how to search indexedRDD using all prefixes?

2015-11-24 Thread Mina
This is what a Radix tree returns



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/indexedrdd-and-radix-tree-how-to-search-indexedRDD-using-all-prefixes-tp25459p25460.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Please add us to the Powered by Spark page

2015-11-24 Thread Reynold Xin
I just updated the page to say "email dev" instead of "email user".


On Tue, Nov 24, 2015 at 1:16 AM, Sean Owen  wrote:

> Not sure who generally handles that, but I just made the edit.
>
> On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal  wrote:
> > Sorry to be a nag, I realize folks with edit rights on the Powered by
> Spark
> > page are very busy people, but its been 10 days since my original
> request,
> > was wondering if maybe it just fell through the cracks. If I should
> submit
> > via some other channel that will make sure it is looked at (or better
> yet, a
> > self service option), please let me know and I will do so.
> >
> > Here is the information again.
> >
> > Organization Name: Elsevier Labs
> > URL: http://labs.elsevier.com
> > Spark components: Spark Core, Spark SQL, MLLib, GraphX.
> > Use Case: Building Machine Reading Pipeline, Knowledge Graphs, Content
> as a
> > Service, Content and Event Analytics, Content/Event based Predictive
> Models
> > and Big Data Processing. We use Scala and Python over Databricks
> Notebooks
> > for most of our work.
> >
> > Thanks very much,
> > Sujit
> >
> > On Fri, Nov 13, 2015 at 9:21 AM, Sujit Pal 
> wrote:
> >>
> >> Hello,
> >>
> >> We have been using Spark at Elsevier Labs for a while now. Would love to
> >> be added to the “Powered By Spark” page.
> >>
> >> Organization Name: Elsevier Labs
> >> URL: http://labs.elsevier.com
> >> Spark components: Spark Core, Spark SQL, MLLib, GraphX.
> >> Use Case: Building Machine Reading Pipeline, Knowledge Graphs, Content
> as
> >> a Service, Content and Event Analytics, Content/Event based Predictive
> >> Models and Big Data Processing. We use Scala and Python over Databricks
> >> Notebooks for most of our work.
> >>
> >> Thanks very much,
> >>
> >> Sujit Pal
> >> Technical Research Director
> >> Elsevier Labs
> >> sujit@elsevier.com
> >>
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark 1.6 Build

2015-11-24 Thread Madabhattula Rajesh Kumar
Hi,

I'm not able to build Spark 1.6 from source. Could you please share the
steps to build Spark 1.6?

Regards,
Rajesh


Re: DateTime Support - Hive Parquet

2015-11-24 Thread Cheng Lian
I see, then this is actually irrelevant to Parquet. I guess we could support 
Joda DateTime in Spark SQL's reflective schema inference for this, provided 
that this is a frequent use case and Spark SQL already has Joda as a direct 
dependency.


On the other hand, if you are using Scala, you can write a simple 
implicit conversion method to avoid all the manual conversions.
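
For example, a bare-bones sketch (the Event case class is purely
illustrative; the conversion kicks in when the case class instances are
constructed, i.e. before createDataFrame/toDF or any insert):

import scala.language.implicitConversions
import java.sql.Timestamp
import org.joda.time.DateTime

// Lets a Joda DateTime be used wherever a java.sql.Timestamp is expected.
implicit def jodaToSqlTimestamp(dt: DateTime): Timestamp =
  new Timestamp(dt.getMillis)

// Illustrative record type backing the DataFrame / Hive table.
case class Event(name: String, occurredAt: Timestamp)

val events = Seq(Event("login", DateTime.now()))  // DateTime converted implicitly here
// sqlContext.createDataFrame(events).write.saveAsTable("events")  // for example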


Cheng

On 11/24/15 7:25 PM, Bryan wrote:

Cheng,

That’s exactly what I was hoping for – native support for writing
DateTime objects. As it stands, Spark 1.5.2 seems to leave no option
but to do manual conversion (to nanos, Timestamp, etc.) prior to
writing records to Hive.

Regards,

Bryan Jeffrey

Sent from Outlook Mail

*From: *Cheng Lian
*Sent: *Tuesday, November 24, 2015 1:42 AM
*To: *Bryan Jeffrey; user
*Subject: *Re: DateTime Support - Hive Parquet

Hey Bryan,

What do you mean by "DateTime properties"? Hive and Spark SQL both
support DATE and TIMESTAMP types, but there's no DATETIME type. So I
assume you are referring to the Java class DateTime (possibly the one
in Joda)? Could you please provide a sample snippet that illustrates
your requirement?

Cheng

On 11/23/15 9:40 PM, Bryan Jeffrey wrote:
> All,
>
> I am attempting to write objects that include DateTime properties to
> a persistent table using Spark 1.5.2 / HiveContext.  In 1.4.1 I was
> forced to convert the DateTime properties to Timestamp properties.  I
> was under the impression that this issue was fixed in the default Hive
> supported with 1.5.2 - however, I am still seeing the associated errors.
>
> Is there a bug I can follow to determine when DateTime will be
> supported for Parquet?
>
> Regards,
>
> Bryan Jeffrey





RE: DateTime Support - Hive Parquet

2015-11-24 Thread Bryan
Cheng,

I am using Scala. I have an implicit conversion from Joda DateTime to
Timestamp. My tables are defined with Timestamp. However, explicit conversion
appears to be required. Do you have an example of an implicit conversion for
this case? Do you convert on insert or on the RDD-to-DataFrame conversion?

Regards,

Bryan Jeffrey

Sent from Outlook Mail
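
A minimal sketch of the implicit-conversion approach discussed above, assuming
Joda-Time on the classpath and Spark 1.5.x with a HiveContext; the case class,
field and table names are made up. The conversion applies while the case-class
records are built, i.e. before the RDD-to-DataFrame conversion rather than at
insert time:

  import java.sql.Timestamp
  import org.joda.time.DateTime

  // Implicit view: wherever a Timestamp is expected, a Joda DateTime can be supplied.
  implicit def jodaToSqlTimestamp(dt: DateTime): Timestamp = new Timestamp(dt.getMillis)

  // Hypothetical record type; Spark SQL's reflective schema inference maps
  // java.sql.Timestamp to the Hive/Parquet TIMESTAMP type.
  case class Event(id: String, eventTime: Timestamp)

  // With a HiveContext in scope (rawRdd is a placeholder RDD[(String, DateTime)]):
  // import hiveContext.implicits._
  // val events = rawRdd.map { case (id, dt) => Event(id, dt) }  // implicit view applies here
  // events.toDF().write.saveAsTable("events")

Because the conversion fires when the Event instances are constructed, the
DataFrame only ever sees Timestamp values and nothing changes on the write path.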



From: Cheng Lian
Sent: Tuesday, November 24, 2015 6:49 AM
To: Bryan;user
Subject: Re: DateTime Support - Hive Parquet


I see, then this is actually irrelevant to Parquet. I guess we could support Joda
DateTime in Spark SQL's reflective schema inference to get this, provided that
this is a frequent use case and that Spark SQL already has Joda as a direct
dependency.

On the other hand, if you are using Scala, you can write a simple implicit 
conversion method to avoid all the manual conversions.

Cheng
On 11/24/15 7:25 PM, Bryan wrote:
Cheng,
 
That’s exactly what I was hoping for – native support for writing DateTime 
objects. As it stands Spark 1.5.2 seems to leave no option but to do manual 
conversion (to nanos, Timestamp, etc) prior to writing records to hive. 
 
Regards,
 
Bryan Jeffrey
 
Sent from Outlook Mail
 
 

From: Cheng Lian
Sent: Tuesday, November 24, 2015 1:42 AM
To: Bryan Jeffrey;user
Subject: Re: DateTime Support - Hive Parquet
 
 
Hey Bryan,
 
What do you mean by "DateTime properties"? Hive and Spark SQL both 
support DATE and TIMESTAMP types, but there's no DATETIME type. So I 
assume you are referring to Java class DateTime (possibly the one in 
joda)? Could you please provide a sample snippet that illustrates your 
requirement?
 
Cheng
 
On 11/23/15 9:40 PM, Bryan Jeffrey wrote:
> All,
> 
> I am attempting to write objects that include a DateTime properties to 
> a persistent table using Spark 1.5.2 / HiveContext.  In 1.4.1 I was 
> forced to convert the DateTime properties to Timestamp properties.  I 
> was under the impression that this issue was fixed in the default Hive 
> supported with 1.5.2 - however, I am still seeing the associated errors.
> 
> Is there a bug I can follow to determine when DateTime will be 
> supported for Parquet?
> 
> Regards,
> 
> Bryan Jeffrey
 
 
 





Re: [streaming] KafkaUtils.createDirectStream - how to start streaming from checkpoints?

2015-11-24 Thread Deng Ching-Mallete
Hi,

If you wish to read from checkpoints, you need to use
StreamingContext.getOrCreate(checkpointDir, functionToCreateContext) to
create the streaming context that you pass in to
KafkaUtils.createDirectStream(...). You may refer to
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
for an example.

HTH,
Deng
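
A minimal sketch of that pattern for the direct Kafka stream (application name,
broker list, topic, batch interval and checkpoint directory below are all
placeholders). Note that createDirectStream and every DStream operation must be
defined inside the factory function, which only runs when no checkpoint exists
yet; on restart the offsets are recovered from the checkpoint instead of
auto.offset.reset:

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val checkpointDir = "hdfs:///user/spark/checkpoints/my-app"  // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("direct-kafka-with-checkpoints")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))                             // placeholder topic

    stream.foreachRDD { rdd => println(s"got ${rdd.count()} records") }  // real processing goes here
    ssc
  }

  // First run: createContext() builds a fresh context. Later runs: the context,
  // including the Kafka offsets, is recovered from the checkpoint and the
  // factory function is not called.
  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()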

On Tue, Nov 24, 2015 at 5:46 PM, ponkin  wrote:

> Hi,
>
> When I create a stream with KafkaUtils.createDirectStream I can explicitly
> define the position "largest" or "smallest" - where to read the topic from.
> What if I have previous checkpoints (in HDFS, for example) with offsets,
> and I want to start reading from the last checkpoint?
> In source code of KafkaUtils.createDirectStream I see the following
>
>  val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
>
>      (for {
>        topicPartitions <- kc.getPartitions(topics).right
>        leaderOffsets <- (if (reset == Some("smallest")) {
>          kc.getEarliestLeaderOffsets(topicPartitions)
>        } else {
>          kc.getLatestLeaderOffsets(topicPartitions)
>        }).right
> ...
>
> So it turns out that I have no option to start reading from
> checkpoints (and offsets)?
> Am I right?
> How can I force Spark to start reading from saved offsets (in
> checkpoints)? Is it possible at all, or do I need to store offsets in an
> external datastore?
>
> Alexey Ponkin
>
> --
> View this message in context: [streaming] KafkaUtils.createDirectStream -
> how to start streaming from checkpoints?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


RE: DateTime Support - Hive Parquet

2015-11-24 Thread Bryan
Cheng,

That’s exactly what I was hoping for – native support for writing DateTime 
objects. As it stands Spark 1.5.2 seems to leave no option but to do manual 
conversion (to nanos, Timestamp, etc) prior to writing records to hive. 

Regards,

Bryan Jeffrey

Sent from Outlook Mail



From: Cheng Lian
Sent: Tuesday, November 24, 2015 1:42 AM
To: Bryan Jeffrey;user
Subject: Re: DateTime Support - Hive Parquet


Hey Bryan,

What do you mean by "DateTime properties"? Hive and Spark SQL both 
support DATE and TIMESTAMP types, but there's no DATETIME type. So I 
assume you are referring to Java class DateTime (possibly the one in 
joda)? Could you please provide a sample snippet that illustrates your 
requirement?

Cheng

On 11/23/15 9:40 PM, Bryan Jeffrey wrote:
> All,
>
> I am attempting to write objects that include a DateTime properties to 
> a persistent table using Spark 1.5.2 / HiveContext.  In 1.4.1 I was 
> forced to convert the DateTime properties to Timestamp properties.  I 
> was under the impression that this issue was fixed in the default Hive 
> supported with 1.5.2 - however, I am still seeing the associated errors.
>
> Is there a bug I can follow to determine when DateTime will be 
> supported for Parquet?
>
> Regards,
>
> Bryan Jeffrey





Re: [streaming] KafkaUtils.createDirectStream - how to start streaming from checkpoints?

2015-11-24 Thread Понькин Алексей
Great, thank you.
Sorry for being so inattentive :) I need to read the docs more carefully.

-- 
Yandex.Mail - reliable mail
http://mail.yandex.ru/neo2/collect/?exp=1=1


24.11.2015, 15:15, "Deng Ching-Mallete" :
> Hi,
>
> If you wish to read from checkpoints, you need to use 
> StreamingContext.getOrCreate(checkpointDir, functionToCreateContext) to 
> create the streaming context that you pass in to 
> KafkaUtils.createDirectStream(...). You may refer to 
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
>  for an example.
>
> HTH,
> Deng
>
> On Tue, Nov 24, 2015 at 5:46 PM, ponkin  wrote:
>> Hi,
>>
>> When I create a stream with KafkaUtils.createDirectStream I can explicitly
>> define the position "largest" or "smallest" - where to read the topic from.
>> What if I have previous checkpoints (in HDFS, for example) with offsets, and
>> I want to start reading from the last checkpoint?
>> In source code of KafkaUtils.createDirectStream I see the following
>>
>>  val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
>>
>>      (for {
>>        topicPartitions <- kc.getPartitions(topics).right
>>        leaderOffsets <- (if (reset == Some("smallest")) {
>>          kc.getEarliestLeaderOffsets(topicPartitions)
>>        } else {
>>          kc.getLatestLeaderOffsets(topicPartitions)
>>        }).right
>> ...
>>
>> So it turns out that I have no option to start reading from
>> checkpoints (and offsets)?
>> Am I right?
>> How can I force Spark to start reading from saved offsets (in checkpoints)?
>> Is it possible at all, or do I need to store offsets in an external datastore?
>>
>> Alexey Ponkin
>>
>> 
>> View this message in context: [streaming] KafkaUtils.createDirectStream - 
>> how to start streaming from checkpoints?
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark 1.6 Build

2015-11-24 Thread Prem Sure
You can refer to:
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
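
For reference, the steps documented there boil down to checking out the 1.6
branch of the Spark source (branch-1.6 at the time of writing) and running
./build/mvn -DskipTests clean package from the source root, optionally with
profiles such as -Pyarn -Phadoop-2.6 for your Hadoop version; treat these flags
as a sketch and take the exact profile names from the linked page.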


On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi,
>
> I'm not able to build Spark 1.6 from source. Could you please share the
> steps to build Spark 1.6?
>
> Regards,
> Rajesh
>


Re: Experiences about NoSQL databases with Spark

2015-11-24 Thread Ted Yu
You should consider using HBase as the NoSQL database.
With regard to 'The data in the DB should be indexed': HBase retrieves rows by
row key, so you need to design the schema (in particular the row key) carefully
so that retrieval is fast.

Disclaimer: I work on HBase.
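
To make the schema point concrete, below is one possible shape for the
fetch-and-update step of such a streaming job, written against the HBase 1.x
client API. The table name, column family, row-key layout and the
(entity, bucket, delta) record type are assumptions for illustration, not a
recommended design:

  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
  import org.apache.hadoop.hbase.util.Bytes
  import org.apache.spark.rdd.RDD

  object HBaseAggregates {
    // Row key "<entityId>#<hourBucket>": the running aggregate for an entity and
    // time bucket is then a single point Get instead of a scan.
    def rowKey(entityId: String, hourBucket: String): Array[Byte] =
      Bytes.toBytes(s"$entityId#$hourBucket")

    // Called from foreachRDD in a streaming job: read the current aggregate,
    // add the incoming delta, and write the new value back.
    def updateAggregates(rdd: RDD[(String, String, Long)]): Unit =
      rdd.foreachPartition { records =>
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("aggregates"))
        records.foreach { case (entityId, hourBucket, delta) =>
          val key = rowKey(entityId, hourBucket)
          val result = table.get(new Get(key))
          val current =
            if (result.isEmpty) 0L
            else Bytes.toLong(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("sum")))
          val put = new Put(key)
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("sum"), Bytes.toBytes(current + delta))
          table.put(put)
        }
        table.close()
        conn.close()
      }
  }

For plain counters, HBase's atomic Increment can replace the read-modify-write
round trip above.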

On Tue, Nov 24, 2015 at 4:46 AM, sparkuser2345 
wrote:

> I'm interested in knowing which NoSQL databases you use with Spark and what
> are your experiences.
>
> On a general level, I would like to use Spark streaming to process incoming
> data, fetch relevant aggregated data from the database, and update the
> aggregates in the DB based on the incoming records. The data in the DB
> should be indexed to be able to fetch the relevant data fast and to allow
> fast interactive visualization of the data.
>
> I've been reading about MongoDB+Spark and I've got the impression that
> there are some challenges in fetching data by index and in updating
> documents, but things are moving so fast that I don't know whether these
> issues are still relevant. Do you see any benefit in using HBase with
> Spark, given that HBase is built on top of HDFS?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Experiences-about-NoSQL-databases-with-Spark-tp25462.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Re: Please add us to the Powered by Spark page

2015-11-24 Thread Sujit Pal
Thank you Sean, much appreciated.

And yes, perhaps "email dev" is a better option since the traffic is
(probably) lighter and these sorts of requests are more likely to get
noticed. Although one would need to subscribe to the dev list to do that...

-sujit

On Tue, Nov 24, 2015 at 1:16 AM, Sean Owen  wrote:

> Not sure who generally handles that, but I just made the edit.
>
> On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal  wrote:
> > Sorry to be a nag, I realize folks with edit rights on the Powered by
> Spark
> > page are very busy people, but it's been 10 days since my original
> request,
> > was wondering if maybe it just fell through the cracks. If I should
> submit
> > via some other channel that will make sure it is looked at (or better
> yet, a
> > self service option), please let me know and I will do so.
> >
> > Here is the information again.
> >
> > Organization Name: Elsevier Labs
> > URL: http://labs.elsevier.com
> > Spark components: Spark Core, Spark SQL, MLLib, GraphX.
> > Use Case: Building Machine Reading Pipeline, Knowledge Graphs, Content
> as a
> > Service, Content and Event Analytics, Content/Event based Predictive
> Models
> > and Big Data Processing. We use Scala and Python over Databricks
> Notebooks
> > for most of our work.
> >
> > Thanks very much,
> > Sujit
> >
> > On Fri, Nov 13, 2015 at 9:21 AM, Sujit Pal 
> wrote:
> >>
> >> Hello,
> >>
> >> We have been using Spark at Elsevier Labs for a while now. Would love to
> >> be added to the “Powered By Spark” page.
> >>
> >> Organization Name: Elsevier Labs
> >> URL: http://labs.elsevier.com
> >> Spark components: Spark Core, Spark SQL, MLLib, GraphX.
> >> Use Case: Building Machine Reading Pipeline, Knowledge Graphs, Content
> as
> >> a Service, Content and Event Analytics, Content/Event based Predictive
> >> Models and Big Data Processing. We use Scala and Python over Databricks
> >> Notebooks for most of our work.
> >>
> >> Thanks very much,
> >>
> >> Sujit Pal
> >> Technical Research Director
> >> Elsevier Labs
> >> sujit@elsevier.com
> >>
> >>
> >
>